No, DaC is not just versioning data! It’s applying the whole software engineering toolchain to data. For that, we need principles.
This post is part of a small series beginning with: Data as Code — Achieving Zero Production Defects for Analytics Datasets.
Data as Code is a simple concept, just like Infrastructure as Code. It simply says: "Treat your data as code". And yet, after IaC appeared on the ThoughtWorks Radar in 2011, it still took roughly 10 years to settle in, and it remains in an uneasy spot where IaC advocates feel they need to remind people of the following:
"… Saying 'treat infrastructure like code' isn't enough; we need to ensure the hard-won learnings from the software world are also applied consistently throughout the infrastructure realm."
So much for that. Since I think we shouldn’t wait another 10+ years to get high-quality data applications fast, I wrote this article (series).
So what is DaC?
Data as Code: taking the good practices we learned in software (versioning, automated testing, CI systems, …) and applying them to all fields of data: operational data, analytical data, data science data, and so on, with the goal of delivering high-value data fast.
My fear is that Data as Code (DaC) will take even longer than IaC to take off, with a huge welfare loss for all of us. I think three important concepts are missing from the little discussion that has happened so far around DaC: "dartifacts", "human-aided machine engineering", and the "integration² of data".
I'll explain these concepts, why they are so important, and the principles we should carry over to the world of data.
And even though it'd probably be reasonable to talk about all the ways the "old model fails", I'd rather focus on the new world, the new concepts, and the new opportunities.
What happened? Old Ways in a New World
This is what has happened so far, and what will happen over the next years…
Data in our applications is growing exponentially. It's not just used operationally to store objects and "remember things", but also analytically to analyze things and make better decisions. Thirdly, it is used in an "analytically operational" way, powering machine learning applications. The latter two applications have produced a Cambrian explosion of data uses.
So both the amount of data available and the value of data are on an exponential growth curve.
In my eyes, this will lead to an era where data will be the electricity of the future, powering simply everything.
The problem: we keep on working in old ways in this new world, whereas we should apply to data the gold standard of manufacturing techniques we have already learned to apply so well to software.
Concepts in the DaC World
People talk about "data as code", but so far I haven't seen anything close to what, in my mind, must happen. What is already moving in the right direction is the understanding that we need to version data (see e.g. tools like dvc or lakeFS) and test it (see Monte Carlo or great-expectations).
I think it’s because three important concepts are missing from the discussion around data as code. Let’s take a look at them.
Dartifacts

Software artifacts are things created in the software development process. Usually, a developer writes some code and pushes it into a central repository, then into a CI/CD system where it gets built. Then it's delivered to different "stages" for testing and finally gets promoted to production.
A dartifact, on the other hand, is also an artifact, but not of the software development process in this sense. It's something created by the system deployed in any of these stages that gets sent back to the central place, like a dart. A special example, of course, is data.
Note that the artifact created by the human acts as a set of "guardrails", whereas the dartifact turns the whole into a self-evolving system. This brings us to the next point, the software engineering of the future.
Human Aided Machine Engineering
So, a service is supposed to commit data? That's great, but what if something breaks? A machine cannot fix things on its own, right? Actually, we're already using the right workflow for that exact problem in the framework of "CD4ML".
For continuous/online learning machine learning models, ThoughtWorks advocates a somewhat similar workflow. It looks something like this:
- We set up a testing & evaluation pipeline in our CD system for our machine learning model.
- We introduce tests like “if the new model doesn’t beat the last one, don’t use it, otherwise automatically deploy it”.
- We deploy a new machine learning model. We also deploy the data collector with it.
- Whenever there’s a batch of new data (say a day’s worth) the data collector commits the data to git, and triggers the pipeline, evaluates a new model, and possibly publishes it.
Of course, when the pipeline fails, a machine learning engineer gets notified and can work on the model. After all, if it gets worse with more data, something is likely broken. But they now have the chance to simply provide better guardrails and let the system handle things from there on, and thus continuously improve itself.
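To make the workflow above concrete, here is a minimal sketch of such an evaluation gate. The function names, the callable shapes, and the scoring logic are my own illustrative assumptions, not part of any specific CD4ML tooling:

```python
def should_promote(new_score, current_score, min_improvement=0.0):
    """Gate in the CD pipeline: only promote a retrained model if it
    beats the one currently in production."""
    return new_score > current_score + min_improvement


def on_new_data_batch(train, evaluate, promote, notify, current_score):
    """What the data collector triggers after committing a batch of data:
    retrain, evaluate, and either publish the model or alert a human."""
    model = train()
    score = evaluate(model)
    if should_promote(score, current_score):
        promote(model)
        return True
    # The pipeline "fails": a human gets notified and provides better
    # guardrails instead of hand-fixing every run.
    notify(model, score)
    return False
```

In a real setup, `train` and `evaluate` would be the training and evaluation steps of your pipeline, and `notify` the alerting hook that pulls in the machine learning engineer.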
That is human-aided machine engineering: machines doing the "development" while humans come in and do the "heavy lifting". So far I haven't seen that topic covered in any of the discussions, but I think it's an important one, because I don't see any other future than exactly that for all areas of development.
Ever thought about that? Why in the world would you as a developer solve a problem that has already been solved 10000 times before, and is publicly available somewhere in the millions of lines of code online? That sounds very much like something a machine should do…
The Integration² of Data

Before Infrastructure as Code, there was just software. With IaC, there were now additional integration points. The question, of course, is: since the infrastructure exists simply to run the software, how do I go about the software when I treat the infrastructure as code? But that problem has only two options:
- Take infrastructure + software as one unit
- Take them as separate units
For Data as Code, on the other hand, I feel like we got a lot more options, simply because it’s the third in the round, and because it is actually used in a variety of ways.
Three Practical Data Types
I practically see three different "data types", or "data application categories", in use today. Operational data, whose main purpose is to remember things. Then there is analytical data, whose main purpose is to help humans make better decisions. And finally analytically operational data, usually treated just as analytical as well, whose purpose is to automatically help someone make better decisions (be that a recommendation system that helps you make better movie choices, or an actual decision support system with forecasts, etc.).
I like to separate the last two simply because only analytically operational data apps have so far come close to data as code as I understand it, using a CD4ML/CML workflow.
Principles for DaC
The principles in DaC are almost all 1–1 translations of the principles of good modern-day software development. Yet if we look at the data world, they aren't used at all in any of the three data areas outlined above…
Why should we do so? Because most of these principles originally come from the lean manufacturing world, where the focus is simply on increasing both output & quality of a flow at the same time. It’s what made Toyota, and it’s what made most modern tech giants. So why should we ignore what works so well if we care about the data flow?
Enter, the principles.
(1) Everything in Version Control
Sounds like a no-brainer, and it's the one thing that is often called "data as code", but as you can see, it covers just 5% of what Data as Code is.
The principle: Keep your data in version control. Everything about it.
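As a toy illustration of the principle: content-addressed snapshots are the core mechanism behind tools like dvc and lakeFS. The sketch below shows the idea in a few lines; the `.data_versions` store layout is invented for this example:

```python
import hashlib
import shutil
from pathlib import Path


def commit_dataset(path, store=".data_versions"):
    """Snapshot a data file under its content hash and return the hash.
    The hash acts like a commit id: same bytes, same version."""
    data = Path(path).read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    dest = Path(store) / digest
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(data)
    return digest


def checkout_dataset(digest, path, store=".data_versions"):
    """Restore a previously committed version of the data file."""
    shutil.copy(Path(store) / digest, path)
```

Real tools of course add deduplication, remotes, and metadata on top, but the mental model of "commit a version, check out a version" is the same.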
(2) Small Commits — Small Pieces of Data
For some reason, we “batch data” into huge batches. Yet from the software world we already know that if we commit something big, our chances of breaking something are very large, and the effort to fix it is large as well.
The principle: commit small DaC changes.
This is a change in mindset. Whereas before we retrained our machine learning model with new data once we got lots of new data, we now advocate for constant retraining.
Keep in mind the trade-off between speed and size, but that's about it. I think people are usually afraid of "committing data often" because they think things will break more often. But the opposite is true: the smaller the new or changed dataset, the easier it is to fix.
This, of course, only works if "committing/ingesting/… often" does not lead to a broken production system, so let's take a look at the other principles.
(3) Local Unit Tests by the Creator Of The Code
We all run unit tests on our code before pushing it to a central repository because we don’t want to commit broken code.
Yet data ingestors and other systems routinely ingest data right into the production system, without any tests.
The principle: The Creator of the DaC dartifact also tests it “locally”, before pushing it to a central system.
As an example, this would mean your online machine learning system runs a test, e.g. checking the statistical distribution of the new data. If that test fails, it will not push the data into the system for training but rather put the new data piece "on the side".
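A minimal sketch of such a local test, assuming a simple drift check on the batch mean (a real setup would use a richer test suite, e.g. with great-expectations):

```python
import statistics


def passes_distribution_check(batch, ref_mean, ref_std, max_shift=3.0):
    """Reject the batch if its mean drifts more than max_shift reference
    standard deviations away from the reference mean."""
    if not batch:
        return False
    return abs(statistics.mean(batch) - ref_mean) <= max_shift * ref_std


def ingest(batch, ref_mean, ref_std, push, quarantine):
    """Push the batch to the central system only if the local test
    passes; otherwise put it "on the side" for a human to inspect."""
    if passes_distribution_check(batch, ref_mean, ref_std):
        push(batch)
        return True
    quarantine(batch)
    return False
```

The point is that the *creator* of the data runs this before anything reaches the central system, just like a developer runs unit tests before pushing.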
(4) Walking the Promotion Order
Software builds get deployed to a sandbox first, an integration environment later, and finally to a production environment.
Yet data ingestors, some online learning machine applications, and operational data alike are all dumped right into the production system.
The principle: DaC walks the promotion order, just like everything else.
That might sound strange for operational data, but I feel there is value here. Think about it: you're probably taking a backup of your big production database. Why would you simply assume the data inside it is not corrupt? Why would you assume you can use it to restore without testing that out first on an integration stage? Why would you assume it still integrates with everything else?
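A sketch of what walking the promotion order could look like for a batch of data; the stage names and the shape of the test callables are assumptions for illustration:

```python
STAGES = ["sandbox", "integration", "production"]


def promote_batch(batch, stage_tests, deploy):
    """Move a data batch through the stages in order, deploying to a
    stage only after all of that stage's tests pass. Returns the last
    stage the batch reached, or None if it failed immediately."""
    reached = None
    for stage in STAGES:
        if not all(test(batch) for test in stage_tests.get(stage, [])):
            break
        deploy(stage, batch)
        reached = stage
    return reached
```

A batch that breaks on the integration stage simply never reaches production, exactly like a software build that fails its stage tests.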
(5) Same Data on All Stages
A corollary to that is this principle: we keep the same DaC on all stages. Usually, companies only keep a small sample or fake data in the non-production stages. I'm advocating for a 180 on this. Why? Because we want to test our dartifact in the same circumstances it will face in production. Otherwise, we cannot rely on our tests, we reduce the quality of our data, produce more bugs, and end up with slower data delivery times.
Of course, just as with software, we can apply an environment-aware configuration, which could be:
- A "sampling" configuration on the lower stages, or
- A masking configuration to mask PII data
For the size-aware reader: there is no need to make this happen physically. It doesn't mean we have to copy terabytes of data around all the time; it can happen as a metadata operation without any copying at all. A bonus: if we only use metadata operations, this should be GDPR compliant from my perspective (although I'm not a lawyer).
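As a sketch of such an environment-aware configuration (the column names, the masking scheme, and the function names are illustrative assumptions):

```python
import hashlib


def mask_value(value):
    """Replace a PII value with a stable pseudonym, so joins across
    tables still work on the lower stages."""
    return hashlib.sha256(str(value).encode()).hexdigest()[:12]


def apply_env_config(rows, env, pii_columns=("email", "name"), sample_every=1):
    """Production sees the raw data; lower stages see the *same* dataset,
    but with PII masked and optionally sampled down."""
    if env == "production":
        return rows
    return [
        {k: mask_value(v) if k in pii_columns else v for k, v in row.items()}
        for i, row in enumerate(rows)
        if i % sample_every == 0
    ]
```

Because the pseudonyms are stable, the lower stages still behave like production for testing purposes, without ever exposing the raw PII there.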
(6) Trunk-Based Development, Few Branches
The principle: We keep only 1–2 branches, including the trunk/master for our data.
This ensures that we actually enforce the principles above, like local testing & walking the promotion ladder.
(7) We Go With the Axis of Change
The principle: We cut our DaC, and data services along the axis of change, not orthogonal to it.
It means we try to cut things small, but along domain boundaries. We try not to separate the data ingestion from our cool new machine learning system; after all, they have to work together. It means we don't cut our BI system into "ingest/transform/clean/store/save" bits, but rather into "business unit 1/business unit 2/…" bits.
Why do this? First of all, following all the ideas of the data mesh concept, it’s the right thing to do. Second, if we channel the ingestion of data for our machine learning component through a large-scale data lake, we have next to no chance of testing the whole integrity of the system. We will have no chance of actually ensuring the quality of the actual value-producing unit, the machine learning component. We will just be able to ensure “piecewise” integrity, which is very different from value-producing unit integrity.
On the other hand, if we put all the pieces together into one architectural quantum, it’s pretty easy to ensure the complete integrity of the component and thus actually let it improve continuously all day long.
(8) Integrate Code Base & Components Often
The principle: Integrate your data often, together with the application that creates it, the one that stores it, and its infrastructure.
I think for anyone who has ever restored a data backup this should sound familiar: things simply don't work the same way after restoring a backup, because the things around the data evolved while the data took a step back. So it's essential to integrate the data again and again with everything else, the infrastructure and the software components, just as we keep integrating all software components & infrastructure with each other.
(9-…) I’m sure there are more
What's So Different About Data? Why Isn't DaC More Prominent?
I think people are already approaching a mindset very close to this because data is becoming the most valuable resource on this planet, but the pipes to work on it haven’t caught up to that.
Two things are essentially different from what has so far happened with code and Infrastructure as Code:
- Machines will do most of the committing here.
- The sheer size of data.
But then again, infrastructure is also pretty different from code. Data at least only differs from code in size, whereas infrastructure actually has a different appearance and thus needs an extensive mapping of code to actual physical infrastructure.
And as explained above, I don’t think machines working on code are actually a new thing.
So maybe Data as Code is a mind shift, but I don’t feel like it’s a huge one, and the benefits in terms of enhanced quality & speed for this new most valuable good are huge.
An open question remains: do we need an abstraction layer on top of the data, as we do in IaC, for instance for masking data? I don't know; it remains to be seen what will work and what won't.
What’s still missing — What do we need to develop?
I got excited about this topic when I stumbled over lakeFS, because from what I can see, it looks like they might be heading in a direction where they:
1. Are able to simulate distributed data, by essentially
2. Splitting the read model of data from the storage of data.
It seems to me this is exactly what we need to make all the principles work like a charm, because, as mentioned, the "size of data" doesn't allow for frequent copying, especially when we aim to speed things up, not slow them down.
This will be like the abstraction layer on top of the infrastructure which we use to do IaC.
What is still missing? Technically, everything is possible. But most things are not yet simple. It’s still hard to fork data inside a data warehouse or run data through a CI system.
Where should you go from here? Further Reading
If this caught your attention, then a good place to go next is applied articles. For a read on an analytically operational data application, you should really read the piece by ThoughtWorks.
Since there was nothing in the analytical space, I took some time to write a companion article about using data as code to get to zero production defects in the data world.
I haven’t yet seen anything in the operational data world, but I’m sure there are smart people already doing exactly this to secure their backups, etc.
This is more of a thought experiment and a step into getting people to think about data as code than it is based on the experience of building working models. So if you have any examples of something like this in action, I’d love to hear about them!