Notes from Industry
How to apply the true Data as Code philosophy to achieve close to zero production defects using the tried & true methods from software development on data.
Data teams spent close to 60% of their time on operational things, not producing value. They also experience a large level of bugs in their data systems, according to the datakitchen study & Gardeners survey. Yet, in the software development world, we already have the philosophies in place that allow high-performing teams to deliver both quickly and at a high level of quality, without any of these problems.
So why don’t we carry over the exact same practices? After all, they originate in the lean manufacturing world and already got carried over to software. This is what should be called “Data as Code”.
Let’s see how the tools & practices from the software world transfer over to a simple example, an EL (T) workflow and enable our exemplary team to:
- Reduce Mean Time To Recovery (MTTR) to just minutes (or seconds).
- Increase frequency of “data deployment” & thus tests
- Change Fail Percentage (relative number of production errors) to around 0%
If you’re wondering why we want to look at these three metrics, these are three of the four metrics that distinguish high-performing teams from average ones. Sounds like a good goal right?
Warning, this is true “data as code”, not just “ah right, please version the data. It will look quite a bit strange, that’s why I’m using an example. To highlight why this data production flow will produce close to 0% production defects.
Our Example, Data Ingestors Dan, Dave, Dana
Let me introduce three friends: Dan, Dave, and Dana. All three of them are data ingestors. Machines. Dan & Dave are instances of a “REST Ingestor” and Dana is an instance of a “Salesforce API Ingestor”.
All three of them are allowed to work with the data git, a git for data that could be e.g. hosted by a solution like lakeFS. They can work with it, just like any human developer would with their codebase.
Dan, Dave & Dana are excellent developers. They adhere to all the good principles of Continuous Integration, commit often, test locally, and so on …
Side Note: Why am I mentioning lakeFS? Because as other tools, lakeFS operations are not actual duplications of the data, but mostly metadata operations. As such, the copy operations mentioned below actually only take seconds, even for terabytes of data.
The EL (T) Data Flow
Our data flow is very simple. We take one REST API for “Orders”, and the Salesforce API, join these two sources together using some common transformation tool, and get some statistics on customers segmented by typical customer-related data.
Our Reference Workflow Performance
Our reference workflow is simple:
- The data ingestors are set to ingest data every hour.
- The REST ingestors have only one source, but it’s big so it’s parallelized to N=2
- The ingested data lands in a “raw table” and is tested right after the run using a tool like great-expectations.
- If there’s an error in one of the raw sources, only that part is halted and a developer has to check on it and possibly fix it.
If something breaks in production, e.g. a transformation that creates the “customer table” (yes I know we could have a materialized view on top of it) then we just lost our customer data. Recovery means deleting data/fixing some data issue and then rerunning a transformation => measured in 30 mins to an hour. In the average company, this happens a couple of times a month.
That already describes the metrics mentioned above in a typical data setup.
Now, let’s see how the smart versions of Dan, Dave & Dana perform differently than our reference workflow!
— — — —
Dan, Dave & Dana …
Dana Commits Salesforce API Customer Data
Dana knows the key to good data is “committing often”. This way errors get caught quickly and are contained in a small “blast radius”. So she runs every 5 minutes.
Before she “runs”, she clones the “data repository” into her local system (not really local, but let’s keep the analogy), by creating a branch in lakeFS “DanaRunX2021–01–01–101” which is protected and only visible for her.
When she runs, she ingests data and then commits it to her personal branch.
Instead of pushing the data into the raw table of the production system, she keeps them “local” for a minute and runs tests using this cool tool called “great-expectations” which runs a bunch of basic tests like “non-nulls”, etc. and checks the distribution of the raw data for outliers and changes.
First Improvement Local Testing
What happens when a test breaks? Dana can’t do anything about it (…yet, after all, she’s not a machine learning system or something…), so it simply stays that way, and the human developers get a notification that DanaRunX2021–01–01–101 failed.
The human developer can now check the branch out, and investigate the problem without ever interfering with the production data.
What happens to the production data?
- All systems keep running.
- In another 5 minutes, Dana takes the next chunk of data to DanaRunX2021–01–01–102 and runs her new tests.
- If the data bug ain’t fixed yet, the data is not in the master branch, so the tests go through and the new data is there.
- Everyone is happy.
Feels kind of like a staging area where corrupt data just isn’t processed right?
What happens then?
Pushing Onto Master
Dana then goes on to push her data into the original repository. A merge request is automatically created & approved, the tests run again, and the data is pushed into the regular promotion order.
Let’s call the data a dartifact, because it’s now really a “data artifact” or more like a “dart arrow” that goes from the production environment back to some central git repository together with a CI server.
Onto the Stage Environment
The next step is NOT the production environment, after all, Dana is a great developer. She knows she has to integrate her data into the other data, which she just did, but also into the other components.
So the data once pushed into the central repository, is promoted to a staging environment. This staging environment also serves for testing and integration of all software components. As such, it has the latest version of our data transformation tool running.
Now our CI server can execute a bunch of integration tests, meaning running the data transformation tool and executing its tests.
For this workflow to function with lakeFS we got to abuse the branching mechanism a bit, and call at least one branch “env-staging”, and give it special access right protecting like it’s only accessible via the integration components, etc. Of course this could also be handled on the lakeFS API level, which would probably be the best way.
Second Improvement Integration Testing in Staging Environment
What happens when a test breaks?
The developers get notified on the regular information radiator at the display which any team using CI should have at a central place.
Stuff gets fixed, and again we have no problems in our production system. Other data just flows past the buggy part.
Most important, we’re also integrating the latest dbt models against the new data, before running either into production.
Wait, But Our Data Is Too Large To Run Transformations Twice
What do we do, if the transformation takes too long? There are multiple options then!
We already coined the data “dartifact”. This means, we can treat it like any other software artifact which goes through our promotion order. We could for instance commit our data together with some kind of configuration that only together form the dartifact.
A possible configuration could be: A sampling mechanism on the real data keeping the distribution intact, but still, make it much smaller in the staging environment.
Or it could be a data mask masking personal information if needed on the staging environment.
Promote to Production
Finally, Dana can promote her data to production. Thanks to the awesome automation available today, this actually only took a few minutes.
But since Dana is still a smart developer, she doesn’t promote straight to the production environment. Instead, she does a “Blue-Green Deployment’’, meaning she pushes her new dartifact into a second branch in lakeFS called “production-blue”.
She then runs a bunch of smoke tests on the data and only then slowly switches the production traffic from “production-green” to “production-blue”.
Third Improvement Prod Smoke Tests and Instant Rollback
What if something breaks? If a smoke test fails, all is good, no production traffic is affected.
If it should truly break, then we still have a super easy rollback option which we can implement in any CI on top of lakeFS. It’s to simply redeploy either the old dartifact to our production branch or to switch the production traffic back to “production-green” which was the old working branch.
We still got to roll back the master branch to protect future merges from killing the production system again.
What About Dan and Dave?
Dan and Dave follow the same workflow. Except, someone forgot to put the “local testing” option into their system.
A problem? No, because once they commit their data to master and push it to the original repository, it gets integrated anyways and thus tested. Someone will catch it, implement the local testing and be happy.
Additionally, since the two work in parallel, data really needs to be integrated and tests have to be run again in the central repository where their two data versions are merged together.
Look at the Numbers
Take a look at what this workflow accomplishes, compared to our reference workflow:
- Mean Time To Recovery: in the reference workflow iit s 30mins to an hour. For us, if data really breaks, we simply have to switch a branch in lakeFS. It’s a metadata operation meaning it takes seconds to get to a running system. Because of intensive testing beforehand, and the small chunks, fixing is also extremely quick.
- Change Failure Percentage: since we got 4 test layers in our new workflow, compared to 1–2 in the reference workflow we decrease our Change Failure Percentage to close to 0.
- Frequency of Data Deployments: in our reference workflow is once per hour, here it’s every 5 minutes. And yes it’s true, it could be 5 minutes also in the reference workflow if we let it be. Going to 5 mins means decreasing the time it takes to fix something because the possible error area is much smaller.
What About The Transformation Tool?
Of course, now that we got data as code, just as with infrastructure as code, we’ll have to integrate things together. That means the dbt tool, once a change is committed, will also trigger the integration tests against the data.
The same goes for new transformations pushed into the repository.
Ain’t this Crazy Expensive & Slow?
As mentioned above, the cost is adjustable and relative. It’s also pretty clear that there’s no point in actually moving around the data all the time. Instead, a tool like lakeFS or any other similar tool is perfectly suited because these tools work with metadata operations which move no data at all.
Of course, just as with usual software development for every situation there is a sweet spot between commit size and the interval to get the metrics that satisfy what we want for our data.
There might still be a need for some technological innovations like a better CI workflow for data, but the process can already be implemented well with the technologies that are already out there.
True, if your goal is to get the data out as cheaply as possible, then this workflow is not what you’re looking for….
but given that “data is the most valuable asset of the world”, I’m pretty sure a little more money spent on delivering quickly and in high quality is worth the investment.