Domain-Driven Design and microservices have changed the way software engineers work, and I think they can multiply the productivity of data teams as well. But I also think they have to be used slightly differently, because the typical centralized data team is in a very different situation than the typical (decentralized) software development team.
This article explores that different way of working with your central data system. It offers a simple, iterative approach, just as with microservices: start with the default monolithic one-pot system that captures everything, then slowly break out pieces once you realize the time is right. But you’ll have to throw away some of your biases because, just as in the microservices world, you will intentionally end up with duplicate data and duplicate work.
I’m trying to do something different from the data mesh approach here. Data mesh would try to get your data team to just “own one domain”, whereas this article acknowledges that the data team simply has a lot of domains from other teams inside its data systems.
In applying this approach, we aim to increase our flexibility to iterate quickly on our most valuable data applications without carrying the weight of all the other parts of the system, be that a central data lake, a bunch of reports, shared data models, or anything else we share and thus depend on. This flexibility helps us focus 80% of our effort on the 20% of our system that is most valuable, thus delivering a lot more value over time.
You, the new data developer at Streaming Inc., just created a fancy new North Star dashboard for the CEO. He’s been asking for it for weeks and he’s excited!
You added it to your central data system and reused the data sources you had already staged. Great idea: this way you only put load on the production systems once, and the data are coherent with the other dashboards in your system.
What we’ve created here is a small monolith. That’s fine as we’re just starting out, but let us now explore what usually happens next…
The CEO: “Ah, I love the dashboard, but I need it on an hourly schedule.”
The problem? One of the other sources is pretty slow, you’re not supposed to call the production Oracle database more than once each day, at least not for multiple tables…
But wait, just as you’re trying to figure this problem out, you get another call.
The CEO again: “Oh no, the dashboard doesn’t work! What’s up here?” You pull up the dashboard and indeed, it says something about an error in some “mkt_channel”.
Turns out, the Marketing team changed something with the “mkt_channel” attribution they use in Google Analytics, which caused the model to crash, even though technically you don’t need this in the North Star dashboard.
Just as you’re about to commit the changes to add the new channel to the system, you get the third call.
The CEO one more time: “Btw, I got these 10 changes for you to make, can you please make that happen now?”
While you’re working on all these new changes, you realize you haven’t checked on the other reports & dashboards. Turns out, you broke a bunch of them while working on this important dashboard.
And again, you’ll have to fix lots of stuff, add a bunch of tests, and run all tests together for every model, because by now you have one rather large monorepo.
… The dependencies are causing trouble, a lot of trouble, because things are entangled. This is not a problem if the business value derived from the data system is evenly distributed, but in this case it is not, and it usually isn’t! The CEO actually does derive very important insights from his dashboard. So let’s try to untangle things, so that we can work quickly on the dashboard (& possibly other things in the future) with full power.
Cutting Stuff Down Into Apps/Domains
This entanglement happens to software engineers all day long. Do you know what they do? They break out a microservice or a domain, isolate the valuable part, and iterate quickly on it.
By doing so, we gain the flexibility of changing just one coherent business component or app, but increase the complexity inside the complete system. That’s because we’re going from one monolithic DAG to two separate DAGs, or apps, or whatever you choose to bundle this business logic into.
This is how we can cut the system into smaller pieces.
Notice that by breaking things up, we import a few things multiple times, once in each component, but we are able to reduce the “columns” (or attributes, …) that each component imports.
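To make the split a bit more tangible, here is a minimal sketch in Python. The source names, the column names, and the two app definitions are all hypothetical, invented purely for illustration. The only point is that each app declares its own, smaller set of imports, while some sources end up being imported twice.

```python
# Hypothetical column inventory of our sources (all names are made up).
SOURCES = {
    "users": ["user_id", "user_name", "signup_date", "country"],
    "streams": ["stream_id", "user_id", "started_at", "minutes_watched"],
    "ga_sessions": ["session_id", "user_id", "mkt_channel", "session_start"],
}

# The monolith imports every column of every source, once.
MONOLITH_IMPORTS = {source: list(columns) for source, columns in SOURCES.items()}

# After the split, each app lists only the columns it actually needs.
NORTH_STAR_IMPORTS = {
    "users": ["user_id", "signup_date"],
    "streams": ["user_id", "started_at", "minutes_watched"],
}
REPORTS_IMPORTS = {
    "users": ["user_id", "user_name", "country"],
    "streams": ["stream_id", "user_id", "minutes_watched"],
    "ga_sessions": ["session_id", "user_id", "mkt_channel", "session_start"],
}


def column_count(imports):
    """Total number of columns an app (or the monolith) imports."""
    return sum(len(columns) for columns in imports.values())


def shared_sources(first, second):
    """Sources that both apps import, i.e. the intentional duplication."""
    return sorted(set(first) & set(second))


if __name__ == "__main__":
    print("monolith columns:", column_count(MONOLITH_IMPORTS))
    print("north star columns:", column_count(NORTH_STAR_IMPORTS))
    print("imported twice:", shared_sources(NORTH_STAR_IMPORTS, REPORTS_IMPORTS))
```

In this made-up setup, the North Star app never touches “ga_sessions”, so a change to “mkt_channel” cannot break it.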
By doing so, we gain things, and we lose things.
What Do We Gain?
We gain flexibility that is aligned with value creation. That means, for instance:
1. If one of our imports breaks, it’s the only piece that breaks. Only one app breaks, the other keeps on working. So in our example, our North Star dashboard is not impacted by anything breaking inside the reports.
2. We are able to modify both things separately. We don’t have to “run all tests”; we have a much smaller unit to work on. This means working on one unit locally is much faster, because it only requires specific knowledge of this part.
3. We are able to attach very different levels of “SLAs” to our apps. It is now pretty easy to increase the update frequency of our North Star dashboard because it’s broken down to only the essential data pieces.
4. If we look locally, that means inside each component, we import way less data than before. We don’t need the same columns/attributes for both apps, we are able to select. We are even able to implement different data load logics, updating correctly depending on the context. In our example, our North Star dashboard doesn’t care at all about changing user names, so it doesn’t need that in the update process.
5. We are able to choose the right tools for the right job. We could for instance decide to model our dashboard data inside a simple Python function while keeping the reports in a more standardized dbt project.
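Points 1 and 3 can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `DataApp` class, the app names, and the cron expressions are hypothetical, and a real orchestrator would handle scheduling and failures for you. The sketch just shows two apps with separate schedules, where a failure in one does not stop the other.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DataApp:
    """A hypothetical, independently deployable piece of the data system."""
    name: str
    schedule: str  # cron expression; each app gets its own
    run: Callable[[], str]


def run_north_star() -> str:
    return "north star refreshed"


def run_reports() -> str:
    # Simulates the marketing team introducing an unexpected channel value.
    raise ValueError("unknown value in mkt_channel")


APPS: List[DataApp] = [
    DataApp("north_star_dashboard", "0 * * * *", run_north_star),  # hourly
    DataApp("reports", "0 6 * * *", run_reports),  # once a day at 06:00
]


def run_all(apps: List[DataApp]) -> Dict[str, str]:
    """Run every app on its own; one failing app does not stop the others."""
    results = {}
    for app in apps:
        try:
            results[app.name] = "ok: " + app.run()
        except Exception as exc:
            results[app.name] = f"failed: {exc}"
    return results


if __name__ == "__main__":
    for name, status in run_all(APPS).items():
        print(name, "->", status)
```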
What Do We Lose?
We mainly lose the simplicity of the whole system. We duplicate things and make them more complex.
1. We import data more than once.
2. We model some data twice, intentionally. That might be because we truly want to model it twice in the same way and haven’t yet created an abstraction to handle that (because it’s not needed yet), or it might be because we actually want to model it differently because it really depends on the context.
3. We import & model differently, which means the data is not the same in the two apps. And it’s not supposed to be!
4. We need to apply extra care because we don’t want to produce a technology zoo, which means we’ll have to put some abstractions in place here and there in our tooling.
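Points 2 and 3 may feel uncomfortable, so here is a tiny, made-up illustration in Python. The events, the 10-minute threshold, and both definitions of “active user” are assumptions of mine, not anything from a real system. The sketch shows the same raw data modeled twice, with deliberately different results depending on the context.

```python
# Hypothetical raw streaming events, shared by both apps.
STREAMS = [
    {"user_id": 1, "minutes_watched": 45},
    {"user_id": 2, "minutes_watched": 2},
    {"user_id": 3, "minutes_watched": 30},
]


def north_star_active_users(streams, min_minutes=10):
    """Dashboard context: only users who really watched something count."""
    return {s["user_id"] for s in streams if s["minutes_watched"] >= min_minutes}


def reports_active_users(streams):
    """Reporting context: every user with at least one stream counts."""
    return {s["user_id"] for s in streams}


if __name__ == "__main__":
    print("north star:", sorted(north_star_active_users(STREAMS)))
    print("reports:", sorted(reports_active_users(STREAMS)))
```

Both numbers are “correct” within their own app; they simply answer different questions.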
Apps/Domains/Microservices/DAGs?
I like the central system as the default option. Once you notice that a certain area is requested by the same people again and again, or that it seems much more important than others, it’s time to break things out to gain flexibility & speed.
I propose to let the data teams use consumption data domains as the main seams to cut across. The results might become separate DAGs inside an orchestrator, they might become complete “apps” inside a consumption domain or you could call them microservices.
To start separating, you can either consult your company’s domain landscape or do some business capability mapping and identify a few domains on your own. As a result, you might get domains like “find potential customers”, which might be very marketing-related, or “increase customer retention”, or anything like that.
If you realize you need to import one data source too many times, and the source is not willing to accommodate that, you can still build an “ingestion service” just for one data source. In this case, you’d be implementing a source-oriented domain.
But if you do so, think of it as a true microservice, meaning built-in fault tolerance and possibly a light version of “caching” on both sides.
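As a rough idea of what “fault tolerance plus light caching” could look like, here is a minimal sketch in Python. The `CachedIngestion` class and its parameters are hypothetical names I made up, and it only covers the caching on the ingestion side: it calls the slow source at most once per time window and falls back to the last good copy when the source is down.

```python
import time
from typing import Any, Callable, Optional


class CachedIngestion:
    """Hypothetical ingestion service wrapping one slow or fragile source."""

    def __init__(
        self,
        fetch: Callable[[], Any],
        max_age_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch  # the expensive call to the source system
        self._max_age = max_age_seconds
        self._clock = clock  # injectable, which makes testing easier
        self._value: Any = None
        self._fetched_at: Optional[float] = None

    def get(self) -> Any:
        now = self._clock()
        is_fresh = (
            self._fetched_at is not None
            and now - self._fetched_at < self._max_age
        )
        if is_fresh:
            return self._value  # light caching: do not hit the source again
        try:
            value = self._fetch()
        except Exception:
            if self._fetched_at is None:
                raise  # nothing cached yet, so we cannot hide the failure
            return self._value  # fault tolerance: serve the stale copy
        self._value, self._fetched_at = value, now
        return value
```

Each consuming app would then call `get()` on a shared instance instead of querying the source directly, for example with `max_age_seconds=24 * 60 * 60` for a source that should only be called once a day.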
Finally, I don’t think it’s important how you deploy the separate pieces, whether as separate DAGs, Python apps, or anything like that. These ideas apply to any means of deployment.
I hope it did not escape your attention that this way of designing things is much more future-proof. It enables you to turn parts of your data, the ones contained in a well-defined context, into a machine learning application, serve them into a CRM system, …