Data Mesh Applied

Moving step-by-step from a monolithic data lake to a decentralized 21st-century data mesh.

(also check out the follow-up article: three kinds of data meshes)

Left: a data lake with central access; right: users accessing data from domain teams providing great data products. (all images by the author)

What does a 21st-century data landscape look like? Zhamak Dehghani from ThoughtWorks gave a beautiful and, for me, surprising answer: it’s decentralized and very different from what we see in almost any company today. The answer is called a “data mesh”.

If you feel the pain of current data architecture in your company, as I do, then you want to move to a data mesh. But how? That is what I explore in this article.

But first, a short recap on the data mesh.


Twitter Summary of the data mesh

Modern software development needs a decentralized approach to data. Data must be considered a product by the team that generates it; that team needs to serve it; analytics teams and software teams need to change!

Longer Summary

DDD, microservices & DevOps changed the way we develop software over the last decade. Data in the analytics department, however, did not catch up. To speed up data-based decision making in a company with a modern development approach, analytics & software teams need to change.

(1) software teams must consider data a product they serve to everybody else, including analytics teams

(2) analytics teams must build on that, stop hoarding data and instead pull it in on-demand

(3) analytics teams must start to consider their data lakes/ data warehouses as data products as well.


If the short summary appeals to you, let me walk you through how to actually get to a data mesh from your current starting point. We will walk through an example, passing by legacy monoliths, data lakes and data warehouses on our way, moving step by step from our “old” system to the new one.

SIDENOTE: Calling data lakes “old” might seem odd to you, and it does to me as well. James Dixon, then CTO/ founder of Pentaho, imagined the concept of a data lake only 10 years ago. However, the central shifts in what surrounds data lakes, that is the software: DevOps, DDD, microservices, also only emerged in the last decade. So we do need to catch up, as the central all-mighty data lake is an answer to an old problem, from before those trends changed how we develop software at all. And besides, an all-mighty data lake is not what Dixon imagined in the first place.

We start with an example of our typical micro-service architecture for an e-commerce business.

  1. We show what this example looks like with a data lake/ data warehouse architecture (point A),
  2. then compare it to a data mesh architecture (point C),
  3. then take that example, but add in a “data lake as a data node” (B), because this is really how we get from A to C.
  4. We consider the pain points that should kick off our move from A to C.
  5. We go step-by-step from A -> B -> C.
  6. We consider specifics of which parts to move first.
  7. We consider possible problems and how to deal with them.
  8. We consider one alternative approach to the problem.

An E-Commerce Microservice Architecture with a Data Mesh Architecture

E-commerce business modeled with three operational microservices.

That’s a basic microservice architecture with two domains: a “customer domain” with a customer API and a CRM system, and an “order domain” with an order API. These services are operational services; they run the e-commerce website. These APIs let you create orders at the order API, customers at the customer API, leads in the CRM system, check credit lines and so on. They might be REST APIs, combined with some event stream or pub-sub system; the specific implementation doesn’t really matter.

SIDENOTE: For us, orders and customers are different domains. That means the language in those domains might differ. A “customer” seen from team 2, the order team, has exactly one meaning: someone, identified by a customer_id, who just bought something on the website. On team 1 the meaning might differ. They might consider a customer an entity from the CRM system, which can change state from a mere “lead” to a “buying” customer; only the latter is known on the team 2 side.

Team 1 owns the customer domain. They know this domain by heart. They know what a lead is, how leads transition into actual customers, and so on. Team 2, on the other hand, knows everything about the order domain. They know whether a canceled order can be restored, what the order funnel on the website looks like, and much more. The teams might know a little bit about each other’s domain, but not all the details. They don’t own them.

Both domains generate a lot of data as a byproduct, and lots of people in the organization need that data. Let’s take a look at some of them:

  • The data engineer: Needs both order & customer data to run transformations that generate OLAP cube base data and modeled data; he also needs the data to test & understand it before he starts working on his transformations.
  • The marketing people: Need the overview of orders by item categories to expand their campaigns, dynamically every day.
  • The data scientist: is building the recommendation system and thus needs all the order data up-to-date all the time to train his systems.
  • The management: Wants aggregate overviews of overall growth.

The data lake/ data warehouse solution to those requirements will look something like this.

A central team of data engineers would most likely be supplying all the data, via ETL tools or streaming solutions. They would have a central data lake or data warehouse, and a BI frontend for marketing & management to use.

Data scientists might take data straight from the data lake which is probably the simplest way for them to access the data.

What possible problems do we see with this architecture?

  • This architecture creates a central bottleneck at the data engineering team,
  • it will probably cause domain knowledge to be lost somewhere along the way through the central hub,
  • and it makes prioritization of all those different, heterogeneous requirements hard.

So far so good. What about the data mesh approach?

Here is the same e-commerce website with a data mesh architecture.

Green: new data-APIs. Bottom: management with direct BI tool access, marketing with data from a data-API. Left: data scientist with data from a data-API.

What changed? For starters, data scientists & marketing people can access data from the source domain! But there’s much more.

SIDENOTE: The key to the data mesh architecture is to get the data DATSIS: Discoverable, Addressable, Trustworthy, Self-describing, Inter-operable & Secure.

Let’s walk through the points step-by-step:

  • The customer domain: The customer domain got two new “data-APIs”, which are read-only. There might be only one API, or two; that doesn’t really matter for the example. In either case, the customer domain will make sure to link up the concepts of a “customer” from the CRM system and the customer API.
  • The order domain: The order domain got one new data-API, the order-data-API.
  • Example of customer-data-API data: The customer-data-API might have multiple endpoints:

allCustomers/: Serving data, one “customer” per line.

stats/ : Serving data with statistics like “Num customers: 1,000, Num leads: 4,000; customer calls: 1,500, customer contacts in SME: 500, customers in SME: 600”

more endpoints.

  • Example of order-data-API data: The order-data-API might have multiple endpoints:

allOrderItems/: Serving one order line item per line.

allBuckets/: serving one bucket, which is a collection of order items, per line.

stats/: Serving data with statistics like “orders: 1,000,000, orders in 2019: 600,000, average bucket volume: $30”; the stats endpoint might take parameters like date ranges, years,…
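To make the stats/ idea concrete, here is a minimal sketch of the aggregation such an endpoint could run. It uses no web framework, and the field names (`bucket_id`, `price`, `date`) are illustrative assumptions, not part of the article’s example:

```python
from datetime import date

def order_stats(order_items, start=None, end=None):
    """Aggregate order line items into the kind of summary a stats/
    endpoint could serve; start/end narrow the date range."""
    items = [
        i for i in order_items
        if (start is None or i["date"] >= start)
        and (end is None or i["date"] <= end)
    ]
    buckets = {}  # bucket_id -> total volume of that bucket
    for i in items:
        buckets[i["bucket_id"]] = buckets.get(i["bucket_id"], 0.0) + i["price"]
    return {
        "orders": len(items),
        "avg_bucket_volume": sum(buckets.values()) / len(buckets) if buckets else 0.0,
    }

orders = [
    {"bucket_id": "b1", "price": 20.0, "date": date(2019, 3, 1)},
    {"bucket_id": "b1", "price": 10.0, "date": date(2019, 3, 1)},
    {"bucket_id": "b2", "price": 30.0, "date": date(2020, 1, 5)},
]
print(order_stats(orders, start=date(2019, 1, 1), end=date(2019, 12, 31)))
# {'orders': 2, 'avg_bucket_volume': 30.0}
```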

  • The data-APIs are read-only; the operational APIs are not. The data-APIs have DATA as their product, perfect. You can pin SLAs to them and check their usage. They are modeled as their own APIs; we will not abuse the order API as a data-API. We can thus focus separately on different users.
  • The *-data-APIs could be implemented in any reasonable form like:

– As CSV/ parquet files located in an AWS S3 bucket (endpoints separated by subfolders, APIs separated by top-level folders) (addressable)

– As REST APIs via JSON/ JSON lines

– Through a central database and schemata. (Yes, I get that “central” is not “decentralized”.)

  • Schemata are located alongside the data. (Self-describing).
  • The CRM system can be treated as both an operational API and a data-API, but you really want to wrap it to adhere to the standard you set. Otherwise, you lose the benefits of the data mesh architecture.
  • All the Data APIs should have the same format. That makes consumption really easy! (inter-operable & secure)
  • The data-APIs are discoverable via a Confluence page or any more advanced form of data catalog; we know which team owns the data and can use it downstream. (Discoverable)
  • There is a new domain: the data engineer just got his own domain of modeled data for business intelligence. He knows he’s serving exactly one stakeholder. This domain is wrapped as a service, serving only that one stakeholder. That way, the data engineers can focus on and properly prioritize the management’s need for modeled data.
  • The marketing team can access their “order data by categories” straight from the source, as it’s domain-specific.
  • The BI system sources from the database, which we wrapped as a data service. Why? Because we’re only serving management with it, and they only want modeled and joined data, which we don’t get from the APIs, and that is fine. “Overall growth” sounds like an entity that is not tied to one of the domains but is cross-domain.
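The “schemata located alongside the data” point above can be sketched in a few lines. This is a minimal local sketch, in which folders stand in for an S3 bucket (a real setup would upload via boto3), and the `publish` helper and field names are my own illustrations, not the article’s standard:

```python
import csv
import json
from pathlib import Path

def publish(endpoint_dir, rows, schema, version="v1.0.0"):
    """Write a data file with its versioned schema right next to it,
    so any consumer can read field names and types without asking the team."""
    endpoint = Path(endpoint_dir)
    (endpoint / "schemata").mkdir(parents=True, exist_ok=True)
    # The versioned schema lives alongside the data (self-describing).
    (endpoint / "schemata" / f"{version}.json").write_text(json.dumps(schema, indent=2))
    names = [c["name"] for c in schema["columns"]]
    with open(endpoint / f"{version}.datapart01.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=names)
        writer.writeheader()
        writer.writerows(rows)

schema = {"columns": [{"name": "customer_id", "type": "string"},
                      {"name": "state", "type": "string"}]}
publish("data-services/customer-data/allCustomers",
        [{"customer_id": "c1", "state": "lead"}], schema)
```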

Let’s take a look at the requirements of the data users and what changed

  • The data engineer: The data engineer already receives mostly modeled data from the data-APIs. That means no domain knowledge is lost. He has SLAs to view and knows exactly what he’s getting. He has an easy time using the one standard API type, shared by all data-APIs, to combine the data in any way and put it into his own data-service. He knows exactly whom to ask for a specific piece of data, and all the data is documented in the same place.
  • The marketing people: Could pull the data they need straight from the order source, even if the data engineer’s data-service did not (yet?) supply that information. As a result, if they want a change in that data, they can go straight to someone with domain knowledge. If they want to incorporate “funnel data”, they can ask the team that actually knows what that is!
  • The data scientist: Can go straight to the order-data-API, which is tested & has SLAs for the huge amount of reading he will be doing all the time. The data is there in a second and does not require hacking a DB, which is what I’ve seen done more than once. It’s production-ready and can be incorporated into the recommendation system right away. The data scientist has an easy time implementing their version of CD4ML.
  • The management: Still gets their aggregate views through the business intelligence system. But possible changes can now, depending on the domain, be implemented in three places, not just one. The central data team is no longer the bottleneck.

The data team is still there, but the load is now appropriately distributed to decentralized actors, who are better suited for the job anyway. However, the data team also has its own service. What could that look like exactly? Let’s see how a data lake still fits into the data mesh & what the possible pain points are. There is an important transition state if you start with one.

Our Data Lake, Just another Node

There are three situations in which a, now not necessarily central, data lake or warehouse still makes sense:

  • If we want to combine two data domains to model something intermediate, this cannot happen in one of the domains but should happen in a new one.
  • If we want to integrate external data like market data. External data usually will not adhere to our standards, so we need to wrap them somehow.
  • If we transition from point A -> C, we won’t just throw away our data lake, but we will trim its complexity down.

The Pain Points

When should you consider moving to a data mesh? First of all, if you’re happy with your structure, if you’re happy with the way your company uses data to make decisions, then don’t. But if you feel any of the following pains, the solution is the data mesh.

  1. If you have domain complexity in combination with microservices/ domain-driven design, you will probably feel that things are too “complex” for a central team to properly serve all that data at once.
  2. You think importing data into the data warehouse is costly and are therefore dismissing data sources that would be valuable to individual users. Those sources should be served individually and are perfect candidates for a “carve-out as a data mesh node”.
  3. You haven’t closed the loop of data -> information -> insight -> decision -> action back to data.
  4. Your data -> data speed in the Continuous Intelligence Cycle is measured in weeks & months, not days or hours.
  5. You’re already moving “transformation of data as close to the data-users as possible”. That’s something we are currently working on, and usually it’s a sign of a bottleneck in the data -> information -> insight -> decision -> action -> data pipe. This could be considered an intermediate stage; see the last paragraph for more on that.

Going from Monolithic Data Lake to Data Mesh

Let’s get real. A data warehouse or a data lake, together with a central analytics team responsible for importing and modeling data. A legacy monolith from which the team imports data without APIs, possibly with direct database access, and lots and lots of ETL jobs, tables, etc. Maybe we have some new microservices in new domains… Let’s keep this simple but generic.

SIDENOTE: I like Michael Feathers’ definition of legacy code: code without tests. And that’s what I mean: big, ugly, unhappy code which no one likes to work with.

Remember, the goal is to get all the data DATSIS, step-by-step.

Step 1: (Addressable data) Reroute Data Lake data & change BI Tool access.

All the data is currently consumed & served through the data lake. If we want to change that, we first need to flip the big switch there, while fixing the standard of addressability for the future migration.

For this purpose, let’s use S3 buckets. We thus fix the standardization as such:

Example: A {name}-data-service is reachable via:

– s3://samethinghere/data-services/{name}

In detail: all services have at least one endpoint, the default data endpoint. Other endpoints are subfolders like:

– s3://samethinghere/data-services/{name}/default

– s3://samethinghere/data-services/{name}/{endpoint1}

– s3://samethinghere/data-services/{name}/{endpoint2}

Schema versions are at:

– s3://samethinghere/data-services/{name}/schemata/v1.1.1.datetoS.???

Here we use semantic versioning in the format “vX.Y.Z”, with the date given to the second.

Data files are named in the form “vX.Y.Z.datapart01.???”, limited to 1,000 lines per file for easy consumption.
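The addressing convention above can be captured in a small helper, so every team builds paths the same way. A sketch, with the bucket name taken from the convention and the file extension left open just as in the original:

```python
def service_path(name, endpoint="default", bucket="samethinghere"):
    """Canonical address of a data-service endpoint:
    s3://{bucket}/data-services/{name}/{endpoint}."""
    return f"s3://{bucket}/data-services/{name}/{endpoint}"

def data_file_name(version, part):
    """Data files are chunked as vX.Y.Z.datapartNN (extension left open)."""
    return f"{version}.datapart{part:02d}"

print(service_path("order-data", "stats"))
# s3://samethinghere/data-services/order-data/stats
print(data_file_name("v1.1.1", 1))
# v1.1.1.datapart01
```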

We reroute the data lake to its new “address” and change the BI tool access:

s3://samethinghere/data-services/data-lake/default

s3://samethinghere/data-services/data-lake/growthdata

s3://samethinghere/data-services/data-lake/modelleddata

???

This changes nothing yet for the rest of the organization; we still need to give them access.

Step 2: (Discoverability) Create a space to find our new data-* source.

We can implement the simplest form of discoverability by creating a page in our knowledge management system (e.g. Confluence, your internal wiki, …).

Alright, so now new people, other than those currently using the data lake, can find the data. Now we can start adding nodes to our data mesh. We can go either way: by breaking out a shiny new microservice, or by breaking out one of those nasty old legacy pieces.

Let’s consider the microservice case first.

Step 3: Break out a new microservice.

The point of breaking out a service is to put ownership into the domain team creating the data, so you could, for instance, move someone from the analytics team into the responsible domain team. For now, let’s take the “order team”.

We create the new order-data-API, fix a basic set of SLAs, and make sure to adhere to the standard we set for the data lake. We’d now have two data-services:

s3://samethinghere/data-services/data-lake/default

s3://samethinghere/data-services/data-lake/growthdata

s3://samethinghere/data-services/data-lake/modelleddata

s3://samethinghere/data-services/order-data/default

s3://samethinghere/data-services/order-data/allorderitems

s3://samethinghere/data-services/order-data/stats

Put the new service into the discoverability tool.

A second alternative is to let the central analytics team create this data-service; in that case, the ownership would still reside there. But at least we have separated the services.

Step 4: Break Out a Legacy piece.

Legacy systems are usually not as nice to work with as shiny new microservices. Usually you’ll be sourcing data from database tables you don’t even know, pulling CSVs from some server, or dealing with some other form of legacy: a poorly documented, non-standardized interface.

And that’s ok. You can keep it that way for now. You already have some way of importing that data into your data warehouse or data lake, so break it out of that and denote it as a data-service.

For example, you could go from:

Source DB — ETL Tool → raw data in data lake → transformed data in data lake

to a wrap around the first two stages, and use the standardization:

(Source DB — ETL Tool → raw data in data lake → S3 bucket) = new data service

(S3 Bucket of new data service)— ETL Tool → import data into data lake → transformed data in data lake

That way, when you transfer the service, the domain team only needs to switch the backbone, and dependent users can switch to the new way of consuming data even before the domain team takes ownership.
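The wrapping step can be sketched very simply: republish whatever the legacy ETL already produces under the standardized data-service layout. Local folders stand in for the S3 bucket here (boto3 uploads in a real setup), and the `raw_extract` folder and file names are illustrative:

```python
import shutil
from pathlib import Path

def wrap_legacy_extract(raw_extract, service_root):
    """Republish the files the legacy ETL already produces under the
    standardized data-service layout; consumers can then switch to the
    standard address before the domain team takes ownership."""
    default = Path(service_root) / "default"
    default.mkdir(parents=True, exist_ok=True)
    for f in sorted(Path(raw_extract).glob("*.csv")):
        shutil.copy(f, default / f.name)
    return sorted(p.name for p in default.iterdir())

# The legacy ETL keeps writing to 'raw_extract/'; the wrapper only mirrors it.
Path("raw_extract").mkdir(exist_ok=True)
(Path("raw_extract") / "orders.csv").write_text("order_id,total\n1,30\n")
print(wrap_legacy_extract("raw_extract", "data-services/legacy-orders"))
# ['orders.csv']
```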

Step 5: (Discoverability) Switch discoverability & BI tool source.

Now start pushing your data-services to a general audience to get quick feedback; get the marketing team to source from the service you’ve broken out. Then switch the BI tool to the now two data-services, not just one.

You can then think about switching off the support for the order data in your data-lake-service.

Step 6: Migrate ownership.

If you’re here, congrats: you’ve broken out the first parts of your central data lake. Now you need to make sure ownership is transferred as well, before new feature requests trickle in for those services. You can do so:

  • by migrating some people, together with the service, to the domain team,
  • by creating a new team for the new service,
  • or by simply migrating the service to the domain team.

Step 7: Continue.

Wrap, wrap, wrap, break out more and more services. Gracefully roll out the old parts and replace them with new APIs. Start to gather new feature requests for the distributed services.

Your central data lake will have become quite small by now, containing only joined & modeled data, as will your data team, if you started transferring people.

Step 8: (TSIS) Make it Trustworthy, Self-describing, Inter-operable and Secure.

Build a common data platform. That might mean libraries for everyone to place the files in the right location, or any other more sophisticated toolset. Whatever duplication there is across the teams, you will be able to take most of it into central hands. For instance, if you quickly notice that AWS S3 files are not easily accessible to people in marketing & sales, you might decide to switch from S3 to a central database that is accessible via Excel, etc.

In that case, you’d want a library that makes the switch a simple upgrade, without much hassle for the teams. In an AWS setup you could, for instance, create a Lambda function with a generic “data-service-shipper” that:

  1. takes the versioned schemata and maps them to a database schema in the central database, and
  2. ships the data into the appropriate schema in the database.

That way, the domain team has next to no effort beyond upgrading its “library”. Other options include creating a generic REST API to which you signal the data and its location, letting the API handle the rest, like converting CSV, parquet, etc. into a single format.
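A sketch of such a shipper, using sqlite3 as a local stand-in for the central database; the type mapping and the `ship` signature are my own assumptions, not a fixed design:

```python
import sqlite3

# Assumed mapping from data-service schema types to SQL column types.
TYPE_MAP = {"string": "TEXT", "int": "INTEGER", "float": "REAL"}

def ship(conn, service, schema, rows):
    """Map a versioned data-service schema to a database table and load
    the rows, so consumers can query via SQL (or Excel) instead of S3."""
    table = service.replace("-", "_")
    cols = ", ".join(f'{c["name"]} {TYPE_MAP[c["type"]]}' for c in schema["columns"])
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({cols})")
    marks = ", ".join("?" for _ in schema["columns"])
    conn.executemany(
        f"INSERT INTO {table} VALUES ({marks})",
        [tuple(r[c["name"]] for c in schema["columns"]) for r in rows],
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
schema = {"columns": [{"name": "order_id", "type": "int"},
                      {"name": "total", "type": "float"}]}
ship(conn, "order-data", schema, [{"order_id": 1, "total": 30.0}])
print(conn.execute("SELECT * FROM order_data").fetchall())
# [(1, 30.0)]
```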

What part of data do I choose first to break out?

So, as with microservices, the best way to start with a monolith is to break out parts once you feel a certain “pain”. But which part do we break out first? It’s a judgment call based on three considerations:

  1. Cost: How hard is it to break out the data?
  2. Benefit: How often does the data change?
  3. Benefit: How important is the data to your business?

The two benefit questions indirectly signal how many use cases for a true data-service you will be able to collect, because changing data implies changes in the data-service, and important data implies that many people will want insights from that data service.

If you weigh these factors, you can come to different conclusions. For instance, in our example, the customer domain could be a good place to start, because such data is quite likely to change often. However, it is sometimes less important than the order data, which on the other hand might be hard to break out, depending on how many thousands of ETL jobs you have already put on top of it.
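A toy way to weigh the three considerations; the 1-5 scale, the formula, and the scores below are purely illustrative, and a real decision would stay a judgment call:

```python
def breakout_score(cost, change_rate, importance):
    """Toy score for break-out candidates: benefit over cost.
    Inputs on a 1-5 scale; the formula is illustrative, not prescriptive."""
    return (change_rate + importance) / cost

candidates = {
    # cost, change rate, importance: illustrative guesses for our example
    "customer-data": breakout_score(cost=2, change_rate=5, importance=3),
    "order-data": breakout_score(cost=4, change_rate=3, importance=5),
}
print(max(candidates, key=candidates.get))
# customer-data
```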

If you have a place to start, there are still stepping stones in your way.

Stepping Stones

The teams currently providing data as a byproduct have no incentive to properly care for that data, mostly because there is no direct feedback from the potential “stakeholders/ consumers” of that data.

That’s something that has to change, and you have to take care of it as a central component. It’s probably why Zhamak Dehghani proposes taking specific use cases, identifying the users, and setting up a new team to care only for that specific user. I, on the other hand, don’t see why the current team, e.g. the order team, can’t take that role. True, the shift is a little bit harder, but it’s much easier on the resources the company has to spend, and probably an easier sell.

If you are not able to get the data generating teams to jump on that train, you have two options:

  1. create a new team and take a single use case
  2. use your existing central team to take the role and gather data. Check the requirements on the data-service and the value it’s creating, and decide later where to push it.

Let’s finish by exploring possible alternatives to this architecture.

Are there alternatives?

I tried to come up with an alternative but realized there is more of a matrix of different implementations.

The key concept of a data mesh is decentralized ownership. We might say that since domain teams usually consider their data a byproduct, they don’t really own it; a data lake, then, is centralized ownership of that raw data.

If we now distinguish between raw & transformed data, we can see four different data architectures that are possible. We can also see 2–3 different ways of moving from a data lake to a data mesh.

Ownership of raw & transformed data can both be central or decentralized. This produces four quadrants with a variety of solutions.

What we described above is the move from “data lake” to “point B” and then to the full data mesh.

However, a second option is to implement a decentralized “ownership of transformed data” first, and then possibly think about the move to a full data mesh.

What can decentralized ownership of transformed data look like?

  • A data lake could still import all “raw data”
  • The raw data could then be accessed by data-knowledgeable users close to the decision-makers and transformed in local desktop ETL solutions.
  • The raw data could also be pushed into decentralized data warehouses, where “someone” closer to the user could do basic ETL on that data.
  • Of course, each department could have its own little data team doing just ETL for that department.

Where is the difference? In this scenario, you could collect a lot of requirements and sharpen the exact use cases the departments have for the data. Departments like marketing are often closer to the domain than the in-between data team, so you would gain some edge on the “domain language” problem, but not all of it. You would also still keep a central bottleneck on raw data consumption, and you would not push “data as a product” into the domain teams. Both of these I see as necessary somewhere in the future.

The End

I tried to write a shorter post than Zhamak Dehghani, but that did not seem to work out. Here are the only four places I could find information on data mesh architectures:
