Opinion
On Redshift, Data Catalogs, Query Engines like Presto, and the trouble machine learning engineers have getting to their data.
Data Meshes are the hot & trending topic in data & analytics departments, implemented at big companies like Zalando and moved from "Assess" to "Trial" on the ThoughtWorks Technology Radar within just one year. Yet the results I've seen are not overly impressive.
Quite a few articles raising concerns have appeared throughout the past year, and I, at least, have received quite a few questions and quite a bit of confusion about the topic after publishing my first article about data meshes.
Most concerns & confusions seem to have one idea in common: the idea that there is just one kind of data mesh, one that closely resembles those described by the BMW Group and Zalando.
There is actually a continuous spectrum of data meshes, all with different strengths, very different technical architectures, suitable for different end users, and very different in the degree of team involvement. But above all, they differ in what they can do and what they should be used for. So let's have a look at three important types of architecture.
The Three Kinds of Data Meshes
Let us look at a centralized, a decentralized, and what I consider the “standard” data mesh architecture.
1. The standard data mesh
Imagine a huge bunch of data buckets, created by a central data platform team. One bucket contains data from team A, another data from team B, and so on.
Imagine in addition, some cool access points, some way of “looking into these buckets”.
Finally, imagine a catalog of all the data, and a cool way of searching through it, maintained by the teams.
Team A and Team B are thus directly connected to the end users of their data. This results in both teams taking real "ownership" of their data in the sense that they now have new customers: data-customers. Data-customers like machine learning engineers, data analysts, or data engineers have special requirements that the teams can now process & implement just like any other product requirement.
This is what I'd like to call the "standard data mesh", although from my point of view it is actually not what Zhamak Dehghani discussed in the original data mesh article. Nevertheless, it seems to be what people perceive as the standard option.
2. A mostly centralized data mesh
It turns out the core principles of data meshes don't really depend on the technical implementation. I'd like to highlight a very centralized implementation.
Imagine one big bucket, with a smooth way of looking into that bucket. In addition, imagine ten little pipes that let teams A, B, and so on pour data into our big bucket.
Imagine also, the teams pour data in roughly the same format into that bucket.
Now, all of a sudden, our end users get their data access in an extremely fast & smooth way. The data is standardized, so they probably get lots of standard views and can extract information from different sources quickly & efficiently. The data platform team can launch this thing quickly because there's no need to connect lots of different data sources. And still, the data can be owned by the teams. They can still be in charge of serving their data-customers, transforming data, and so on. Ownership, transformation & serving here are really not tied to the technology, but much more to whether teams come with this "data product mindset" or not.
3. A mostly decentralized data mesh
How about going to the other extreme?
Imagine ten very heterogeneous data buckets, no longer all of the standard size.
Imagine ten pipes, which now siphon the data from these buckets into one big central bucket.
Imagine again some smooth data access mechanism. This time, however, teams A, B, and so on have also built custom access points to their data pools to let their more special customers access their specific data.
Sounds like a lot of work for the central data team, but with extra flexibility for the data-customers! And voilà, we have arrived close to the other end of the spectrum.
A Definition of Data Meshes
In my eyes, all three architectures sound like data meshes, at their core implementing data decentralization, pushing ownership into the teams, pushing data product thinking into the teams. So let's try to find a definition.
Definition
I like the following definition of data meshes:
Data meshes are a decentralization technique: a data mesh is the decentralization of the ownership, transformation & serving of data.
With that definition, data meshes are a decentralization technique just like microservices or micro frontends (which I elaborate on in another article).
The Continuous Spectrum Of Data Meshes
Technically, you could argue that the "mostly centralized" data mesh above sounds like the data ain't owned or served by the teams. I beg to differ. Even if we had ten teams pushing data to Google Analytics, would you say the data is owned & served by Google Analytics? I hope not. Of course it's served (through a third-party tool) by the individual teams, and it is owned by the individual teams, even though technically the chosen solutions coincide.
Even this completely centralized option still features the key concepts of a data mesh. It still powers data product thinking as long as individual teams are held responsible for their data. It features a distributed domain architecture as long as teams get the freedom to work on their data (within the given bounds) in their own way. And it features infrastructure as a platform by, for instance, providing common interaction tools like Google Tag Manager or Google Data Studio in a platform way.
Whatever kind of data architecture (data mesh or not) we choose as a company, we will end up needing some kind of "glue", some way of gluing the distributed, decentralized parts back together.
It is that glue that I see varying along the continuous spectrum of data meshes. On one end there is "really strong glue", allowing for almost no variation; on the other end there is "really light glue", allowing parts to move around a lot.
Exemplary Architectures in Detail
Let us look at some possible implementations of the three data mesh types discussed above. I choose AWS as my infrastructure point of reference because I feel most at home there, but you can always exchange AWS S3 for GCS, AWS Lambda for Google Cloud Functions, and so on.
Let’s take a simple example: An e-commerce company, with a website, serving a large variety of articles. They have a couple of teams:
- “Team Orders”: owning the complete order structure from “add this to basket” to “submit this order” (and everything that happens afterward)
- “Team Details”: owning the article detail pages.
- “Team Front Page”: owning the front page, including the search engine & the search engine results pages.
- "Team Accounts": owning the registration process.
- “Team X”: the data platform team building the data mesh.
We also have a few data end-users:
- "Team Recommendations": the recommendation engine team, working mostly with Team Details (through a nicely done micro frontend, so mostly decoupled).
- "Team Marketing": working together with both Team Details on the article detail texts and Team Front Page on the content of the front page.
- "Team Management": working with all teams, especially interested in key metrics and ways to advance the company and its products.
Alright, let’s get started!
1. The centralized data mesh
An organizational unit with a strong focus on only a handful of key business data concepts might choose to implement a central tracking API, served by the teams, with standard reporting & analysis capabilities built on top of it.
First, let’s see what team X has built for us:
- A central data tracking API. The only thing you can do with it: You can send data to it.
- A standardized JSON schema featuring mandatory fields like "data_owner" and "timestamp" as well as the "acting_customer", the "action", and an "action_category". Optional fields are "order_value" and "value". (Borrowing heavily from the Google Analytics tracking API; a minimal sketch of such a schema follows this list.)
- One central access point for people with tech skills: the distributed SQL interface & query engine Presto.
- One central access point for reports & dashboards: Redash.
- Inside Redash, the team has built a bunch of standard reports like "top events" and "most active acting customers". They allow sorting and aggregation by date, action_category, value, and order_value. For everything with an "order_value", the team built a special "orders" section with a bunch of extra reports filtered to that data.
- A central tracking data retrieval API, allowing for bulk & streaming access, meant for machine learning teams or data engineering workflows.
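To make the standardized schema a bit more tangible, here is a minimal sketch of what it could look like, expressed as a JSON Schema and checked with Python's `jsonschema` package. The field names come from the list above; the nesting under `key_data`, the types, and the required/optional split are my assumptions for illustration, not the team's actual contract.

```python
# A minimal sketch of the standardized tracking event, validated with the
# jsonschema package (pip install jsonschema). Field names follow the list
# above; the nesting, types, and required/optional split are illustrative
# assumptions, not the real contract.
from jsonschema import validate

TRACKING_EVENT_SCHEMA = {
    "type": "object",
    "properties": {
        "data_owner": {"type": "string"},      # e.g. "Team Orders"
        "timestamp": {"type": "string"},       # ISO 8601 timestamp
        "key_data": {
            "type": "object",
            "properties": {
                "acting_customer": {"type": "integer"},
                "action": {"type": "string"},           # e.g. "buys something"
                "action_category": {"type": "string"},  # e.g. "checkout"
                "order_value": {"type": "number"},      # optional
                "value": {"type": "number"},            # optional
            },
            # Assumption: only these two are strictly enforced here, so the
            # example data pieces below also pass validation.
            "required": ["acting_customer", "action"],
        },
    },
    "required": ["data_owner", "timestamp", "key_data"],
}

# An example event, roughly matching the "order placed" data piece below.
event = {
    "data_owner": "Team Orders",
    "timestamp": "2021-01-15T10:23:00Z",
    "key_data": {
        "acting_customer": 123,
        "action": "buys something",
        "action_category": "checkout",
        "order_value": 103.4,
    },
}

validate(instance=event, schema=TRACKING_EVENT_SCHEMA)  # raises on violations
```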
I like to think about a single piece of data flowing through the engine to understand it: Let us take one individual piece of data, the “order placed” data piece.
- A new customer clicks on “finish registration”.
- Team Accounts: Decides to share the registration data.
- Team X: The central API receives a standardized data piece: {"data_owner": "Team Accounts", "timestamp": "…", "key_data": {"acting_customer": 123, "action": "registers", "value": "100", "order_value": "0"}}
- The customer clicks the “submit order” button on the website, an order event is emitted in the backend.
- Team Orders: Some service picks up the order event, processes the order, and sends a data point to the central API (a sketch of such a call follows this list).
- Team X: The central API receives a standardized data piece: {"data_owner": "Team Orders", "timestamp": "…", "event_uuid": "…", "additional_meta": "…", "key_data": {"acting_customer": 123, "action": "buys something", "order_value": "103.4"}}
- Team X: The central reporting solution generates reports with order amounts, most active customers, actions, etc., everything the standardized dataset provides.
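For illustration, the call from the order service to the central tracking API might look roughly like the snippet below. The endpoint URL, the auth header, and the use of `requests` are my assumptions; the payload mirrors the standardized data piece above.

```python
# Hypothetical sketch: the order service pushing the standardized "order
# placed" event to the central tracking API. The URL and the auth header are
# made-up placeholders; the payload mirrors the example above.
import datetime
import requests

event = {
    "data_owner": "Team Orders",
    "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
    "key_data": {
        "acting_customer": 123,
        "action": "buys something",
        "action_category": "checkout",
        "order_value": 103.4,
    },
}

response = requests.post(
    "https://tracking.example-shop.internal/v1/events",  # hypothetical endpoint
    json=event,
    headers={"Authorization": "Bearer <team-orders-api-token>"},
    timeout=5,
)
response.raise_for_status()  # the API only accepts data, nothing else
```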
The recommendation engine team enjoys the standardized data; no need to hack into databases or ask other teams for custom APIs. If the recommendation engine team wants to expand into a second product, they already have all the data they need, in a standard format. The data is well documented, because it's used by everyone, and well cataloged.
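Pulling a bulk slice of those standardized events through the central retrieval API could then look roughly like this; the endpoint, its parameters, and the paging scheme are hypothetical placeholders, invented only to illustrate the idea.

```python
# Hypothetical sketch: the recommendation team pulling a bulk slice of the
# standardized events into a DataFrame. Endpoint, parameters, and the paging
# scheme are assumptions for illustration.
import pandas as pd
import requests

def fetch_events(action_category: str, since: str) -> pd.DataFrame:
    """Page through the (hypothetical) retrieval API and return matching events."""
    url = "https://tracking.example-shop.internal/v1/events/bulk"  # placeholder
    params = {"action_category": action_category, "since": since, "page": 1}
    events = []
    while True:
        page = requests.get(url, params=params, timeout=30).json()
        events.extend(page["events"])
        if not page.get("next_page"):
            break
        params["page"] = page["next_page"]
    return pd.json_normalize(events)

orders = fetch_events("checkout", since="2021-01-01")
```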
The marketing team enjoys good data coverage and that they have a lot of reports to view the data right off the bat. No need to annoy an analytics department to produce lots of different reports. The management got everything they need at their fingertips because everything revolves around the few key business concepts they use in the company. If they need some specific report, they can use the custom CSV export of the data and run some Excel magic over it.
The data platform team is happy because they got this project from an initial prototype to having a lasting impact on the decision making capability of the company extremely fast.
2. The “standard” data mesh
Two German giants, the online clothes retailer Zalando and the BMW Group, recently shared some insights on their efforts to build something I'd call a data mesh. Both Zalando's data mesh and the BMW Group's feature an architecture I'd put mostly into the "standard" bucket, although Zalando is a little more on the decentralized side than BMW.
Let’s see how Team X could build a simplified version of this for us:
- A central metadata REST API, allowing teams to set, update, delete & retrieve metadata for a given set of data. This is necessary because we don't have a standardized schema here.
- A central data catalog, sourcing from the metadata store. In this case, the team decided to go for LinkedIn's DataHub for both the API and the data catalog, including a nice UI.
- A collection of AWS S3 buckets, each owned by one team. The teams are bound to a naming convention following “{team identifier}/{data identifier}/…” for pushing their data into these buckets.
- Again one central access point for people with tech skills, a distributed SQL-interface & query engine called Presto, also sourcing from the metadata API.
- AWS Lambda functions that teams can use to shovel data from AWS S3 into a central AWS Redshift database (a sketch of such a function follows this list).
- Again the tool Redash, this time on top of the AWS Redshift database. Team X cannot create standard reports here; instead, they let the teams do so if needed. An analytics engineering team can, this way, easily shovel some data into the AWS Redshift database and create dashboards for the management team. The marketing team gets most of their reports from Team Details and Team Front Page, which run experiments together with the marketing team.
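As a sketch of the "shovel" Lambda from the list above: triggered by a new object in a team's bucket, it issues a Redshift COPY for that object. The table naming, environment variables, IAM role, and the assumption of JSON files are placeholders I made up; a real setup would also handle batching, retries, and URL-decoding of the object key.

```python
# Hypothetical sketch of the "shovel" Lambda: an S3 PUT event triggers a
# Redshift COPY of the new object. Bucket keys follow the
# "{team identifier}/{data identifier}/..." convention from the list above.
import os
import psycopg2  # needs to be packaged with the Lambda, e.g. via a layer

def handler(event, context):
    """Load the newly uploaded S3 object into a team-specific raw table."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]            # e.g. "team-orders/orders/.../part-0001.json"
    team, dataset = key.split("/")[:2]
    table = f"raw_data.{team}_{dataset}".replace("-", "_")  # placeholder naming scheme

    iam_role = os.environ["COPY_IAM_ROLE"]   # placeholder: role allowed to read the bucket
    copy_sql = f"""
        COPY {table}
        FROM 's3://{bucket}/{key}'
        IAM_ROLE '{iam_role}'
        FORMAT AS JSON 'auto';
    """

    conn = psycopg2.connect(os.environ["REDSHIFT_DSN"])  # placeholder connection string
    with conn, conn.cursor() as cur:
        cur.execute(copy_sql)
    conn.close()
```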
The data flow: Let us take one individual piece of data, the “order placed” data piece.
- A new customer views the front page, starts registration, and then clicks on “finish registration”.
- Team Accounts: Decides to share this data, so they send the metadata to the central metadata API: {"name": "the username the customer chose", "timestamp": "time of registration", "source": "the source we extracted from Google Analytics associated with this registration"}
- Team Accounts: They send the actual data to their own AWS S3 bucket within the larger bucket-lake set up by the central data team for all teams.
- A customer clicks the “submit order” button on the website.
- Team Orders: Somewhere in the backend, a service picks up that event and processes the order. Team Orders also decided to share this data, so they send the metadata to the central metadata API: {"customer_id": "the customer id we got from Team Accounts", "timestamp": "time of order", "items": …, "gross_price": …}
- Team Orders: They send the actual data to their own AWS S3 bucket within the larger bucket-lake set up by the central data team for all teams (a sketch of both calls follows this list).
- Team X: The metadata API got new metadata, so it’s put into the “data catalog” for everyone to access.
- Team X: On top of this metadata, we cannot generate standard reports. We can however create a new access point, which then can be used through a generic standard interface anyone at the company can use.
- Team Machine Learning: Also receives new data through their own little "sub-bucket" inside the order team's AWS S3 bucket, "/team-d/recommendation-interface/", but what they get is an update of their bulk data package containing the orders of the whole month.
- Team Front Page: Pushes the data from the front-page views that were part of an A/B experiment into the API, and creates a new report on top of it in Redash for the marketing team.
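Here is a minimal sketch of the two calls Team Orders makes in the steps above: registering metadata with the central metadata API and pushing the actual records to their S3 bucket following the naming convention. The API endpoint, bucket name, and field descriptions are placeholders I made up for illustration.

```python
# Hypothetical sketch: Team Orders sharing the "order placed" data. First the
# metadata goes to the central metadata API, then the actual records go to the
# team's own S3 bucket, following "{team identifier}/{data identifier}/...".
import json
import boto3
import requests

metadata = {
    "name": "orders",
    "owner": "team-orders",
    "fields": {
        "customer_id": "the customer id we got from Team Accounts",
        "timestamp": "time of order",
        "items": "ordered items incl. quantities",   # placeholder description
        "gross_price": "order total incl. VAT",      # placeholder description
    },
}
requests.post(
    "https://metadata.example-shop.internal/v1/datasets",  # hypothetical endpoint
    json=metadata,
    timeout=5,
).raise_for_status()

orders = [{"customer_id": 123, "timestamp": "2021-01-15T10:23:00Z",
           "gross_price": 103.4}]
boto3.client("s3").put_object(
    Bucket="example-shop-data-mesh",                       # hypothetical bucket
    Key="team-orders/orders/2021-01-15/part-0001.json",
    Body="\n".join(json.dumps(o) for o in orders).encode("utf-8"),
)
```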
The recommendation team enjoys this approach because they can ask the order team to provide just the data they need, at the speed they need it. With this, they are able to generate close-to-real-time recommendations. The marketing team enjoys the freedom they have in generating reports, which are easy to create through the generic interface; they can generate new reports on the fly for every new activity they run. The management is happy because they don't have to dig through a lot of boilerplate and instead get their numbers through dashboards created specifically for them.
3. The mostly decentralized data mesh
Let us go one step further in the direction of decentralization. Our little e-commerce company realized quite some time ago that autonomy "with boundaries" is a good way to work. As such, they started to maintain an infrastructure-as-a-service framework for a bunch of storage technologies like Postgres, AWS S3, and a Greg Young event store. As a result, all teams use one of these storage forms. The company also has best practices for storing the respective metadata inside each technology choice.
Let’s make use of that! Team X built:
- A REST API where teams can register their data sources. A team can register a “Postgres” dataset, and provide some additional company-specific information.
- A metadata service that pulls the metadata (Postgres table & column comments, EventStore metadata streams) into the central data catalog, again running on DataHub.
- A custom graphical interface that allows cross-source querying of all of these sources by mapping SQL requests onto all three technologies (for a taste of what such cross-source SQL can look like, see the Presto sketch after this list).
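The custom interface itself is out of scope here, but to give a taste of cross-source SQL: a federated query through Presto (which also shows up in the access layer below) could join the Postgres-backed registrations with the S3/Hive-backed orders as sketched here. The snippet uses the presto-python-client; host, catalogs, schemas, and column names are assumptions, and the event store would still need its own mapping.

```python
# Hypothetical sketch of a cross-source query through Presto, joining
# registration events (Postgres connector) with orders (Hive/S3 connector).
# Uses the presto-python-client (pip install presto-python-client); all names
# here are assumptions for illustration.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.example-shop.internal",  # hypothetical coordinator
    port=8080,
    user="marketing-analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()
cur.execute("""
    SELECT r.source, count(*) AS order_count, sum(o.gross_price) AS revenue
    FROM postgresql.public.registration_events AS r
    JOIN hive.team_orders.orders AS o
      ON o.customer_id = r.customer_id
    GROUP BY r.source
""")
for source, order_count, revenue in cur.fetchall():
    print(source, order_count, revenue)
```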
The data flow: Let us take one individual piece of data, the “order placed” data piece.
- A new customer clicks on “Finish registration”.
- Team Accounts: Decided to share this data. They use an AWS RDS instance for their shared data, which is just a managed Postgres instance, so they use standard column & table comments to store the metadata of this new data piece, and store the data in a table called "registration_events".
- A customer clicks the “submit order” button on the website after viewing a new version of the “article details” pages.
- Team Orders: Decided to share this data. They use a Greg Young event store, so they store the metadata in a custom lightweight metadata API the company built for these stores, and store the actual events in their "orderDataStream".
- Team X: The central data platform team has built connectors for all three standard technologies. Once the new metadata comes in, two of these connectors fetch it and, again, store it in the data catalog (a sketch of such a connector follows this list).
- Team X: Also has built connectors for the data, taking it and siphoning it into an AWS S3 bucket just as in the very first case.
- Team X: On top of this AWS S3 bucket the team has built mostly the access technology mentioned in the “standard option”, like a common querying and dashboarding interface using Apache Superset and Presto.
- Team Details: This time the details team ran an experiment together with the marketing team. To share the data, they use the Google Analytics feature “experiments”, push the data into Google Analytics, and share it with the marketing team.
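As a sketch of what the Postgres metadata connector could do: read the table & column comments Team Accounts maintains and forward them to the catalog. The connection string, schema, and the catalog endpoint are placeholders; a production setup would more likely use DataHub's own ingestion framework instead of a hand-rolled POST.

```python
# Hypothetical sketch of the Postgres metadata connector: it reads the table
# and column comments maintained by Team Accounts and forwards them to the
# data catalog. DSN, schema, and catalog endpoint are made-up placeholders.
import psycopg2
import requests

conn = psycopg2.connect("dbname=accounts host=accounts-rds.internal user=metadata_reader")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT c.table_name,
               obj_description(format('%I.%I', c.table_schema, c.table_name)::regclass,
                               'pg_class')                                   AS table_comment,
               c.column_name,
               col_description(format('%I.%I', c.table_schema, c.table_name)::regclass,
                               c.ordinal_position)                           AS column_comment
        FROM information_schema.columns AS c
        WHERE c.table_schema = 'public'
        ORDER BY c.table_name, c.ordinal_position
    """)
    rows = cur.fetchall()
conn.close()

# Group the comments per table before handing them to the catalog.
datasets = {}
for table, table_comment, column, column_comment in rows:
    entry = datasets.setdefault(table, {"description": table_comment, "fields": {}})
    entry["fields"][column] = column_comment

requests.post(
    "https://metadata.example-shop.internal/v1/datasets/bulk",  # hypothetical endpoint
    json={"owner": "team-accounts", "datasets": datasets},
    timeout=10,
).raise_for_status()
```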
The individual tech teams are really happy with this option because they are extremely quick in sharing new data. They use the tech stack they already know and can specify the data in any way they want. The recommendation team can again get an individual data set just for them, supplied in the respective technologies. To build their first prototype, they use the larger standard tooling to get an initial data draft, so they already have something to show when they start talking to Team Orders about a more customized data set. The marketing team just launched a new marketing automation tool and is also happy to be able to shovel some of the registration data from Team Accounts right into that tool, after the team agreed to quickly provide just the data they need.
Three Equally Strong Solutions?
Looks like all three options have lots of strengths! But I like to think of strengths as simply the mirror sides of the accompanying weaknesses. So every single strength displayed above also has a mirror weakness, which makes the system really hard to work with under other circumstances or requirements.
The most decentralized data mesh is great for the speed of the individual teams as well as for customization. However, in a context with just a few key business data concepts, it will prove extremely slow at increasing the decision-making capabilities of the company. In addition, the recommendation team can be slowed down if it chooses to work on multiple projects with different requirements on the data interfaces.
The centralized data mesh will cause problems for end users once they need customizations. If the recommendation engine team needs real-time data, for instance, the standard interface might not do it anymore. The same goes for in-depth reports & analyses by both the management and the marketing team.
The in-between option might seem like a good compromise then, but in reality it will bring both sets of weaknesses together.
Data Meshes are Tools
I do believe that a high degree of decentralization is the future for most things in technology. But if you don't have a problem that decentralization solves, then you should not try to employ a data mesh. What Zalando describes as their core problem is a problem that can possibly be solved by decentralization (see the webinar on their domain crossover problems). And if you do think you need to decentralize data, you still should choose the right tool for the job.
Choosing the right tool for the job is a hard and crucial decision, but in a world of exponential data growth (see my take on that), you will have to learn to make it.
I hope this article provides you with some guidance on finding the right tool, if you need one at all.
A special thanks to Christoph Gietl for many helpful comments.
Further Reading
If you want to catch up to some of the topics mentioned above, you can have a look here:
- 2019, The very first article about data meshes by ThoughtWorker Zhamak Dehghani.
- 2020, The second article by Zhamak Dehghani on data meshes.
- 2020, ThoughtWorks Webinar on the Zalando data mesh.
- 2020, AWS re:Invent talk about BMW's data "hub", which by my definition is a kind of data mesh.
- 2019, Data Meshes Applied, my first article about data meshes.