… you into three fallacies, the need to build any platform at all, the need to build a data mesh, and lots of coupling inside the platform.
“The most valuable resource is no longer oil, but data,” to put it in the words of The Economist. Extracting that value from data is the new frontier for companies this century. Data meshes appeared in 2019 to fundamentally change how companies work with data. In my words, data meshes are pretty simple (but not easy).
“An organization has a data mesh, if it has decentralized the responsibility for data to the data producing organizational unit. That means transforming & serving data potentially happens decentrally as well.”
In stark contrast, the status quo at many companies is still that data is a “byproduct” for which either no one or the analytics department is responsible.
The tech consultancy ThoughtWorks popularized the idea of data meshes in 2019. Many companies followed with their own implementations and seem to have fallen for some of the following three fallacies. Why is that a problem? Because these fallacies negate the positive effects of decentralization.
The three fallacies are:
- Need to Build a (Large) Platform
- Need for Coupling
- Need to have a Data Mesh
Need to Build a (Large) Platform
“4) It doesn’t matter what technology is used. HTTP, Corba, Pubsub, custom protocols — doesn’t matter.“
That’s from Jeff Bezos’s “API Manifesto” from 2002. Bezos focused on describing outcomes, not technology, to introduce a larger degree of decentralization into the software at Amazon. Yet almost every company aiming to implement a “data mesh” actually wants to implement some central technology stack first.
Yes, a lot of companies realise the process should start with a focus on “data products” and the role of a “data product owner”, but to me it could also simply start and stop (!) with a “Data Manifesto”.
The key idea behind that is that centrally designing something is actually pretty hard, because the interactions of the individual decentralized units usually contain a lot of “emergent behavior”, stuff that we do not foresee.
The fallacy here is the idea that you must build a platform at all, when in fact the true challenge is pushing responsibility to the producers of data.
Need for Coupling
In 2017, two years before publicizing the idea of the data mesh, ThoughtWorks produced a book called “Building Evolutionary Architectures”. In it, the authors describe two key concepts for building modern architectures:
- an appropriate degree of coupling
- a fitness function
Why? Because modern systems need to evolve, and to evolve they need first the freedom to evolve and second a target to evolve towards. My feeling is that most data mesh implementations today combine very tight coupling with a lack of fitness functions, which makes them ill-suited to evolve in response to any change at all.
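To make the idea of a fitness function a bit more concrete, here is a minimal sketch in Python. Everything in it is my own assumption for illustration: the `DataProduct` record, the `coupling_fitness` function, the 50% threshold, and the example team and dependency names are hypothetical and not taken from the book or from any real data mesh. It simply flags any dependency that more than a chosen fraction of data products share, which is one possible way to put a number on “an appropriate degree of coupling”.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical description of one data product and what it depends on."""
    name: str
    owner_team: str
    # Names of platform services or other data products this one relies on.
    depends_on: set[str] = field(default_factory=set)


def coupling_fitness(
    products: list[DataProduct], max_shared_fraction: float = 0.5
) -> dict[str, float]:
    """Return each dependency shared by more than `max_shared_fraction` of products.

    An empty result means the (assumed) coupling target is met.
    """
    if not products:
        return {}
    counts: dict[str, int] = {}
    for product in products:
        for dependency in product.depends_on:
            counts[dependency] = counts.get(dependency, 0) + 1
    total = len(products)
    return {
        dependency: count / total
        for dependency, count in counts.items()
        if count / total > max_shared_fraction
    }


# Hypothetical example: three of four products lean on the same warehouse.
products = [
    DataProduct("orders", "checkout", {"central_warehouse", "kafka"}),
    DataProduct("customers", "crm", {"central_warehouse"}),
    DataProduct("shipments", "logistics", {"central_warehouse", "s3"}),
    DataProduct("clicks", "web", {"kafka"}),
]
violations = coupling_fitness(products, max_shared_fraction=0.5)
print(violations)
```

Run regularly, a check like this gives the architecture a target to evolve towards. The threshold itself is a judgment call, and a real fitness function would likely measure more than shared dependencies.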
Some forms of coupling & stark dependencies you can see in current implementations are:
Coupled to the platform. This is the simplest form of dependency. As far as I can tell, the HelloFresh data mesh requires other teams to provide their data products through the data mesh. A good contrast to that is the data mesh built at Gloo.us. That data mesh is built around Kafka, yet it has a thin layer on top of it: a schema registry where every data product is registered, including the ones not provided as Kafka topics.
Coupled to a cloud provider. The age has finally arrived where people start to couple really deeply into a cloud provider. The JP Morgan Chase data mesh clearly couples itself deeply into AWS, using a bunch of deeply integrated AWS services like Lake Formation and the like.
Coupling inside the platform. Data developers are for some reason prone to building large monoliths. JP Morgan Chase’s data mesh is again a good example. To pick another one, the data mesh provided by ThoughtWorks, which features Google BigQuery and Google service accounts for access management, shows the same pattern. Here again, deeply integrated cloud-specific services are used, producing a large degree of coupling.
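A thin registry layer like the one described in the Gloo.us example could look roughly like the sketch below. This is not their implementation; the class names, fields, and example entries are all hypothetical. The point is only that the registry records where a data product lives and what it looks like, without forcing every product onto a single transport.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProductEntry:
    """Hypothetical registry record; the fields are assumptions for illustration."""
    name: str
    owner_team: str
    transport: str          # free text such as "kafka", "rest" or "file"
    location: str           # topic name, URL or path, depending on the transport
    schema: dict[str, str]  # field name -> type name


class DataProductRegistry:
    """In-memory registry that treats every transport the same way."""

    def __init__(self) -> None:
        self._entries: dict[str, DataProductEntry] = {}

    def register(self, entry: DataProductEntry) -> None:
        # Names must be unique so consumers can look products up unambiguously.
        if entry.name in self._entries:
            raise ValueError(f"data product {entry.name!r} is already registered")
        self._entries[entry.name] = entry

    def find(self, name: str) -> DataProductEntry:
        # Raises KeyError if the product was never registered.
        return self._entries[name]

    def by_transport(self, transport: str) -> list[str]:
        # Sorted names of all products offered over the given transport.
        return sorted(
            entry.name
            for entry in self._entries.values()
            if entry.transport == transport
        )


# Hypothetical usage: one Kafka topic and one plain file export, side by side.
registry = DataProductRegistry()
registry.register(
    DataProductEntry("orders", "checkout", "kafka", "orders.v1", {"order_id": "string"})
)
registry.register(
    DataProductEntry("invoices", "finance", "file", "/exports/invoices.csv", {"invoice_id": "string"})
)
print(registry.by_transport("file"))
```

The registry is the only shared piece here. Teams stay free to pick the transport that suits them, which keeps coupling to the platform low.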
Now coupling isn’t bad in itself. It’s useful as long as one can extract value from the coupling. That’s the case here, but it’s also the case that it will be incredibly difficult to change any single part. And if that is the case, the system as a whole will simply not be able to evolve to keep up with the changing environment, as explained by ThoughtWorks in “Building Evolutionary Architectures”.
As far as I can see, the fallacy is to think building a platform is easier with a larger degree of coupling.
Need to have a Data Mesh
In essence, decentralizing data to the producing units, just like introducing micro services, does exactly one thing…
it increases the flexibility (speed & agility) of the whole system at the cost of adding complexity.
It’s the reason we introduce micro services: to create more value quickly and adapt individual parts faster. And we don’t mind taking on the cost of a lot of extra complexity because we know that the individual micro services contribute to the business value we can deliver.
But is that also true for your data? If it is not, I personally don’t see how introducing a data mesh actually makes sense at a company. Thus, in my opinion, the only good reason to introduce a data mesh is that …
The company strategy relies on deriving value from data.
If it doesn’t, then the cost of adding complexity to the whole company is likely not worth the value you think you are deriving from the data mesh.
So the final fallacy is to think you need a data mesh because you have data troubles.
What to Do About It?
So, given that we know we have “good data” (our data strategy derives from a company strategy that derives value from data), and given that we have realized we need a larger degree of decentralization, we can do two things.
My first thought is to decentralize by, say, using a “Data Manifesto” or pinning KPIs to data, and not to impose any platform at all. Let the decentralized units iteratively build the mode of interaction they need; let the data mesh emerge. Then use product thinking to support the parts that cannot be handled by the teams.
If you really feel a central platform is necessary, I would recommend using the thinnest viable platform you can think of. That should definitely start out as a basic wiki page and not need the support of more than half an engineer. If you manage to get traction on that, you can iterate and slowly form a data mesh platform team. If not, I don’t think building anything is the solution to your problem.
Finally, if you realized that you’re currently not deriving much value from data, you should start there. Build the use cases first, increase the amount of data analysis, provide decision makers with more data, and focus on that to see whether there is actually something to gain here.