Hi, I’m Sven Balnojan. I think data will power every piece of our existence in the near future. I collect “Data Points” to help understand this near future.
If you want to support this, please share it on Twitter, LinkedIn, or Facebook.
Here are your weekly three data points: Simple Data Discovery, Modern Data Architectures & Data Meshes.
1 Whale, a Dead Simple Data Discovery Tool
“What does this column mean? Where can I find the order data?” Questions like these bug every data engineer, every machine learning engineer, and every analyst, every single day.
There are a lot of powerhouse tools that solve these problems, like DataHub, Alation, etc. But none of them lets a company start small — or lets a small company start at all.
But in my opinion, the point of having a data discovery tool right off the bat is to incentivize people to provide metadata! It’s about building a data-conscious culture from the very beginning.
So with that in mind, I love the idea of a data discovery tool that is as simple as Whale. It basically produces markdown from the metadata of Postgres, Redshift, etc., and pushes it into a speedy CLI or into a git repository you can search through using your git web interface.
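To make the idea concrete, here is a minimal sketch of what a “metadata to markdown” tool does: read a database’s catalog and render one markdown stub per table. SQLite stands in for Postgres/Redshift so the sketch is self-contained; the function name and file layout are my own illustration, not Whale’s actual internals.

```python
# Minimal sketch: render a database catalog as markdown, one file per table.
# SQLite stands in for Postgres/Redshift; this is NOT Whale's implementation.
import sqlite3
from pathlib import Path

def dump_markdown_catalog(conn: sqlite3.Connection, out_dir: Path) -> list[Path]:
    """Write a markdown stub (table name + column table) for every table."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, default, pk)
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        lines = [f"# {table}", "", "| column | type |", "| --- | --- |"]
        lines += [f"| {c[1]} | {c[2]} |" for c in cols]
        path = out_dir / f"{table}.md"
        path.write_text("\n".join(lines) + "\n")
        written.append(path)
    return written
```

Point the result at a git repository and your git web interface’s search becomes the discovery UI — that is essentially the whole trick.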
2 Modern Data Architectures
I really like this article because it gets into the details of actually building up a modern data architecture. It makes the distinction between operational and analytical data very clear. However, I also think two things are not that clear. First, in my mind there are four types of modern enterprise data architectures (not just the two/three depicted):
- The analytical database at the center (Redshift, Snowflake,…)
- The data lake at the center (AWS S3, …)
- The full-blown streaming architecture with every team filling streams with analytical data
- The full-blown data mesh (see below)
Second, the way the authors depict machine learning/AI workflows also caught my attention.
I have a very simple standpoint: machine learning is engineering. It’s not about “analytical data”; it’s all about “operational data”, in the sense that it’s about developing a feature. As such, I don’t agree at all with the architecture depicted. Rather, I find it most useful to keep machine learning engineers right in the full-stack feature team, or at least in the same product unit.
Once you do that, you won’t try to funnel data for machine learning through central data lakes or warehouses, but rather build APIs that work just the same way all other operational APIs work. You won’t invest in large experimental AI teams, but rather in incremental improvements that yield results at every step, which may or may not include AI. And I much prefer that over the other setup.
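What “an API that works like any other operational API” means can be sketched in a few lines: the model sits behind an ordinary HTTP endpoint owned by the feature team. The scoring rule below is a hypothetical stand-in for a trained model, and the endpoint name is my own choice — a sketch of the pattern, not a production recipe.

```python
# Minimal sketch: a "model" served like any other operational API.
# The scoring rule is a hypothetical stand-in for a real trained model.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

def score(features: dict) -> float:
    """Hypothetical model: a trivial linear rule in place of a trained one."""
    return 0.5 * features.get("orders", 0) + 0.1 * features.get("visits", 0)

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        features = json.loads(self.rfile.read(length))
        body = json.dumps({"score": score(features)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep server output quiet
        pass

def serve_in_background(port: int = 0) -> HTTPServer:
    """Start the prediction endpoint on a background thread."""
    server = HTTPServer(("127.0.0.1", port), PredictHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The point is that consumers call this exactly like the team’s other operational endpoints — no detour through a central lake or warehouse to get a prediction.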
3 Data Meshes
The enterprise data architectures named above come with a dozen failure modes, which data engineers experience every day. Data meshes are simply a step further in decentralization. Just as microservices & domain-driven design decentralized tech components, the “data mesh principle” decentralizes data ownership.
Zhamak Dehghani has now written two carefully drafted articles on the topic, both of which I absolutely love and which everyone should check out.
Even if you are not considering a move to a data mesh, the principles beneath it — like treating data as a real product, and being fully consumer-oriented as a tech team, even toward the consumers of your data — are essential no matter what kind of architecture you run.
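“Treating data as a real product” can be made concrete as an explicit contract that every published dataset carries: who owns it, what shape it has, and what consumers may rely on. All the field names below are illustrative, not from any standard or from Dehghani’s articles.

```python
# Minimal sketch of "data as a product": a published dataset ships with
# an explicit contract (owner, schema, freshness SLA). Names are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class DataProduct:
    name: str                    # e.g. "orders.cleaned"
    owner_team: str              # the domain team accountable for the data
    schema: dict                 # column name -> declared type
    freshness_sla_hours: int     # how stale consumers must tolerate, at most

    def validate_row(self, row: dict) -> bool:
        """Consumer-facing guarantee: rows carry exactly the declared columns."""
        return set(row) == set(self.schema)

# A hypothetical data product published by a checkout feature team.
orders = DataProduct(
    name="orders.cleaned",
    owner_team="checkout",
    schema={"order_id": "str", "amount_eur": "float"},
    freshness_sla_hours=24,
)
```

The design point: the contract lives with the data and is owned by the producing domain team, so consumers negotiate with that team — not with a central data department.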
If you’re really interested in the topic, I also provide a follow-up article on data meshes, which is a bit more practical, as a supplement to the two above.
P.S.: I share things that matter, not just the most recent ones. I share books, research papers, and tools, and I try to provide a simple way of understanding all of them. I tend to be opinionated — but you can always hit the unsubscribe button!