The three pillars of data democratization & self-service analytics. The data law as seen at Zynga, cheap & easy as seen at eBay, and the data infrastructure as seen at Facebook. (all images by the author)

For years now, Zynga, Facebook, and eBay have been democratizing their data, making it accessible and easy to use for every person in their companies. Data democratization is the foundation of self-service analytics, so let’s see how you can do this too.

All three companies are very open about their process. Even though they chose three different technical architectures, they all follow the same path to data democratization and self-service analytics, built on three main pillars.

The three pillars are “the data law”, “the data infrastructure” and “cheap data access”.

At the end of this article, you will have a blueprint for the three pillars, together with three checklists to guide your data democratization efforts.

But before we go into all the technical terms, let us go back 150 years, to Paris…

A Fable About Sewage

Eugene Belgrand walked along the Seine in Paris on a warm sunny day in 1870. Sunny days are lovely, but not in Paris in 1870. In 1870 the Seine was the “toilet” of all 1.8 million Parisians. It must have stunk horribly. But worse, cholera and typhoid had just started to sweep through the city.

Belgrand thought to himself there must be a better way.

So he, together with his contemporaries, Baron Haussmann and all the other smart Frenchmen, decided to tackle the problem of sewage.

First, they passed a law. It stated that it was no longer allowed to empty … into the Seine, as this was one of the main reasons for the spread of cholera and, well, the smell.

However, people just started to empty their home toilets onto the street, which made things even worse.

Belgrand and the others realized they couldn’t just force people to be sanitary. They needed to provide them with a proper way. So they built infrastructure, and a massive one at that. Belgrand built 600 km of sewers through Paris, effectively giving everyone a place to dispose of their waste.

But even that wasn’t enough. Change came slowly: people still had the “easy way out”, and only rich people had toilets connected to this massive sewer system. They realized that their final step would be to make this as easy and cheap as emptying onto the streets. So they lowered the prices of sewer access and of toilets, and started to educate people.

And so by 1914, 68% of homes were connected to the sewers, up from close to 0%. People weren’t dying from cholera and typhoid anymore and Paris flourished.

Fast forward 150 years… The three pillars Belgrand and the others discovered were:

  • The law: in modern terms, incentivize people to make the right choice and educate them about it.
  • The infrastructure: give everyone the possibility to make that choice.
  • The ease: make access as easy and cheap as possible.

Let’s see how today’s tech companies build those exact three pillars to support their data democratization and self-service analytics efforts.

Zynga and the data law

Zynga, founded in 2007, is the company behind FarmVille and lots of other very successful mobile games; in 2018 they had close to 1 billion USD in revenue, 15 million USD net income and close to 2,000 employees.

In their 12 years of company history, they built a data-driven culture that puts data at the core of every decision made at Zynga.

Zynga dedicates people to designing and analyzing experiments for every game. Tracking points have to be part of every new application they build. Zynga even developed a system called ZTrack to track how users interact with the games.

Through those tools, Zynga is able to shift the direction of entire games, as they did with FarmVille.

So how do people actually access the data to make their decisions at Zynga?

Zynga’s architecture features two important decisions:

  • They provide a dedicated “service” for ad-hoc SQL access. They did so by simply making a copy of all of their data that anyone in the company can access. The copy comes with a different set of data guarantees, which are not as strict as those for the operational data.
  • They provide at least two kinds of access: ad-hoc SQL access for everyone, and scheduled reporting and business intelligence tools.

Schematically, the access seems to look like the following picture.

End users have two paths to access data. One path is through some component, i.e. some reporting functionality that then accesses one of the databases. The second path is ad-hoc direct SQL access to another database.
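To make the two paths concrete, here is a minimal sketch in Python. It uses an in-memory SQLite database as a stand-in for whatever Zynga actually runs; the table, the column names, and the functions `adhoc_sql` and `actions_per_game_report` are all made up for illustration.

```python
import sqlite3

# Hypothetical operational store: strict guarantees, not open to everyone.
operational = sqlite3.connect(":memory:")
operational.execute("CREATE TABLE events (player_id INTEGER, game TEXT, action TEXT)")
operational.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "FarmVille", "plant"), (2, "FarmVille", "harvest"), (1, "FarmVille", "harvest")],
)
operational.commit()

# A full copy of the data that anyone in the company may query.
analytics_copy = sqlite3.connect(":memory:")
operational.backup(analytics_copy)  # copies the whole database into the target


def actions_per_game_report() -> list[tuple]:
    """Path one: a fixed report that a scheduler or BI tool would run."""
    return operational.execute(
        "SELECT game, COUNT(*) FROM events GROUP BY game ORDER BY game"
    ).fetchall()


def adhoc_sql(query: str) -> list[tuple]:
    """Path two: any SQL, run against the copy, never the operational store."""
    return analytics_copy.execute(query).fetchall()


print(actions_per_game_report())
print(adhoc_sql("SELECT COUNT(*) FROM events WHERE action = 'harvest'"))
```

The point of the copy is that a badly written ad-hoc query can only slow down the copy, not the systems the games depend on.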

Zynga and “the data law” condensed:

  • Data-based decisions are hammered into the values of the company.
  • Every product decision features a hypothesis, A/B tests and possible ways of measuring it.
  • No product is developed without implementing the tracking to see how to further develop that product.

Let’s turn to a much bigger company and see what they did differently.

Facebook and the data infrastructure

Facebook decided to go in a slightly different direction when they chose how to build their data warehouse. Facebook holds many times the data Zynga does, so it chose to build a huge Hadoop cluster and give everyone in the company access to it.

Facebook realized that access at that scale is not solved by simply granting access and saying “you have to use data”. So Facebook did what Belgrand did: they started to build infrastructure.

Today Facebook has a complete department filled with data infrastructure teams. Data engineers work in technical teams supporting the product decisions, and the whole department helps the company make better decisions.

But back in 2007–2009 Facebook hit a wall with their previous approach to storing data, so they decided to move to Hadoop. Before that, the data was an SQL query away, but now it was much more complicated.

Non-technical users couldn’t access the data, so Facebook decided to act.

Facebook developed Hive, giving users an SQL-like interface that is accessible even for non-technical users. Now, through that infrastructure, just like the sewage system, everyone can access all of the data with ease.

If you compare the access architecture, it’s slightly different from what Zynga does, but it provides similar benefits at Facebook’s scale. Schematically it could look like this.

Unlike the first model, the second model has one major interface to translate non-technical end-user questions to technical questions that can be run across all the data the company has.
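To see why an SQL-like interface lowers the barrier so much, compare the same question asked both ways. The sketch below is only an analogy: plain Python functions stand in for hand-written map and reduce jobs, and SQLite stands in for Hive. The `page_views` data and all names are made up.

```python
import sqlite3
from collections import defaultdict

# Hypothetical raw records: (user_name, page).
page_views = [("alice", "home"), ("bob", "home"), ("alice", "profile")]


# Without a query layer: someone has to write the map and reduce steps by hand.
def map_step(record):
    _user_name, page = record
    return (page, 1)


def reduce_step(pairs):
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


views_by_hand = reduce_step(map_step(record) for record in page_views)

# With an SQL-like layer: the same question is one declarative statement.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE page_views (user_name TEXT, page TEXT)")
db.executemany("INSERT INTO page_views VALUES (?, ?)", page_views)
views_by_sql = dict(db.execute("SELECT page, COUNT(*) FROM page_views GROUP BY page"))

print(views_by_hand == views_by_sql)
```

Both paths give the same counts, but only the second one can be written by someone who knows SQL and nothing about the cluster underneath.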

Facebook and “the data infrastructure” condensed:

  • Accessing data should be as easy as asking a question. That’s roughly what SQL provides.
  • Non-technical people should be able to access most or all of the data. The infrastructure at Facebook gives access to all of the petabytes of data, not just a small part. Just like the 600 km of sewers in Paris.

The most important step Belgrand and his friends took, the one that drove sewer adoption from close to 0% to 68% within a few decades, was to make it cheap and easy. The final company we look at did exactly that.

eBay and cheap data access

eBay has spent years making it cheap and easy for everyone to access data, going through a series of steps along the way.

The architecture eBay chose is again a different one. eBay collects data centrally and then provides each team with a “bucket or cube” containing just their data. This approach is great from a data minimization perspective (“Datensparsamkeit”), and makes it possible to put the “transformation of data close to the end-user”.

The third model, unlike the first two, recognizes that data needs may differ quite a lot between groups of end-users. So end-users are provided with their own little data “mart”, which they can access directly or through any tool they choose to use in their group.
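One simple way to picture a per-team mart is a view on top of the central store that exposes only that team’s slice. The sketch below uses SQLite views. The `facts` table, the team names, and `create_mart` are assumptions for illustration, not eBay’s actual setup.

```python
import sqlite3

# Hypothetical central store holding data for all teams.
central = sqlite3.connect(":memory:")
central.execute("CREATE TABLE facts (team TEXT, metric TEXT, value REAL)")
central.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [
        ("payments", "checkout_rate", 0.62),
        ("search", "click_through", 0.31),
        ("search", "zero_results", 0.04),
    ],
)


def create_mart(team: str) -> str:
    """Create a view with only one team's rows and return the view name."""
    # SQLite views cannot take bound parameters, so the name is validated
    # before it is placed into the statement.
    if not team.isidentifier():
        raise ValueError(f"unsafe team name: {team!r}")
    view = f"mart_{team}"
    central.execute(
        f"CREATE VIEW {view} AS SELECT metric, value FROM facts WHERE team = '{team}'"
    )
    return view


search_mart = create_mart("search")
rows = central.execute(f"SELECT metric FROM {search_mart} ORDER BY metric").fetchall()
print(rows)
```

Each team sees only the data it needs, and any transformation specific to that team can live in its own mart rather than in the central store.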

To get people to use the data, they need to know where it is. eBay simply started out with a wiki explaining what is where, combined with a metadata repository.

They then added pictures, showed how data is connected, tagged data, and turned the whole system into a product called the “DataHub”: one central place to find data.

But they did not stop there. In 2015, their then state-of-the-art data democratization platform featured:

  • Alation as metadata directory,
  • discussion boards, together with the links to the teams who own the data & know about it,
  • an SQL assistant combined with the metadata repository,

making it a seamless experience to check out data. eBay provides a Slideshare with the whole journey.

eBay’s approach to cheap and easy data access condensed:

Turns out, what eBay follows can be understood as the “DATSIS list” of attributes used in the data mesh approach. Applied to eBay they are:

  • Make the data discoverable for everyone; eBay did so by providing a central tool like the “DataHub” or Alation.
  • Make the data addressable; eBay did so by providing the connection information in one central place with its metadata repository.
  • Make the data trustworthy; eBay did so by providing names of people who know things about it, SLAs for the data, and more.
  • Make it self-describing; eBay did so by providing proper metadata.
  • Make it inter-operable; eBay did so by providing SQL assistance and basing everything on SQL.
  • Make it secure.
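The checklist above can be turned into something you can run against your own catalog. Here is a rough sketch of a catalog entry with one field per DATSIS attribute, plus a helper that lists what is still missing. The class `DatasetEntry`, its field names, and the sample datasets are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """One hypothetical catalog entry; every field name is an assumption."""
    name: str
    tags: list[str] = field(default_factory=list)           # discoverable
    connection: str = ""                                     # addressable
    owner: str = ""                                          # trustworthy
    description: str = ""                                    # self-describing
    sql_accessible: bool = False                             # inter-operable
    allowed_roles: list[str] = field(default_factory=list)   # secure


def datsis_gaps(entry: DatasetEntry) -> list[str]:
    """Return the DATSIS attributes this entry does not satisfy yet."""
    checks = {
        "discoverable": bool(entry.tags),
        "addressable": bool(entry.connection),
        "trustworthy": bool(entry.owner),
        "self-describing": bool(entry.description),
        "inter-operable": entry.sql_accessible,
        "secure": bool(entry.allowed_roles),
    }
    return [attribute for attribute, ok in checks.items() if not ok]


def find_by_tag(catalog: list[DatasetEntry], tag: str) -> list[str]:
    """Discovery in its simplest form: look datasets up by tag."""
    return [entry.name for entry in catalog if tag in entry.tags]


catalog = [
    DatasetEntry(
        "listings", tags=["marketplace"], connection="warehouse.listings",
        owner="listings-team", description="One row per listing",
        sql_accessible=True, allowed_roles=["analyst"],
    ),
    DatasetEntry("raw_clicks", tags=["marketplace", "web"]),
]

for entry in catalog:
    print(entry.name, datsis_gaps(entry))
```

A real catalog would check more than “is the field filled in”, but even this crude version tells you which datasets nobody can yet find, reach, or trust.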

Great, we now have three pillars and one checklist per pillar to work through. What now?

Where do you start?

You start where your bottleneck is.

The pillars are a good image of what is happening. If you have no roof yet, you can start anywhere. You can provide three short pieces of wood to hold a light tin roof.

But if you want to support a full-fledged stone roof, you’ll have to work on all three pillars again and again, making each of them solid.

All three pillars can be a bottleneck in your process of letting the company make quick data-based decisions.

If no one knows how to make data-based decisions and write SQL, you’ll need to work on the “data law”, educating people about it, building tracking into the products and so on. That’s why Zynga puts so much emphasis on it, and why Spotify has whole teams that work on data-based decisions for specific products.

If people simply cannot access the data in a consistent way, no one gets the data to make their data-based decisions. That does not yet seem to be the case at Zynga, where multiple databases are in use, so that’s not their bottleneck. But for companies like Netflix, this seems to be the case. Netflix developed a tool called Metacat to solve this problem and provide one unified infrastructure.

Finally, if it’s not easy enough to access the data, people will not get the data in time, and thus cannot make quick data-based decisions.

So you will have to watch all three and go from bottleneck to bottleneck to make decisions at your company better and faster than ever.

Do I have to give them direct ad-hoc SQL access?

No! Even though Zynga, Facebook, and eBay apparently do so, it really is a question of how your specific organization works. The bottlenecks you have might be solved by providing one central data lake, providing one central business intelligence tool, and educating people on the use of that.

But the three pillars above stand. The detailed plan for each of them fits the most centralized architecture you can think of just as well as the most decentralized one you can imagine.

