Organizing Data Teams — Where to Make The Cut

Office Hours

There are four ways to decentralize and structure data teams. Learn how to choose the right one.

(The four typical data team organization forms. Image by the author.)

Introducing Data Organizations

Data organizations within companies look like snowflakes. From up close, they are all unique, but if you step back, they all look kind of alike. They all deal with data and are usually organized around some data or analytics department.

That makes organizational changes hard, because it is really hard to see the overarching picture. I’d like to propose a simple viewpoint that might make this easier.

I think these snowflakes come in four buckets, and really only one feature distinguishes them: where in your data flow you make the cut and go from one central unit working on the data to multiple decentralized ones, embedded into other units.

Let’s highlight that using a bunch of examples from the great survey article Fishtown Analytics provides, as well as some additional cases.


The Centralized One-Man Show

Companies usually start out with the one-man show: one or two data scientists in a data team that handles “everything around data”. That usually includes…

  1. “Ingesting” one or more data sources from around the company. At Airbnb, the first data scientist (employee no. 10) probably ingested the Airbnb bookings in a long table of dates and amounts.
  2. “Transforming” atomic data into usable information. E.g. aggregating amounts by date, adding some useful information like the booking region, still in a long table in some database.
  3. “Reporting” the information to let other people get “insights” from this information. E.g. by getting the database data into a pretty form which can be filtered by date in a graphical user interface.

These three tasks each grow into whole disciplines, which might be called data engineering, analytics engineering, and business analytics respectively. All three are chain-linked: the data flows from (1) to (2) to (3). So once we decentralize (2), we also have to decentralize (3).

(A possible workflow: data-emitting teams with AWS S3, databases, and REST APIs; ingestion through Apache Airflow into AWS Redshift; transformation via dbt; and reporting in Looker. Image by the author.)
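To make the chain-linking concrete, here is a toy sketch of the three steps in plain Python. The booking records and the aggregation logic are invented for illustration; a real setup would use tools like Airflow, dbt, and Looker instead:

```python
from collections import defaultdict
from datetime import date

# (1) "Ingesting": raw bookings as they might arrive from a source system.
# These records are made up for illustration.
raw_bookings = [
    {"date": date(2021, 3, 1), "region": "EU", "amount": 120.0},
    {"date": date(2021, 3, 1), "region": "US", "amount": 80.0},
    {"date": date(2021, 3, 2), "region": "EU", "amount": 95.0},
]

# (2) "Transforming": aggregate atomic records into usable information,
# here the total booking amount per (date, region).
def transform(bookings):
    totals = defaultdict(float)
    for b in bookings:
        totals[(b["date"], b["region"])] += b["amount"]
    return totals

# (3) "Reporting": render the information for decision-makers.
def report(totals):
    lines = [f"{d.isoformat()}  {region}  {amount:>8.2f}"
             for (d, region), amount in sorted(totals.items())]
    return "\n".join(lines)

print(report(transform(raw_bookings)))
```

Note how step (3) can only consume what step (2) produced, and step (2) only what step (1) landed; that is the chain-linking that constrains where you can make the cut.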

At a young start-up with just one or two data scientists, all these tasks are centralized in a single team. That is how data teams usually start out. It is, for instance, how the company M.M. LaFleur, a wardrobe solution for professional women, organizes its data team:

“We are a small but mighty team of two,” Kailin says. “We use Stitch to pipe raw data into our warehouse, and dbt to maintain our data models and ensure we have clean and updated data to work off of.” One analyst focuses primarily on supporting the marketing team and the other focuses on supporting sales and inventory. Both analysts report to the Chief Product Officer. (Taken from the Fishtown Analytics survey article.)

But when the business value of the data grows along with the company, the question becomes: where to go next? Stick to the central form and have multiple teams? Or start making a cut somewhere?

The Four Data Team Organization Forms

I don’t think there is one right answer. Instead, all four of the following options have their place, their strengths, and their weaknesses:

  1. Keep everything centralized into a larger “analytics” or “data” department.
  2. Decentralize the reporting. Put business analysts into distributed units like marketing, sales, and so on.
  3. Decentralize both reporting and analytics engineering. In essence, let each distributed unit run its own “marts”, its own little part of a database, and create its own information.
  4. Decentralize everything, data & analytics engineering as well as reporting. In essence, either let every function have its own analytics team, or work with the concept of a data mesh and let the data-emitting teams handle some of the data engineering.

The last three options will usually be supplemented with an infrastructure-as-a-service team supporting the decentralized actors.
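What options (3) and (4) mean in practice: the raw data still lands in one shared warehouse, but each unit builds and owns its own derived “mart” tables on top of it. A minimal sketch, with in-memory SQLite standing in for a warehouse like Redshift or Snowflake; all table names and numbers are invented:

```python
import sqlite3

# A shared warehouse, populated by a central ingestion team.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (day TEXT, channel TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("2021-03-01", "ads", 100.0),
     ("2021-03-01", "email", 40.0),
     ("2021-03-02", "ads", 60.0)],
)

# A decentralized unit (say, marketing) owns its own mart: derived tables
# it defines and maintains itself on top of the shared raw data.
conn.execute("""
    CREATE TABLE marketing_mart AS
    SELECT day, SUM(amount) AS spend_attributed
    FROM raw_orders
    WHERE channel IN ('ads', 'email')
    GROUP BY day
""")

rows = conn.execute(
    "SELECT day, spend_attributed FROM marketing_mart ORDER BY day"
).fetchall()
print(rows)
```

Sales or inventory would build their own mart tables the same way, without coordinating with marketing; only the raw layer underneath stays shared.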

(The table summarizes some strengths & challenges visible in the case studies. Of course, a complete treatment would put the decentralization & centralization challenges into focus. For more on that, I suggest my article on decentralization principles.)

(1) Strong Central Analytics Functions

Trevor Schulze joined the semiconductor giant Micron Technology in 2015 as CIO and led a small revolution of its data teams. Before his arrival, data science was mostly decentralized and only used sporadically, leading to high cycle times, enormous efforts in data preparation, and inefficient allocation of data science resources throughout the company, as Schulze describes in an interview with Forbes.

Schulze decided to create a central enterprise data science team in close proximity to the central business intelligence unit, speeding up the time to delivery for forecasting and prediction solutions. They then used the powerful central approach to roll out data science throughout the company.

Micron seems to be a good case of a company that was stuck in decentralized micro-improvements when a large centralized macro-improvement was necessary within the data department.

(2) Decentralized Business Analysts

The MOOC company Udemy and the company Prezi chose a different route. In an interview with Chart.io, both companies describe their current approach and where they are headed. Both currently employ, or aim to employ, a model with a central data provisioning team as well as distributed analysts.

They realized two truths. First: since the tasks required to produce good data-based decisions are chain-linked, adding more analysts does not improve anything unless the steps before are also supplied with more resources. Hence, in their case, the need for central data provisioning.

And second: the task of data analysis and decision-making sometimes requires a huge amount of domain knowledge, which central teams are ill-suited to handle because of their bottleneck position. Hence, in their case, pushing business analytics into the functional departments like marketing and sales.

But this structure also carries a weakness, one which led the company HubSpot to change its organizational form again, to an even more distributed version.

(3) Decentralized Analytics Engineers

The company HubSpot, a SaaS company with 680 million USD in revenue in 2019 and close to 3,000 employees, decided on a slightly different approach. In the past, it kept the two tasks of “ingesting data” and “transforming data” closely together in one data function. In addition, analysts were, and still are, distributed across the company in different functions.

However, HubSpot is in the process of pushing the “transformation of data” more and more to these analysts by using Snowflake and dbt.

“This is about education. We want to continue building an internal community around Snowflake and dbt to empower our analysts to get the most out of what these tools can do together.” (Taken from the case study getdbt provides.)

The following quote describes well why HubSpot is turning in this direction:

“At any data organization in any company, you typically have a lot of analysts and fewer technical resources. This always creates a blocker to productivity. Whenever an analyst needs a new column or data grain, they have to go to a data engineer to get it,” James said. “We’re an organization of over 3,500 people. If we need to hire a data engineer for every 2–3 analysts, that’s just not going to be cost effective. It doesn’t scale.” (Taken from the case study getdbt provides. What HubSpot describes here as a data engineer includes both what I call data engineering and what I call analytics engineering; their aim is to separate off the analytics engineering part.)

These data analysts own their dashboards, build new ones in SQL, communicate their findings, and bridge the gap between the technical and non-technical sides. At least according to their job descriptions.
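The bottleneck James describes shrinks once the transformation layer is in the analysts’ hands: adding “a new column or data grain” becomes a self-service SQL change (in a dbt setup, a new model file) instead of a ticket to data engineering. A hypothetical sketch, again with SQLite standing in for the warehouse and invented table and column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Raw data, landed by the (still central) data engineering team.
conn.execute(
    "CREATE TABLE sessions (user_id INTEGER, started_at TEXT, pages INTEGER)"
)
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?, ?)",
    [(1, "2021-03-01", 3), (1, "2021-03-02", 7), (2, "2021-03-01", 1)],
)

# The analyst needs a new derived column ("is_engaged") at a new grain
# (per user). With decentralized analytics engineering, that is one new
# view (or dbt model) the analyst writes and owns, with no hand-off.
conn.execute("""
    CREATE VIEW user_engagement AS
    SELECT user_id,
           SUM(pages) AS total_pages,
           MAX(pages) >= 5 AS is_engaged  -- the new column, self-service
    FROM sessions
    GROUP BY user_id
""")

rows = conn.execute(
    "SELECT user_id, total_pages, is_engaged"
    " FROM user_engagement ORDER BY user_id"
).fetchall()
print(rows)
```

The raw `sessions` table stays the central team’s responsibility; everything derived from it moves into the analysts’ domain.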

But keeping data engineering central can still lead to slow cycle times, because it still produces coupling. If the other parts (analytics engineering, the analysts, and the decision-makers on the other side) are moving much faster, an even more decentralized model becomes necessary.

(4) The Decentralized Everything Mart

The company Spotify pushes the decentralized model even further in some parts, building up mini-teams that own the process described above end to end: the so-called “product insights teams”.

In particular, product insights teams contain data engineers, analysts, and data scientists as well as UX designers, helping teams both to do deep ad-hoc analyses before launching a new feature and to run extensive A/B tests while launching, without relying on central data teams to engineer new data.

Product insights teams are each paired in one “unit” with the product team building a particular product. So really, the whole ownership lies inside the unit.

The benefit, in Spotify’s words, is gained speed in research; or, to put it into context, the ability to make evidence- and data-based product decisions much faster.

Another pattern for decentralizing almost everything in the data chain is the data mesh, which pushes some of the data engineering work into the tech teams and combines that with a strong platform team to connect the various data sources. It puts more stress on the analytics engineering side but speeds up the whole chain considerably in larger company contexts.

Thanks to Tristan Handy for some helpful thoughts!

