How to avoid the merge hell, speed up delivery of business value, reduce defects, and live happily ever after in your data warehouse.
“We needed an extra day to merge the transformation branches together”, “Ah yeah, but there was a bug once we finally got the data to production, so we had to redo some stuff for another 2 days”… Sound familiar? To me, it seems like data and analytics engineers are particularly prone to running into the “merge hell” or the “defect in production” scenario.
But there is a good software engineering practice that can resolve these problems altogether! It’s called “trunk-based development” (TBD). But for some reason, whenever I talk to data people, they think TBD is not applicable to data pipelines, reports, cubes and all that data stuff.
In this article, I’ll try to explain the basics and hopefully show with two examples that TBD is not only applicable but actually makes life as a data guy fun!
What is Trunk Based Development?
“A source-control branching model, where developers collaborate on code in a single branch called ‘trunk’, resist any pressure to create other long-lived development branches by employing documented techniques. They therefore avoid merge hell, do not break the build, and live happily ever after.” (Paul Hammant, https://trunkbaseddevelopment.com/)
Even though there are three main variations of TBD, I like to reduce that complexity for now and simply imagine: TBD means you always commit your work straight to the trunk!
Think about that: since you probably have a CI system set up and commit 1–10 times a day, every one of those commits will trigger the CI system, run all tests, and might even deploy whatever you do straight into production, depending on how your system is set up and how far along you are in the CI & CD workflow.
That is a scary thought for most data guys who are used to creating a feature branch, working in it for a couple of days, and then merging it with other people’s work.
And besides, you don’t create a feature branch for no reason, right? You create it so your work doesn’t mess with either production or the work of the other people who might deal with the same tables and pipelines. You create it so you won’t push “crap” to production.
That sounds like a huge mess when working on the trunk all the time…
How Does This Transfer To Data Pipelines, ETL, Reports,…?
Trunk-based development, as explained by Paul Hammant, relies on two important techniques that make committing straight to the trunk workable in practice:
- Feature Flags (or Feature Switches)
- Branching by Abstraction
If you employ both of these techniques together, your data pipeline work will soon become fun, without any merge conflicts whatsoever. I’ll share two examples to explain the two techniques which will also show why all the above worries are for nothing.
But first, let me recall three important rules to work with:
- Commit in very small chunks! Ideally a bunch of times each day.
- Run your “build” locally first, only check-in “locally working code”.
- Never break the “build”. If you do, use all resources to fix it, before continuing.
Now let’s explore the examples.
Example 1, Feature Flags
Feature flags are there to deliver new functionality quickly, but safely. Like this:
def main(use_old_feature=True):
    if use_old_feature:
        run_old_code()  # existing behaviour stays the default
    else:
        new_main()      # flag toggled: run the new_main function instead
How do feature flags help with trunk-based development? A huge problem is that if we commit to the trunk all the time, things get deployed and might break the production system.
So why not create the good old feature branch? Because then, neither our nor the other team members’ code will be continuously integrated with each other. So, to our rescue comes the feature toggle, just as shown above.
How do I use a feature flag to still integrate the code? Easy: I run my tests locally and on the CI with “use_old_feature=False”, but use “use_old_feature=True” for deployment to production via config files for each environment.
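Here is a minimal sketch of that setup. The `ENVIRONMENT_CONFIG` dictionary is an assumption standing in for the per-environment config files, and `run_old_code`, `new_main`, and `run_for` are hypothetical placeholders rather than part of any real pipeline:

```python
# Hypothetical per-environment settings; in a real project these would
# live in one config file per environment.
ENVIRONMENT_CONFIG = {
    "local": {"use_old_feature": False},      # developers exercise the new path
    "ci": {"use_old_feature": False},         # CI tests the new path on every commit
    "production": {"use_old_feature": True},  # end users keep the old behaviour
}


def run_old_code() -> str:
    """Placeholder for the existing, released behaviour."""
    return "old"


def new_main() -> str:
    """Placeholder for the feature that is still under construction."""
    return "new"


def main(use_old_feature: bool = True) -> str:
    """Run the old code unless the flag is switched off."""
    return run_old_code() if use_old_feature else new_main()


def run_for(environment: str) -> str:
    """Look up the flag for an environment and run the matching code path."""
    return main(**ENVIRONMENT_CONFIG[environment])
```

The new code lives on the trunk and gets tested on every commit, while production keeps running the old path until someone flips a single config value.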
If we’re working on a new report in Tableau, which involves
- Ingesting some new data via some Python script
- Transforming data via dbt
- Displaying the data in Tableau
We could use feature toggles at runtime, by using a permission system to show the report just to us, as well as switches at build time like the ones above, or directly in the transformation phase via dbt’s exclude feature (https://docs.getdbt.com/reference/node-selection/exclude/).
A possible config file then could look like this:
- Feature1 #from Tim ITD-233
- Feature2 #from Eve ITHD-23
- Feature1 #from Tim ITD-233
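One way to wire such a list into dbt could look like the sketch below. It rests on a few assumptions: the feature names are treated as dbt model names, and the `EXCLUDED_FEATURES` mapping and the `dbt_run_command` helper are hypothetical. The only dbt functionality it relies on is the `--exclude` argument of `dbt run`.

```python
from typing import List

# Hypothetical mapping: which unfinished features (assumed here to be dbt
# model names) are switched off in which environment.
EXCLUDED_FEATURES = {
    "production": ["Feature1", "Feature2"],
    "ci": [],  # CI builds everything so all code keeps being integrated
}


def dbt_run_command(environment: str) -> List[str]:
    """Build the argument list for `dbt run`, excluding switched-off models."""
    command = ["dbt", "run"]
    excluded = EXCLUDED_FEATURES.get(environment, [])
    if excluded:
        command += ["--exclude", *excluded]
    return command
```

The returned list can be handed to something like `subprocess.run`, so turning a feature on in production is a one-line change to the mapping.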
Example 1, Continued as Canary Release & Cutting Tasks Vertically
One great benefit of feature toggles implemented via e.g. permission systems is that we can use them as canary release mechanisms. That means we can show rough versions (like the first and second rough drafts) to end-users to play around with, and thus get very early feedback, even on test data.
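A permission-based toggle can be as small as the following sketch. The report names, the group names, and the `can_view` helper are all made up for illustration; a real BI tool would bring its own permission model.

```python
from typing import Set

# Hypothetical permission table: report name -> groups allowed to see it.
REPORT_AUDIENCE = {
    "new_customer_report": {"data_team", "canary_users"},  # rough draft, limited audience
    "sales_overview": {"everyone"},                        # fully released report
}


def can_view(report: str, user_groups: Set[str]) -> bool:
    """Return True if any of the user's groups may see the report."""
    allowed = REPORT_AUDIENCE.get(report, set())
    return "everyone" in allowed or bool(allowed & user_groups)
```

Releasing the report to all users then means replacing its audience with `{"everyone"}`, with no code change in the pipeline itself.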
Canary releases in my opinion work best if we try to “cut our tasks vertically”. That means instead of cutting the above task into:
- Do the ingestion
- Do the transformation
- Finally, do the visual stuff
We cut our work:
- Produce test data, quickly run a dummy transformation, put up the first report, and get feedback (from automated tests as well as people)
- Run a second iteration with actual ingested data, and some transformations
- Finalize things.
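To make the first vertical slice concrete, here is a rough sketch of what iteration one could look like end to end. Every function and column name in it is a hypothetical stand-in: hand-written rows replace ingestion, a trivial sum replaces the dbt transformation, and a text table replaces the Tableau report.

```python
from typing import Dict, List

Row = Dict[str, object]


def produce_test_data() -> List[Row]:
    """Iteration 1 stand-in for ingestion: hand-written rows, no real source yet."""
    return [
        {"customer": "a", "amount": 10},
        {"customer": "b", "amount": 5},
        {"customer": "a", "amount": 7},
    ]


def dummy_transformation(rows: List[Row]) -> Dict[str, int]:
    """Deliberately simple transformation: total amount per customer."""
    totals: Dict[str, int] = {}
    for row in rows:
        customer = str(row["customer"])
        totals[customer] = totals.get(customer, 0) + int(row["amount"])
    return totals


def first_report(totals: Dict[str, int]) -> str:
    """Plain-text stand-in for the first report shown to users."""
    return "\n".join(f"{name}: {total}" for name, total in sorted(totals.items()))
```

Each later iteration swaps out one stand-in for the real thing while the other layers keep working, so there is always something to test and to show.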
In this process, we get three rounds of feedback instead of one and get to the end-user much faster. We have much higher test coverage and get to integrate our whole codebase all the time.
Example 1, Continued With A Warning
Martin Fowler apparently has a very similar opinion about feature flags. He says “they are the last thing you should do to hide a released feature”. He instead thinks you should cut as above (ingestion, transformation, and finally visual stuff) but simply not turn on the end-user side before it’s done.
Both ways work, and feature flags work. You now have three different tools in hand to work trunk-based! Let’s look at the final tool, which is just as important.
Example 2, Branching by Abstraction on the Data Pipeline
Ok, so that’s how you push stuff to production and test it in production, without messing things up for the end-user. But what if you want to work on a part others also work on?
In data teams that seems to happen often and is usually solved by “you branch, I branch, then we merge”.
A solution is to “branch by abstraction” which means you create an abstraction at the point where your work collides and thus can work on your own.
Let’s take a look at an example.
Say you’re working on a table called “dim_customers” and your teammates are also working on that table, on different columns. You have a transformation job that takes “customer registration data” and “researched data”, does some things, and finishes the transformation. You want to add a few columns for the research data, while the others want to mess with some other columns. If you do not branch, someone always has to pull the changes from the trunk, and in essence you end up in a merge hell as well.
So what do you do about it? You create an abstraction.
Now you simply have to follow the branching by abstraction process:
- Abstract what you want to change => add a second transformation “Research Transformer” and e.g. a “select *” to include it in the “Transformer”; run the tests to make sure your refactoring worked out as intended.
- Write the second implementation of this abstraction, your new one (Now your teammates can do whatever they want, release & commit to the trunk as much as they want!).
- Test your stuff with this implementation (by using a feature switch to turn it on for the tests!), maybe even push it into production with the feature switch on for admins.
- Do a tiny commit to switch to the new transformation.
- Refactor (if needed) to include the transformation!
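The steps above can be sketched in Python as follows. This is only an illustration under assumptions: the column names, the two `research_transformer_*` functions, and the `use_new_research` switch are hypothetical, and in a dbt project the same idea would be expressed as separate models instead of functions.

```python
from typing import Dict, List

Row = Dict[str, str]


def research_transformer_v1(row: Row) -> Row:
    """Existing behaviour, extracted behind the abstraction (step 1)."""
    return {"research_segment": row["researched_segment"]}


def research_transformer_v2(row: Row) -> Row:
    """Second implementation that adds a new research column (step 2)."""
    columns = research_transformer_v1(row)
    columns["research_source"] = row.get("researched_source", "unknown")
    return columns


def transformer(rows: List[Row], use_new_research: bool = False) -> List[Row]:
    """Build dim_customers rows; the switch picks the research implementation."""
    research = research_transformer_v2 if use_new_research else research_transformer_v1
    return [{"customer_id": row["customer_id"], **research(row)} for row in rows]
```

Teammates can keep changing the other columns inside `transformer` while the new research logic matures on the trunk. The tiny commit in step 4 is then just flipping the default of `use_new_research`.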
Benefits of TBD For Data Teams
Still don’t see the benefits? TBD is great in itself, and it allows you to actually do both CI & CD. Constantly committing to the trunk means that what you work on never drifts far from what the codebase will look like a day later, when your colleague has merged something. By reducing this gap you:
- Break far fewer things unexpectedly
- Spend much less time merging things
- See duplicate code much faster and thus can reuse what others work on
- Spot incompatibilities early on and thus speed up development
By continuously integrating your code base with the existing one and continuously delivering it to your environments, you get a lot of feedback by interacting with production data, testing frameworks, etc.
How Do I Choose The Style?
If you’re still not convinced that trunk-based development is great for data teams, it is probably because you see a bunch of roadblocks. But wait, most of them can be removed pretty easily.
Two things usually get in the way of trunk-based development. The first is the code review, which a lot of teams do as a substitute for or an addition to pair programming. Teams might argue that without a feature branch you have no way of doing a code review. Thanks to modern code systems, that is no longer the case. See the explanation & best practices listed here.
The second thought which might come to your mind is how to specifically work this out because committing to the trunk for some reason does not work for you. In that case, I still suggest reading through the three different styles discussed here.
- If you want to go deep into the feature flag/toggle/switch topic, read Martin Fowler’s article on the topic: https://martinfowler.com/articles/feature-toggles.html
- The website created and maintained by Paul Hammant is a great resource for really all you need to know about trunk-based development.