Three Surprising Books Every Data Guy Should Read…; ThDPTh #2

Refactoring, Working Effectively with Legacy Code, and Test-Driven Development for Data Guys.

…on software engineering.

Hi, I’m Sven. I think data will power every piece of our existence in the near future. I collect “Data Points” to help understand this near future.

If you want to support this, please share it on Twitter, LinkedIn, or Facebook.

Here are your weekly three data points: Refactoring, Working Effectively with Legacy Code, and Test-Driven Development.

Why three software engineering books for data guys? Because I believe every data team should be treated as an agile development team.

1 Refactoring by Martin Fowler

Whenever I take a look at an SQL statement, it reminds me of this…

(Photo by Taylor Brandon, Unsplash)

… these organically grown buildings you find in slums. They do their job. But no one would build them this way from scratch, because you cannot find anything in them unless you’re one of the residents!

And that’s what I tend to think when I look at SQLs: only the person who wrote the SQL knows what it does and which part to rewrite to get a different result.

How come? I have no idea, but I know what would change it: refactoring! In particular, Martin Fowler’s take on refactoring (which I share):

“Why do I refactor? Because refactoring helps me understand the code. And understanding the code helps me do my job. So I refactor because it’s the fastest way to develop and do my job!”

(paraphrased by me)

If data people refactored frequently, every time they touch an SQL, then SQLs would improve in quality over time. Unnecessary parts would be removed, every SQL would be at a world-class level, as the people at Fishtown Analytics like to say, and it would probably be tested and contain comments where necessary.
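To make the idea concrete, here is a minimal sketch of such a refactoring, using SQLite in Python. The table, columns, and queries are made up for illustration; the point is that the refactored version is readable and that refactoring must not change behavior, which the final assertion checks.

```python
import sqlite3

# Illustrative data: table and column names (orders, customer_id, amount)
# are hypothetical, not from the article.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 10.0), (1, 20.0), (2, 5.0);
""")

# Before: a nested subquery that only its author can scan quickly.
before = """
    SELECT customer_id, total FROM
        (SELECT customer_id, SUM(amount) AS total
         FROM orders GROUP BY customer_id)
    WHERE total > 10
"""

# After: the same logic, refactored into a named CTE with a comment.
after = """
    WITH customer_totals AS (
        -- one row per customer with their lifetime order total
        SELECT customer_id, SUM(amount) AS total
        FROM orders
        GROUP BY customer_id
    )
    SELECT customer_id, total
    FROM customer_totals
    WHERE total > 10
"""

# A refactoring must not change behavior: both queries return the same rows.
assert conn.execute(before).fetchall() == conn.execute(after).fetchall()
print(conn.execute(after).fetchall())  # [(1, 30.0)]
```

The assertion is the safety net that makes frequent refactoring cheap: you can rename, split, and comment freely as long as the output stays identical.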

So, read this book and get your data team to start refactoring often.

2 Working Effectively with Legacy Code, Michael Feathers

Why do I mention this book? Because it’s amazing, can be expressed in a nutshell, and because data teams in my experience tend to have a big problem with legacy code/systems/technologies.

“Working Effectively with Legacy Code” in a nutshell: Michael Feathers defines legacy code as code without tests. That’s it, and I agree. To work effectively with legacy code, you simply have to wrap it, encapsulate it, decouple it, modularize it, do SOMETHING to make it testable. Then write a test. And finally, you can refactor (see book 1), expand, fix, and replace if necessary.

The good news: this applies not just to “code” but to code blocks, components, and infrastructure, at every unit size you choose. That’s it. It’s as simple as that.

Example: If you have a huge SQL that you want to test, wrap it: let it write its result into a table, run your tests on that table, and then expose a “select *” on that table downstream. If you only want to test a part of it, cut that part out and produce some temp tables there; just keep on refactoring…
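That wrapping pattern can be sketched in a few lines with SQLite in Python. The raw_events table and the clicks_per_user query are hypothetical stand-ins for your “huge SQL”; what matters is the three steps: materialize, test, then let consumers read the tested table.

```python
import sqlite3

# Hypothetical source data standing in for whatever the big SQL reads.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_events (user_id INTEGER, event TEXT);
    INSERT INTO raw_events VALUES (1, 'click'), (1, 'click'), (2, 'view');
""")

# Step 1: the "huge SQL" writes its result into a table instead of
# feeding downstream queries directly.
conn.execute("""
    CREATE TABLE clicks_per_user AS
    SELECT user_id, COUNT(*) AS clicks
    FROM raw_events
    WHERE event = 'click'
    GROUP BY user_id
""")

# Step 2: run tests against the materialized table.
rows = conn.execute("SELECT user_id, clicks FROM clicks_per_user").fetchall()
assert all(clicks > 0 for _, clicks in rows)   # no empty aggregates
assert len(rows) == len({u for u, _ in rows})  # user_id is unique

# Step 3: downstream consumers just do a SELECT * on the tested table.
print(conn.execute("SELECT * FROM clicks_per_user").fetchall())  # [(1, 2)]
```

Once the result is a table, any part of the pipeline can be cut out and tested the same way, which is exactly the Feathers move: make it testable first, then change it.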

3 Test-Driven Development by Example, Kent Beck

The story goes that the XP practices like Test-Driven Development (TDD) and Pair Programming were used in the very first SCRUM implementation. The only reason they didn’t make it into the official first SCRUM draft was that these practices seemed very extreme (as in XP!) and thus would make SCRUM less adoptable. That’s probably very true.

But also very sad, because I consider these two practices core to working effectively as a development team. And as you might’ve figured out, I consider every data team a full development team.

I know TDD is a stark contrast to how many data teams work, but I believe, and have experienced, that it’s worth every ounce of effort. I’ve spent way too much time working through the miracles of deployments and network magic on AWS before I realized that infrastructure could be developed with TDD, which cut my development time in half.

I’ve spent way too much time wrestling with ETL tools and tables before realizing that transformation processes in particular are very testable, and that development goes much faster if you have to think about your output BEFORE you start developing.
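Here is what that looks like for a transformation step, as a minimal TDD sketch. The function name, fields, and deduplication rule are all made up; the point is that the test, with its expected output, is written before the transform exists.

```python
# TDD sketch for a transformation step: the expected output is written
# first, then the transform is implemented until the test passes.
# Function and field names here are illustrative, not from the book.

def test_dedupe_keeps_latest():
    raw = [
        {"id": 1, "ts": 1, "value": "old"},
        {"id": 1, "ts": 2, "value": "new"},
        {"id": 2, "ts": 1, "value": "only"},
    ]
    # Written BEFORE the transform exists: it forces you to think
    # about your output first.
    assert dedupe_keep_latest(raw) == [
        {"id": 1, "ts": 2, "value": "new"},
        {"id": 2, "ts": 1, "value": "only"},
    ]

def dedupe_keep_latest(rows):
    # Keep only the most recent row per id, ordered by id.
    latest = {}
    for row in rows:
        if row["id"] not in latest or row["ts"] > latest[row["id"]]["ts"]:
            latest[row["id"]] = row
    return [latest[k] for k in sorted(latest)]

test_dedupe_keeps_latest()
print("test passed")
```

The same shape works whether the transform is a Python function, an SQL statement materialized into a table, or an ETL job: decide the output, encode it as a test, then build.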

And so it goes for every part of the data developer’s job, I think.

P.S.: I share things that matter, not the most recent ones. I share books, research papers, and tools, and I try to provide a simple way of understanding all these things. I tend to be opinionated, though. If that bothers you, you can always hit the unsubscribe button!

P.P.S.: This is also available as an e-mail newsletter “Three Data Point Thursday”.
