
Understand what Good Data looks like and quickly discover Bad Data

Good Data vs. Bad Data. Good Data derives the data strategy from the company strategy, feeding into the datacisions cycle. Bad Data has lots of “initiatives” flying around the company, without a coherent data strategy. Image by the author.

In 1990, the Virginia-based bank “Signet Bank” decided to trust two smart people, Richard Fairbank and Nigel Morris, and make a major investment in data. They turned the customer credit department into a large laboratory, “testing” different kinds of credit terms on borrowers with different characteristics, and collected the resulting data for years.

This was a huge investment, and the department “lost money” for quite some time. But what they were really doing was acquiring data, and not just because they thought data was a good investment. What Fairbank and Morris collected was Good Data: data integrated into a good data strategy that is aligned with the company strategy.

They collected data with the clear focus of improving the decision making capability of the credit department of Signet Bank.

Good data is just that. Good data is what you have with a good data strategy. It is data you collect, clean/enrich/transform, make insightful, with the sole goal of improving decision making.

The only problem is that there is a lot of Bad Data out there! You will see Bad Data everywhere you go. It is data touched for any other reason, without a larger goal in mind.

I’m not sure it’s fair to judge data just on this scale, but in my experience, it seems to be the only good measure. The only value in data is to improve the decision-making capability of a company and thus help it take better actions. I do think this is a lesson that’s already been understood by many great thinkers. For instance, Provost & Fawcett make that point in “Data Science for Business: What you need to know about data mining and data-analytic thinking”.

Fairbank & Morris understood this, applied it, and turned the little “Signet Bank” into Capital One, one of the largest banks in America. David Velez, co-founder of the billion-dollar Brazilian unicorn Nubank, also intuitively understood this from the beginning. He is now translating his key insights into other markets around the world.

Now let’s see what Bad Data looks like and then walk through these two examples of Good Data.

I’ve written this article to help you see the dramatic differences between Good Data and Bad Data. To help you see through the forest of techniques that can be applied in both contexts. To give you the chance to reexamine your data strategy and turn Bad Data into Good Data before your competition does it.

Bad Data

There are a great many examples of Bad Data. Bad Data happens when you try to let the data shape the data strategy, not the other way around. Fairbank and Morris didn’t hire a hundred data scientists to “come up with something great” or collect lots of random data throughout the company. They collected data with the goal of establishing a profitable credit division and then hired data scientists to collect & use the right data and support this strategy.

And although there is some evidence that some tech companies actually do simply “hire smart people to come up with something great” and are successful with it, I do feel that they usually have a data strategy behind it to back that up.

Bad Data is ….

… when a company invests in building up a data lake to “get insights from the data” without understanding whether anyone in the company will actually use this data to make better or faster decisions than before.

… when a company hires some data scientists to “take advantage of big data”, without having actual outcomes in mind.

…when a company hires a bunch of machine learning engineers & data scientists to “come up with something great”, without integrating them into their company vision.

…when a company decides to build a “data mesh” because the central data collecting team is having problems, without understanding whether this will result in better actions taken by its employees.

…when an analytics department produces report upon report because people ask for them, without understanding what people do with these reports differently than what they did before.

…when a machine learning team spends a quarter building the latest recommendation engine for the website, without thinking about whether a simple compiled “Top items” list would do just as well.

…when a marketing department installs the latest marketing automation tool, without already having lots of e-mail campaigns and data in place.

…when a company spends a quarter to upgrade to the latest version of “AwesomeDataTool.X” to “follow the pace of technology”, without understanding how people use their AwesomeDataTool to decide.

… it’s not Good Data. Good Data is derived from a data strategy, integrated into the company strategy ….

Good Data at Signet Bank (and then Capital One)

All the techniques mentioned above that result in Bad Data can also be used to produce Good Data. And maybe in your case they are. But to see whether that is true, you’ve got to ask why, why, and why.

To get Good Data, you have to start from the top. Start with the data strategy, the company strategy really. You have to start with the goal of taking better actions, making better decisions, and then work your way down the road. What does this road look like? Nowadays I tend to think about the road of data as being like the flow used by ThoughtWorks:

Read more about this cycle in the ThoughtWorks Intelligent Enterprise Series. Image by the author.

So the stages, if we start with our goal of “better actions”, are:

  • Taking better actions
  • Making the decision to take a specific action
  • Having the insights available to make a specific decision
  • Having the information to derive insights
  • Having collected the data to transform it into information
  • (Having actions emit data to let it be collected… which closes the circle.)
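
The stages above can be sketched as a simple checklist. This is only an illustration, not part of the ThoughtWorks material: the `DataInitiative` class, the `missing_stages` helper, and both example initiatives are hypothetical names and values made up for this sketch. The idea is that an initiative which cannot fill in every stage, from the action down to the data, is a candidate for Bad Data.

```python
from dataclasses import dataclass
from typing import List, Optional

# The stages of the cycle, ordered from the goal (action) down to raw data.
STAGES = ("action", "decision", "insight", "information", "data")


@dataclass
class DataInitiative:
    """Hypothetical record describing how an initiative maps onto the cycle."""
    name: str
    action: Optional[str] = None
    decision: Optional[str] = None
    insight: Optional[str] = None
    information: Optional[str] = None
    data: Optional[str] = None


def missing_stages(initiative: DataInitiative) -> List[str]:
    """Return the stages the initiative has not spelled out."""
    return [stage for stage in STAGES if not getattr(initiative, stage)]


def is_good_data(initiative: DataInitiative) -> bool:
    """An initiative counts as Good Data here only if every stage is filled."""
    return not missing_stages(initiative)


# A made-up initiative that starts (and ends) at the data stage.
data_lake = DataInitiative(
    name="Build a data lake",
    data="All operational tables, copied nightly",
)

# A made-up initiative that is traced from the action down to the data.
credit_lab = DataInitiative(
    name="Credit experiments",
    action="Offer more profitable loans",
    decision="Which customer gets which loan terms?",
    insight="Which terms are profitable for which customer group",
    information="Profitability and default rates per group and terms",
    data="Results of experimental credit offers",
)

print(data_lake.name, "is missing:", missing_stages(data_lake))
print(credit_lab.name, "is Good Data:", is_good_data(credit_lab))
```

A filled-in checklist obviously doesn’t prove the strategy is sound, but an empty “action” or “decision” field is a quick way to discover that nobody has asked why.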

If we start to analyze the Signet Bank case, we start with the data strategy. In this case, it’s “collecting & experimenting with credit loans to get enough information to skim off the profitable credits larger banks won’t offer”.

This was a pretty unique perspective at the time. In the 1980s the credit market was revolutionized by automatic default probability calculations. As a result, credit was offered at a standard rate to people with a low default probability. But in 1990 Fairbank and Morris decided it was time to bet on two things: price discrimination, in other words offering different credit terms to different people, and a focus on profitability, not just default probability.

So in essence they decided that in the mass of people with a higher default probability there were still profitable loans out there! And in the mass of people with low default probability, there were more profit options available as a lot of these people were actually a losing bet for the credit departments.
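
To make the difference between default probability and profitability concrete, here is a small worked sketch. It assumes a deliberately simplified one-period model (a customer either pays and generates some income, or defaults and causes a loss), and every number in it is invented for illustration. None of it comes from Signet Bank.

```python
def expected_profit(p_default: float, income_if_paid: float,
                    loss_if_default: float) -> float:
    """Expected profit per customer under a simplified one-period model.

    Assumption: with probability p_default the bank loses loss_if_default,
    otherwise it earns income_if_paid. Real credit models are far richer.
    """
    return (1 - p_default) * income_if_paid - p_default * loss_if_default


# Invented numbers: a "safe" customer who pays on time but earns the bank
# very little at the standard rate.
low_risk = expected_profit(p_default=0.02, income_if_paid=30.0,
                           loss_if_default=2000.0)

# Invented numbers: a "risky" customer who is offered different terms and
# therefore earns the bank much more when things go well.
high_risk = expected_profit(p_default=0.10, income_if_paid=400.0,
                            loss_if_default=2000.0)

print("low default probability, expected profit: ", round(low_risk, 2))
print("high default probability, expected profit:", round(high_risk, 2))
```

With these made-up inputs, the low-risk customer comes out as a loss and the high-risk customer as a profit, which is exactly the kind of situation a pure default-probability cutoff cannot see.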

So we can derive the action that Signet Bank was taking and wanted to improve as…

Image by the author.

Action — “Offer more profitable loans”. As Fairbank and Morris identified, they needed to focus on both profitability & default probability. The decision in question then becomes …

Decision — “What kind of customer do I offer what kind of loan?”. In the 1980s this decision was simply made based on the insight that some people have higher default probabilities and some lower. What Signet Bank now sought was a different insight…

Insight — “What specific customers should I provide with what specific kinds of loan conditions?”. To derive these insights we need more information. Unfortunately, this information wasn’t available at all in 1990. The only information provided was the default probability based on socio-demographic data….

Information — “Hard data on how profitable different customer sets are, together with traditional default probabilities and data on what kinds of customers are served by other banks and what conditions they offer!!”. All of this information flows into the decision of whom to serve with what product package. So finally it comes down to the question of data, raw data which simply wasn’t available….

Data — “New data, experimental data on different sets of conditions offered to people with different socio-demographic backgrounds and their respective profitability AND default probability”. So for Signet Bank, it turned out, the Good Data strategy was to collect this kind of data, combine it with the other data sources to walk back up the cycle again, and in turn make better decisions.
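
Walking back up the cycle from such experimental data could look roughly like the sketch below. The records, segment names, and term names are all invented, and the two helper functions are hypothetical. It turns raw experiment records (data) into default rates and mean profit per segment and terms (information), and then into a suggestion of which terms to offer to which segment (insight).

```python
from collections import defaultdict

# Invented experiment records: (customer segment, offered terms,
# whether the customer defaulted, realized profit).
records = [
    ("segment_a", "low_rate", False, 40.0),
    ("segment_a", "low_rate", False, 35.0),
    ("segment_a", "high_rate", False, 90.0),
    ("segment_a", "high_rate", True, -500.0),
    ("segment_b", "low_rate", True, -400.0),
    ("segment_b", "low_rate", False, 30.0),
    ("segment_b", "high_rate", False, 120.0),
    ("segment_b", "high_rate", False, 110.0),
]


def summarize(rows):
    """Data -> information: default rate and mean profit per (segment, terms)."""
    groups = defaultdict(list)
    for segment, terms, defaulted, profit in rows:
        groups[(segment, terms)].append((defaulted, profit))
    summary = {}
    for key, group in groups.items():
        n = len(group)
        summary[key] = {
            "n": n,
            "default_rate": sum(1 for d, _ in group if d) / n,
            "mean_profit": sum(p for _, p in group) / n,
        }
    return summary


def best_terms_per_segment(summary):
    """Information -> insight: the terms with the highest mean profit per segment."""
    best = {}
    for (segment, terms), stats in summary.items():
        current = best.get(segment)
        if current is None or stats["mean_profit"] > summary[(segment, current)]["mean_profit"]:
            best[segment] = terms
    return best


summary = summarize(records)
for key, stats in sorted(summary.items()):
    print(key, stats)
print(best_terms_per_segment(summary))
```

A real experiment would need far more observations per cell and a proper statistical comparison before anyone acts on the result. The sketch only shows how the stages of the cycle connect.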

Of course, afterward, this wheel keeps on turning, adding more value to each stage. Capital One is reported to run thousands of variations of this kind each year. For a deeper dive into the case, look at “Data Science for Business: What you need to know about data mining and data-analytic thinking” by Provost and Fawcett, 2013.

This case is already quite old, and companies tend to not share this kind of data. But I’ve recently stumbled across another related case from the finance industry, that of the Brazilian unicorn “Nubank”.

Good Data at Nubank

I knew Nubank for their machine learning framework fklearn. The company is creating lots of open source software for machine learning problems. But as it turns out, Nubank ain’t simply investing in machine learning because everyone else in the industry does it. The investment in machine learning and algorithms is deeply tied to the company’s strategy and the resulting data strategy.

Nubank is breaking into the “unbanked market”, which in Brazil is around 50% of the population. These people simply don’t have access to the banking system. Nubank is changing that by offering products targeted at that specific group. The first product, a mobile-only managed credit card, aims to provide bank access & small loans to the unbanked at a speed previously unknown in the Brazilian finance industry, and more products are following on that path.

But serving the “unbanked” turns out to come with an obvious problem: How do you know whether an unbanked customer will default, or be profitable for the company? After all, they have no banking history or financial information like a credit score.

This is where the data strategy becomes apparent. Nubank’s founder David Velez describes to some extent how they recognized exactly that problem and realized that they had to focus on unconventional ways of looking at and collecting their data.

Let’s start from the top: Provide the unbanked with access to a credit card with reasonable fees and still be profitable. That’s the action we want to take. The data strategy thus turns into “Collect & analyze data to derive which people to offer which kinds of products: the credit card, a loan, or a debit card”…

Image by the author.

Action — “Which customer should be offered which product? With what kinds of credit limits? To stay profitable.”. After all, just like Signet Bank, Nubank is tackling a group of customers from which the traditional banking system in Brazil shies away. The key decisions are decisions like….

Decision — “Which specific customer can we offer the Mastercard credit card, to whom can we offer a loan and at which rates?”. To make such decisions Nubank needs…

Insights & information — “the profitability of a customer as well as the default probability. But being a fast-growing start-up they also need the likelihood of people recommending the bank to others as well as the possible usage of additional products like the loyalty program.”.

And that’s exactly what Nubank builds and is continuing to develop, according to Velez in a CNN interview:

“Nubank, however, has built its business on a wholly new foundation: unique data sets and algorithms that are based on “a lot of nontraditional information,” Vélez said.”


“We look at where you live…how you move, who your friends are, who invited you to Nubank, the type of people that you’re sending money to,” he said. “We look at whether you read the contract of the credit card or whether you don’t — it turns out that people [who “read” the contract] really fast tend to be fraudsters. We look at the type of transactions that you’re doing, if you’re buying groceries or if you are in a bar.”

Indeed, if you look closely, Nubank is in a particularly great spot to exploit this kind of information because they also collect a lot of data that traditional companies do not collect! They are already on your phone and have a recommendation program in place, and are thus collecting key behavioral data.

Indeed, this is underpinned by a 2011 study by Martens & Provost, which basically says: for banks, knowing and using information such as where you live and how old you are is a good starting point, but using behavioral data like the kinds mentioned here provides a substantial lift in profitability, and that lift increases the more data is used, whereas sociodemographic data hits a clear ceiling.

You should also notice that Nubank is in the unique position to run experiments very similar to the one Signet Bank conducted! They currently provide 50% of the newly issued credit cards in Brazil, meaning they control a huge share of the unbanked market. By experimenting with the unbanked market they can derive very similar insights based on the data points they currently collect. And they can do that better than anyone else.

Whether that’s enough to stay profitable remains to be seen, but at least it was enough to turn Nubank into a billion-dollar start-up and let them expand around the world.

If you want to make a difference in your company with the use of data, I hope this helps you discern Bad Data from Good Data and puts you on the right path, focusing on data strategy and not on letting data dictate a “strategy”. Stay tuned. There’s more Good Data to come.

