The Future Of Good Data — What You Should Know Now!

“Leveraging exponential technology to tackle big goals and using rapid iteration and fast feedback to accelerate progress toward those goals is about innovation at warp speed. But if entrepreneurs can’t upgrade their psychology to keep pace with this technology, then they have little chance of winning this race.”

Peter H. Diamandis, Bold: How to Go Big, Create Wealth and Impact the World

Photo by Franki Chamaki on Unsplash.

Introduction

It’s still Day 1 for data. Companies, governments, and non-profits around the world are already extracting a whole lot of value from data. But compared to the things that are coming, today really is just Day 1. Data is growing exponentially, as is our ability to extract knowledge from it. If you imagine the amount of data available to us today as an apple, then by 2030 this apple will have turned into a soccer ball. By 2050, it’s gonna be the size of an entire soccer field!

But what does that really mean for you and me? What exactly contributes to this kind of data growth? Will the value we can extract from the data grow linearly with the amount of data? As I was still a bit puzzled about the implications, I decided to take a short tour around the future data universe. This article lays out what I personally think the future of data will bring, based on the growing amount of data, our ability to extract value from it, and what research tells us about it.

I truly believe the future of data will shape every industry, period. If I take any sample industry, look at Porter’s five (or six) forces at work, and ask myself what the impact of data will be on them, it seems clear to me that every single industry will look very different in 10 or 20 years. So for you, the question is not whether to grab this “opportunity” or not, but what you are going to do about it, because if you do nothing, you will be disrupted by it.


Why Is This So Important Now?

Alright, so data is growing quickly. Big deal… Technology will adapt and our company will simply follow the lead.

Only by then it will be too late. Way too late.

Data is not just growing quickly. According to the current forecasts, it’s probably growing exponentially! To add to that, our ability to extract information from data, that is, our computation power, is growing exponentially as well!

That’s the problem with exponentially growing technologies: you won’t notice until it’s too late.

“today we live in a world that is global and exponential. The problem is that our brains — and thus our perceptual capabilities — were never designed to process at either this scale or this speed. Our linear mind literally cannot grok exponential progression.”

Peter H. Diamandis, Bold: How to Go Big, Create Wealth and Impact the World

Let me make this even clearer: In 6 years, a single company will face four times the data it has available today. It will be able to get predictions on the data it has today 8 times faster or cheaper. Even the four-times-larger data set it faces in 6 years will take only half the time to analyse that today’s data takes today. Most of the new data will be real-time, and it will be event or behavioral data, not “state” data. A lot of the data will be in image form. A lot of data will sit on edge devices, and a lot of computation will be done there as well. And to top that, a lot of freely available technologies to deal with all these changes will flood the open-source market.
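
The arithmetic behind these figures can be sketched in a few lines. This is only a back-of-the-envelope illustration: the 3-year doubling for data comes from the forecast discussed in this article, while the 2-year doubling for compute is an assumption chosen because it reproduces the “8 times” figure (an 18-month doubling would give 16 times instead).

```python
# Assumed doubling periods (illustrative, not measured values).
DATA_DOUBLING_YEARS = 3.0     # data volume doubles roughly every 3 years
COMPUTE_DOUBLING_YEARS = 2.0  # assumption that reproduces the "8 times" figure


def growth_factor(years: float, doubling_years: float) -> float:
    """Factor by which a quantity grows after `years` of steady doubling."""
    return 2.0 ** (years / doubling_years)


horizon = 6.0
data_factor = growth_factor(horizon, DATA_DOUBLING_YEARS)        # 4.0
compute_factor = growth_factor(horizon, COMPUTE_DOUBLING_YEARS)  # 8.0

# Analysis time scales with data size divided by compute speed.
relative_time_future_data = data_factor / compute_factor  # 0.5: half the time
relative_time_todays_data = 1.0 / compute_factor          # 0.125: an eighth

print(data_factor, compute_factor, relative_time_future_data)
```

Changing `COMPUTE_DOUBLING_YEARS` to 1.5 shows how sensitive the conclusion is to the assumed doubling period.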

Do you feel like you’re heading in a direction that will let you tackle that world? If you have your answer, I guess it also answers the question “Why is this so important now?”

Image by Sven Balnojan.

Data & Computing Are Growing Exponentially

The thing is, current data forecasts only cover part of the future. But if you extrapolate the data, you will notice that the amount of data is already doubling roughly every 3 years and is forecasted to keep doing so for the next 5 years. I believe that a lot of underlying forces will continue to carry this trend. The forces I see are mostly:

  1. The spread of edge devices, which are forecasted to reach 41.6 billion connected IoT devices by 2025.
  2. Smartphone adoption, which will go close to 100%; currently it is at 44%.
  3. Internet adoption, which will go close to 100%; currently it is at 59%, with large projects from Google & Facebook aiming to provide internet to the whole world.
  4. The exponential development of a lot of underlying technologies like data storage, computation that produces data, the cost of edge devices, etc., all of which are essentially related to Moore’s Law.

The best forecast I could find is from Seagate, in the IDC Seagate DataAge whitepaper. Here’s what the data looks like if you fit an exponential to it.

That’s just the simple forecast by IDC & Seagate. Image by Sven Balnojan.

Next, let’s look at the exponential continued to the year 2030.

Forecast by exponential fitting. Image by Sven Balnojan.

Finally, the really interesting question is: what will things possibly look like in 2050?

We’re in the Data Stone Age! Image by Sven Balnojan.

Notice this describes the data created, not the data stored. But to extract value from data it’s actually not that important to store it; what matters is to extract the information, make a decision, act on it, and then discard the data. So I find this forecast to be a good indicator of the wave that is about to come. The important insights from the forecasts for me are:

  1. By 2030 we will have around 572 Zettabytes of data, which is roughly 10 times more than today.
  2. By 2050 we will have 50,000–500,000 Zettabytes, which is 1,000–10,000 times more than today (forecast by exponential continuation).
  3. By 2025, more than 50% of the data will be on the edge.
  4. By 2025, more than 50% of the data will be real-time, a trend that will continue and probably reach close to 90% by 2050.
  5. By 2025, 80–90% of the data will be behavioral/transactional, or what I would call “event data”, a share which again will probably stay at 90–100%.
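
The exponential continuation can be reproduced roughly as follows. This is a minimal sketch, not the fit behind the charts: it takes the 572 Zettabytes for 2030 from the list above as its starting point and assumes the 3-year doubling simply continues. The function names are made up for illustration.

```python
import math


def extrapolate(value: float, from_year: int, to_year: int,
                doubling_years: float) -> float:
    """Continue a quantity forward assuming a constant doubling period."""
    return value * 2.0 ** ((to_year - from_year) / doubling_years)


def implied_doubling_years(start: float, end: float, years: float) -> float:
    """Doubling period needed to grow from `start` to `end` in `years`."""
    return years / math.log2(end / start)


ZB_2030 = 572.0  # figure from the list above

# Doubling every 3 years lands near the lower end of the 2050 range.
zb_2050 = extrapolate(ZB_2030, 2030, 2050, doubling_years=3.0)

# The upper end of the range (500,000 ZB) would need faster doubling.
upper_doubling = implied_doubling_years(ZB_2030, 500_000.0, 20)

print(round(zb_2050), round(upper_doubling, 2))
```

Under these assumptions, the lower end of the 2050 range follows from the 3-year doubling, while the upper end would require data to double roughly every two years.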

On the other side of things is our ability to extract information from these massive amounts of data. As far as I can tell, Moore’s Law is still going strong, although some people think it has hit its final roadblock. Still, the density of transistors on a given chip has been doubling roughly every 18 months, providing us all with edge devices, smartphones, and amazing computers at bargain prices.

Another trend in the same direction is the development of GPUs, whose performance from 1996 to 2020 followed an exponential growth curve as well. GPUs are not just used to produce graphics. Rendering graphics means doing a lot of matrix multiplication and addition, so GPUs are optimized for just that. It turns out this is exactly what today’s data analysis also needs: lots of matrix maths. In particular, deep learning, AI, and machine learning are fundamentally about adding and multiplying matrices, which is why our ability to extract information from data currently has a lot to do with the exponential growth of GPU computation power.
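
To make the “matrix maths” concrete, here is a minimal sketch of what a single neural network layer computes: a matrix multiplication followed by an addition. The numbers are made up, and the plain-Python loops only illustrate the operation; a GPU performs many of these multiply-add steps in parallel.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply matrix a (n x k) by matrix b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def add_bias(m: Matrix, bias: List[float]) -> Matrix:
    """Add the same bias vector to every row of m."""
    return [[value + b for value, b in zip(row, bias)] for row in m]


# Two samples with two features each (illustrative values).
inputs = [[1.0, 2.0],
          [3.0, 4.0]]
# Layer weights mapping 2 features to 3 outputs (illustrative values).
weights = [[0.5, -1.0, 0.0],
           [0.25, 0.0, 1.0]]
bias = [1.0, 2.0, 3.0]

outputs = add_bias(matmul(inputs, weights), bias)
print(outputs)  # [[2.0, 1.0, 5.0], [3.5, -1.0, 7.0]]
```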

But that is not all. The emergence of TPUs, chips specifically designed for machine learning, will push these trends even further and faster. TPUs essentially do the same thing as GPUs but are much more energy-efficient, thus pushing down prices.

Finally, there’s the exponential growth of quantum computers, which can be measured by “Rose’s Law”, which has held pretty well for 17 years now. Rose’s Law says that the number of qubits, the bits on a quantum chip, will double roughly every year. While quantum machine learning is still largely theoretical, I have no doubt that quantum data analysis will become an everyman’s tool in the next decade.

Event Data is a Gold Mine

In “classic” marketing & marketing automation there is a division between:

  1. Socio-demographic data, like your age, where you live, your income, etc.
  2. Behavioral data, like whether you bought certain items this year, whether you opened an e-mail or clicked on a link.

But I really think the distinction between “state and event” data is much better. Let’s define a “state” as

“The condition of any system at any point in time.” (That’s kind of the thermodynamic definition of state)

A reasonable definition of an event could be

“The transition of a system from condition A → condition B”

So if I’m the system, then me buying an item on amazon.com looks like this for a guy named Sven:

  • Today 12 am; Nothing bought yet; State: “non-buyer”; Events: none.
  • Today 1 pm; Just clicked on the checkout button on amazon.com; State: “non-buyer”; Event: “Sven buys new item”
  • Today 1:01 pm; State: “buyer”

Now the fun part is that, of course, state data and event data are really equivalent. They are simply two ways of looking at the world, because if I give you:

  1. “Here is the event at 1 pm today: Sven buys new item”, then you can tell me exactly what state Sven was in before and after that event.
  2. “Here are the states: at 12 am Sven was a non-buyer, at 1:01 pm he was a buyer”, then you can also tell me that Sven bought something at 1 pm.
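
The equivalence can be sketched as a round trip: replay events to get states, then compare consecutive states to get the events back. The types and function names below are hypothetical, chosen only for this illustration.

```python
from typing import List, Tuple

Snapshot = Tuple[str, str]         # (time, state)
Transition = Tuple[str, str, str]  # (time, state_before, state_after)


def states_from_events(initial: Snapshot,
                       events: List[Transition]) -> List[Snapshot]:
    """Replay events on top of an initial state to get state snapshots."""
    snapshots = [initial]
    for time, before, after in events:
        if snapshots[-1][1] != before:
            raise ValueError(f"event at {time} does not match current state")
        snapshots.append((time, after))
    return snapshots


def events_from_states(snapshots: List[Snapshot]) -> List[Transition]:
    """Compare consecutive snapshots and emit a transition on each change.

    The event time is only as precise as the spacing of the snapshots.
    """
    events = []
    for (_, previous), (time, current) in zip(snapshots, snapshots[1:]):
        if current != previous:
            events.append((time, previous, current))
    return events


events = [("1 pm", "non-buyer", "buyer")]
states = states_from_events(("12 am", "non-buyer"), events)
print(states)                      # [('12 am', 'non-buyer'), ('1 pm', 'buyer')]
print(events_from_states(states))  # [('1 pm', 'non-buyer', 'buyer')]
```

The round trip only works because the initial state and every event are known, which is exactly the assumption that breaks down in practice.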

So if the two kinds of data are equivalent, why does it matter? Because they are only equivalent in theory! In reality, you either:

  1. Don’t have the “state” data for every point in time, and thus are not able to get to the events (because you wouldn’t know that Sven was a non-buyer at 12 am).
  2. Don’t have all the events, but just a very small portion of them, and thus are not able to tell the state, just what really happened today.

Why is this important for us? Because a lot of companies have built large data analytics capabilities around state data, yet in 10 years, 99% of all data will be event data!

The only good part about it? A great study by Martens & Provost gives some hints that using event data is actually a highly profitable thing to do.

Computation & Data at the Edge

By 2025, more than 50% of the data will be collected on the edge, on 41.6 billion IoT devices, of which only roughly 6 billion are smartphones. Much more will come from sensors, cameras, and the like. Of course, your smartwatch will also be on this list. On top of that, there will be edge devices without an internet connection. Already today, all of these devices can do two things with data:

  1. Collect it and, given an internet connection, send it to some central hub for storage/evaluation. Think of your smartphone, which analyses your locally taken photos, sends them to a central hub, and gets back a “collection of photos”, annotations, etc.
  2. Compute on this data locally, on the device. This is always a bit of a challenge because these devices have much less computational power than our usual cloud computer.
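
The difference between the two options can be sketched like this. It is a toy illustration with made-up temperature readings and a hypothetical `summarize_on_device` function, not a real device API: the device either ships every raw reading to the hub, or evaluates locally and ships only a small summary.

```python
from statistics import mean
from typing import Dict, List


def summarize_on_device(readings: List[float],
                        alert_above: float) -> Dict[str, float]:
    """Reduce raw sensor readings to a small summary that is cheap to send."""
    return {
        "count": len(readings),
        "mean": mean(readings),
        "max": max(readings),
        "alerts": sum(1 for r in readings if r > alert_above),
    }


# Hypothetical temperature readings collected by one edge device.
raw_readings = [20.0, 21.0, 22.0, 35.0]

# Option 1: ship everything to the central hub and evaluate there.
payload_for_hub = list(raw_readings)

# Option 2: evaluate locally and ship only the result.
summary = summarize_on_device(raw_readings, alert_above=30.0)
print(summary)  # {'count': 4, 'mean': 24.5, 'max': 35.0, 'alerts': 1}
```

With four readings the saving is negligible, but the summary stays the same size no matter how many readings the device collects.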

So what does that mean? For me, it implies that all of this will have an impact on future companies in three ways. First, as a company you will be faced with a lot of edge-collected data that wants to be evaluated in a central place, stemming from suppliers, your customers, and other third parties that will supply you with this kind of data.

Second, you as a company can take control of your own edge devices to collect data. Think about package & shipping tracking chips, wearable chips, cameras, etc. There will be a lot more touchpoints than before for you to take control of edge devices.

Third, with or without an internet connection, edge devices will want to evaluate data themselves. This needs a shift in perspective because currently, most data analytics people and machine learning practitioners are focussed on the central evaluation of data on large computational devices. But machine learning is already possible inside any edge device. One of the most popular machine learning frameworks, TensorFlow, is already available in JavaScript, essentially allowing machine learning models to be trained & evaluated inside the browser on your smartphone or any other device.

Besides these three different directions in which data analytics will have to move, the kind of data involved will also pose a paradigm shift for a lot of companies.

Real-Time Data will be 50% and increasing

Large amounts of current data analysis are based on historical data, as well as “state” data as described above. Real-time data usually does not make it into the algorithms at all. If it does, it comes at the end of a long series of data, maybe with some “weights” to make current data more important than the historical data.
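
Such “weights” could look like the following. This is one common scheme, exponential decay by age, used here as an assumed example rather than a description of any specific system; the order values are made up.

```python
from typing import List


def recency_weights(n: int, half_life: float) -> List[float]:
    """Weights for n observations ordered oldest to newest.

    The newest observation gets weight 1.0, and the weight halves
    for every `half_life` steps back in time.
    """
    return [0.5 ** ((n - 1 - i) / half_life) for i in range(n)]


def weighted_mean(values: List[float], weights: List[float]) -> float:
    """Average of values where each value counts according to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical daily order values, oldest first; only the last one is "current".
order_values = [10.0, 10.0, 10.0, 20.0]

weights = recency_weights(len(order_values), half_life=1.0)
print(weights)                                # [0.125, 0.25, 0.5, 1.0]
print(sum(order_values) / len(order_values))  # 12.5 (plain average)
print(weighted_mean(order_values, weights))   # pulled toward the newest value
```

Even with these weights, the current observation is still just one entry at the end of a historical series.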

In stark contrast to that is the data sphere of the future, where the real and only important stuff is the real-time data, because there will be so much of it! The reason why many companies resort to historical and state data is that in the past, we had next to no event data. The only thing we knew about someone ordering at some place was his socio-demographic data, which we probably had to buy somewhere.

But in the future, this dynamic will shift. It will be like image recognition. In the past, we needed 100+ images to have a computer tell us who the person in the image was. Today the computer is actually better than a human expert at this task, based on just one image.

So it will be in the future: from a single day of event data, we will be able to tell exactly what a person likes, what he will probably buy later, and so on. We won’t need any data like his income, his gender, etc. It will all be in the events, the real-time interactions of a single day.

And that again will pose a large paradigm shift for analytics people around the world. In the future, the real value won’t be in collecting large amounts of specific data through time; it will be much more about collecting a large spectrum of data.

Image Data will be Huge

The simple truth is, most companies don’t think that image data will ever be important to their business, because there is no good “core product fit”. An e-procurement company will probably think that its complete interaction is through a website or some digital system, and that the usage of image data is limited in that area. Better to leave that to people collecting laptop camera images, surveillance footage, etc.

I’d like to shed two different lights on the situation.

First, a company does not exist in a vacuum. I like to look at Porter’s five (or six) forces to understand industry dynamics:

Porter’s six forces. Image by Sven Balnojan.

Now the thing is, image data can possibly be integrated into the core of any of the six forces shaping your industry. If you’re an e-procurement system, your forces are:

  1. Competition/ New Entrants: There are traditional e-procurement systems that probably follow your thinking. How about competitors that only hold a small segment of your market? That serve more specialized parts and feel that VR/AR is a smart way to present pieces?
  2. Customers/ Substitutes/ Complementors: What if your customers suddenly can use an app on their smartphone (provided by someone else) to instantly recognize the thing they are looking for, and find an offer at some other marketplace?
  3. Suppliers: What if your suppliers suddenly start using image technology to form a deeper connection with their end-users, bypassing you because they provide superior customer service through an AI? What if suppliers can use edge data to quickly detect when an end-user has to reorder the same thing, like copy paper, or a spare part?

All of these things would be powered at their core by image data, and all of them could disrupt this industry, which I’d consider to be far away from having a core product fit with image data.

Second, the massive wave of image data will not only bring the data but also the technology to use it on a grand scale. Anyone will be able to do a lot of computation on images cheaply and easily, again changing dynamics. It will probably bring OCR to a level where no paper documents will be flying around anymore. At least that’s what your customers & suppliers will expect. It will bring the identification of objects in images close to a perfect level (way beyond human level). And it will bring a lot of things that don’t come to my mind right now.

I’m not sure image data in itself will change any industry, but the same principle applies to real-time data, event data, and all the things that we discussed up to now. And I do believe not a single one of these topics should be discarded right away as together they will disrupt every industry.

All that’s left to ask is: Are you prepared?

