Anyone who’s played with generative AI tools like ChatGPT or DALL-E knows the future is approaching fast. But for data nerds like me, the excitement goes beyond using AI tools in my day-to-day work at Inventure. I’m excited about the revolution this will kick off, driving massive new growth in the data industry.

Why? AI isn’t a magic wand you can wave over any data. The hard truth is that the AI you implement is only as intelligent as the data you feed it. The CEOs playing with ChatGPT today need to realize that to unlock this next generation of growth, they have to shore up their data stack and truly invest in data management.

There’s a lot of data for businesses to sort through. The rule of thumb is that each year, as much data is created as in the previous three years combined. This growth has pushed the data management market past €100 billion. Now we predict a big new wave driven by solutions that help businesses unlock the full value of recent progress in AI and give birth to the next generation of intelligent businesses. More specifically, solutions tackling three unsolved problems in the data space. Let’s dive a bit deeper!

1) Garbage in, garbage out

The record-breaking growth of ChatGPT shows the impact large language models have already had. Their influence on businesses will only grow, as already seen in everything from coding and written content creation to design. But one thing that is not always understood is that these models are no different from other language models: they only know what they know based on the data they were fed. (And it sure is a massive amount of data. GPT-4 is allegedly trained on 100 trillion data points!)

One problem is data quality. Take, for example, the Stable Diffusion model that can produce all sorts of cool images from text. Quite a few of these images unfortunately came out with the Getty Images watermark stamped on them. Many large language model companies still lack quality assurance and scalable data testing methods before training a new model. Quality will be an important differentiator in this ongoing race. It will also be vital for avoiding bias and ensuring compliance with intellectual property rights and regulations.
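
As a rough illustration of what an automated quality gate before a training run could look like, here is a minimal Python sketch. The column name, thresholds, and the watermark check are illustrative assumptions, not any particular vendor's approach:

```python
import pandas as pd

def run_quality_checks(df: pd.DataFrame) -> dict:
    """Run a few basic pre-training quality checks on a text dataset.

    Each check returns True when the batch passes, so the result can be
    used to gate a training pipeline.
    """
    text = df["text"]
    return {
        # Training examples should not be missing or empty.
        "no_missing_or_empty_text": text.notna().all() and (text.str.len() > 0).all(),
        # Exact duplicates skew the data distribution and can leak evaluation data.
        "duplicate_ratio_below_1pct": df.duplicated(subset=["text"]).mean() < 0.01,
        # A crude compliance-style guard: flag rows mentioning a known watermark string.
        "no_watermark_mentions": not text.str.contains("getty images", case=False).any(),
    }

if __name__ == "__main__":
    sample = pd.DataFrame({"text": ["a clean training example", "another clean example"]})
    results = run_quality_checks(sample)
    print(results)
    if not all(results.values()):
        raise SystemExit("Quality gate failed; do not train on this batch.")
```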

Last month I spoke with founder and angel investor Patrik Liu Tran, who believes data, and the quality of it, will be a key differentiator in the AI race:

“High quality data is one of the keys to unlocking an unfair advantage in the AI race. Therefore, I believe that the companies that manage to figure out good ways to monitor and ensure the quality of the data that goes in and comes out of the models will be well-positioned to create defensible positions”.

Adding to the complexity of this “quality” problem is the explosion in unstructured data, i.e. data not organized in a pre-set data model. Last year it was estimated that 90% of all data created was unstructured, and that volume is growing at 60% per year. Businesses will have to tackle this head-on to truly leverage the data they collect.

Another important differentiator will be access to proprietary data and the ability to unlock the power of large models built on your specific collection of data. This is what adds genuinely relevant business intelligence to your line of work and creates a sufficient moat. Inventure portfolio company Hopsworks can unlock this for companies, as their founder Jim Dowling explains:

“A ML-model that does not generate predictions on new data does not generate value. You need feature pipelines to create new data for both training and inference. Something we set out to solve.”
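
To make the idea of a feature pipeline concrete, here is a minimal sketch in plain pandas rather than the Hopsworks API; the event fields and aggregations are illustrative assumptions. The point is that the same transformation code feeds both the training set and live inference, so the model never sees features computed differently from how it was trained:

```python
import pandas as pd

def build_features(raw_events: pd.DataFrame) -> pd.DataFrame:
    """Turn raw customer events into model-ready features.

    The same function is meant to be reused for training (on historical
    events) and for inference (on fresh events), keeping features consistent.
    """
    df = raw_events.copy()
    df["event_time"] = pd.to_datetime(df["event_time"])
    df = df.sort_values(["customer_id", "event_time"]).reset_index(drop=True)

    # Running spend per customer up to and including each event.
    df["spend_to_date"] = df.groupby("customer_id")["amount"].cumsum()
    # How many events this customer has generated so far.
    df["events_so_far"] = df.groupby("customer_id").cumcount() + 1
    # Days since the customer's first recorded event.
    df["days_active"] = (
        df["event_time"] - df.groupby("customer_id")["event_time"].transform("min")
    ).dt.days

    return df[["customer_id", "event_time", "spend_to_date", "events_so_far", "days_active"]]

if __name__ == "__main__":
    raw = pd.DataFrame({
        "customer_id": [1, 1, 2],
        "event_time": ["2024-01-01", "2024-01-03", "2024-01-02"],
        "amount": [10.0, 5.0, 20.0],
    })
    print(build_features(raw))
```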

In the next decade, we at Inventure believe companies will spend billions on reinforcement learning to solve these problems. This will include data management at scale, for example with solutions like:

  • Anomaly detection in streaming data (see the sketch after this list).
  • Performance monitoring of large language models.
  • Solutions for enabling easy use of data contracts, which could become a key piece of the data quality puzzle.
  • Metadata management, tackling quality and relevance in unstructured data.
  • Vector databases, whose adoption could dramatically improve search and become a new essential piece of infrastructure disrupting the database space.
  • Synthetic data as a complement to real data, making it possible to control quality and reach a critical quantity of relevant data points faster.
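
As one example of the first bullet, here is a minimal sketch of streaming anomaly detection using a rolling z-score; the window size and threshold are arbitrary choices for illustration, not a production design:

```python
from collections import deque
from math import sqrt

class StreamingAnomalyDetector:
    """Flag values that deviate sharply from a rolling window of recent data.

    A simple rolling z-score: cheap enough to run on a stream, with no
    model training required.
    """

    def __init__(self, window_size: int = 100, threshold: float = 3.0):
        self.values = deque(maxlen=window_size)
        self.threshold = threshold

    def update(self, value: float) -> bool:
        """Add a new observation; return True if it looks anomalous."""
        is_anomaly = False
        if len(self.values) >= 10:  # wait for a minimal history before judging
            mean = sum(self.values) / len(self.values)
            var = sum((v - mean) ** 2 for v in self.values) / len(self.values)
            std = sqrt(var)
            if std > 0 and abs(value - mean) / std > self.threshold:
                is_anomaly = True
        self.values.append(value)
        return is_anomaly

if __name__ == "__main__":
    detector = StreamingAnomalyDetector(window_size=50, threshold=3.0)
    stream = [10.0 + (i % 5) * 0.1 for i in range(60)] + [95.0]  # spike at the end
    flags = [detector.update(x) for x in stream]
    print("anomalies at positions:", [i for i, f in enumerate(flags) if f])
```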

2) The cost of cloud

The explosion in creating and gathering data is becoming a costly problem for companies and for the planet. In a study of public companies in the US, a16z estimates that potential savings on cloud costs could be as much as $500B. As one example, between 2015 and 2017 Dropbox managed to increase its gross margins from 33% to 67% by moving from the public cloud to lower-cost, custom-built infrastructure.

It is not just companies paying the price of the data explosion. Our planet also suffers, as data center emissions now reach levels similar to those of the aviation industry. With the exponential growth in data and the extensive use of AI transforming every industry, this will only get worse unless more innovation comes to market.

We look to invest in solutions tackling this costly problem:

  • Software that enables data centers to transform and operate in a more circular and efficient manner, moving to zero emissions.
  • The next generation of data tools making data storage more efficient. We have already invested in UltiHash and are eager to see what other solutions will tackle the dark data problem (as much as 90% of the data companies store is estimated to be dark data).
  • Companies enabling data management at the edge, or hybrid management of decentralized and central cloud data storage.

3) The bottleneck of data scientists and engineers

One of the most pressing bottlenecks for faster data innovation is the talent shortage in this space. A Gartner study shows that the lack of data skills was the number one challenge business leaders needed to solve for their digital transformation. Making things worse is the inefficient use of data scientists: manual work such as transforming, annotating, and validating data still takes up around 80% of their time.

We believe this problem offers ample investment opportunities, and we are proactively looking for companies using machine learning to automate manual tasks in data management. From the Nordics, there are interesting companies like SuperAnnotate in annotation, Validio in automatic data quality validation, and Grafbase in simplifying backend building and unifying the data layer.

When investing in the next generation of DataOps, we want to see companies that can handle large streaming volumes at speed, analyze data in real time, and simplify the enormous complexity of data management. As Lars Nordwall, COO at Neo4j, notes:

“There is a huge opportunity for someone to become the first ‘Youtube’ of data management — making it easy to manage large volumes of complex data streams just like the Youtube team addressed the impossible of handling real-time video streaming for the first time.”

Machine learning will also enable the next generation of AI-powered developer tools: solutions that make it possible to cut down on the number of junior developers you need and that automate some of the boring, time-consuming tasks developers face.

We strongly believe that, in many cases, you should be able to skip the need for data scientists and junior developers by empowering business users. Actionable data is often the missing link between big data and business value.

We look to invest in the next generation of data tools and dev tools with great no-code functionality and outstanding UI, enabling a marketer or product manager to access and build on data as easily as building a website today. This operationalization of data is the natural evolution in data management, turning the visualization step into an unnecessary detour and giving rise to embedded analytics and the frictionless creation of data products.