Without proper training, models and applications will fail. As a result, enterprises will pay increased attention to the information used to train artificial intelligence tools.
As interest in AI development surges, the underlying data used to train models and applications becomes increasingly important.
As a result, 2025 will likely be a year when enterprises place greater emphasis than ever on the basic tenets of proper data management -- how it's governed, stored, prepared and analyzed -- according to Tony Baer, principal at dbInsight.
"I see 2025 being the year of the renaissance of data," he said. "As AI projects get closer to production, enterprises will start to pay attention again to data."
Enterprise interest in developing AI-powered applications has grown exponentially in the two years since OpenAI's launch of ChatGPT marked a significant improvement in the capabilities of generative AI (GenAI) models. Large language model technology has improved since then, with AI developers Anthropic, Google, Meta, Mistral and others all striving to top one another.
But without an organization's proprietary data, those models are of little use.
It's only when GenAI is combined with proprietary data and trained to understand an organization's operations that the models become useful. Only then can they deliver benefits such as smarter decision-making and improved efficiency that make generative AI such an attractive proposition for businesses.
So as enterprises invest more in AI development, they will also need to take steps to ensure that their data is properly prepared.
Data lakehouses and data catalogs will be front and center. So will data and AI governance. And accessing and operationalizing unstructured data will be critical.
In organizations that have properly prepared their data to develop AI-powered applications, traditional business intelligence will be transformed, according to Yigal Edery, senior vice president of product and strategy at Sisense.
"In 2025, AI will completely obliterate the boundaries of traditional BI, enabling anyone to develop and use analytics without specialized knowledge," he said. "Emerging AI-driven platforms will make analytics as intuitive as natural conversation, eliminating the need for clunky dialog boxes and complex interfaces."
Governance and preparation
Successful AI development begins with properly prepared data.
Without high-quality, well-governed data, AI projects will fail to deliver their desired outcomes. At the very least, improperly trained AI tools will erode trust in their outputs, leading to the tools going unused. Of greater concern is when they do get used, and decisions based on bad outputs lead to customer dissatisfaction, regulatory noncompliance and organizational embarrassment.
To better ensure that high-quality data is used to train AI tools, and to make that data discoverable, semantic models will become more popular in 2025, according to Jeff Hollan, head of applications and developer platform at Snowflake.
Semantics are descriptions of data sets that give them meaning and enable users to understand the data's characteristics. When implemented across an organization, semantic models ensure that the organization's data is consistent and can be trusted. In addition, they make data sets discoverable.
Traditionally, semantic models have been used to define the data sets that inform data products, such as reports and dashboards. Now, they can -- and perhaps should -- be used to define the data sets used to train AI models and applications.
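As a rough illustration, a semantic model can be as lightweight as structured, machine-readable descriptions attached to a data set. The Python sketch below shows the idea; the data set, owner and column names are hypothetical, and production semantic layers are far richer.

```python
from dataclasses import dataclass, field

@dataclass
class ColumnSemantics:
    """Business meaning attached to a physical column (names are illustrative)."""
    name: str            # physical column name, e.g. "cust_id"
    business_term: str   # governed business term, e.g. "Customer Identifier"
    description: str     # plain-language definition usable by humans and LLMs
    unit: str | None = None

@dataclass
class DatasetSemantics:
    """A minimal semantic model: consistent, discoverable metadata for one data set."""
    dataset: str
    owner: str
    columns: list[ColumnSemantics] = field(default_factory=list)

# Example: describing a revenue table so an AI agent interprets it consistently.
revenue = DatasetSemantics(
    dataset="finance.monthly_revenue",
    owner="data-platform-team",
    columns=[
        ColumnSemantics("cust_id", "Customer Identifier",
                        "Unique, stable ID for a billing account"),
        ColumnSemantics("rev_usd", "Recognized Revenue",
                        "Revenue recognized in the period", unit="USD"),
    ],
)
```

The point is that the same definitions serve every consumer -- dashboards, chatbots and data agents alike -- so interpretations stay consistent across the organization.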
"Investing in high-quality, well-governed semantic data models will become a top priority for organizations in 2025," Hollan said. "The growing adoption of AI-powered applications, chatbots and data agents highlights the critical need for curated models that organize and structure data effectively."
Enterprises that don't invest in semantic modeling often wind up trying to develop AI tools with fragmented data that leads to poor accuracy, he continued.
"As a result, this area is poised to see significant investment and innovation in tools, paving the way to fully realize AI's potential," Hollan said.
Effective data and AI governance will also help deliver desired outcomes, according to Sanjeev Mohan, founder and principal of analyst firm SanjMo.
A data governance framework is a documented set of guidelines to determine an organization's proper use of data, including policies addressing who can do what with data, along with data privacy, quality and security. AI governance is an extension of data governance, applying the same policies and standards as data governance to AI models and applications.
"In 2024, most organizations are still grappling with picking the appropriate use cases and doing experimentations," Mohan said. "But as generative AI workloads become more pervasive, the need for AI governance will grow."
Like semantic models, catalog tools are a means of governing data and AI and making sure they're both high-quality and used effectively.
Catalogs are applications that use metadata to inventory and index an organization's data and AI assets -- including data sets, reports, dashboards, models and applications -- to make them discoverable for analytics and AI-driven analysis. In addition, they are where administrators can put governance measures in place, including access controls.
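In miniature, a catalog reduces to a metadata index over heterogeneous assets that supports search plus access control. The Python sketch below uses hypothetical asset names and roles to show that core idea; real catalogs add lineage, quality scores and much more.

```python
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    name: str                # e.g. "sales.orders" or "churn_model_v2"
    asset_type: str          # "dataset", "dashboard", "model", ...
    tags: list[str]          # metadata used for discovery
    allowed_roles: set[str]  # governance: who may access the asset

class Catalog:
    """A toy catalog: inventory data and AI assets by metadata, make them discoverable."""
    def __init__(self):
        self._entries: list[CatalogEntry] = []

    def register(self, entry: CatalogEntry) -> None:
        self._entries.append(entry)

    def search(self, tag: str, role: str) -> list[str]:
        """Return only the assets this role is permitted to see."""
        return [e.name for e in self._entries
                if tag in e.tags and role in e.allowed_roles]

catalog = Catalog()
catalog.register(CatalogEntry("sales.orders", "dataset",
                              ["sales", "pii"], {"analyst", "admin"}))
catalog.register(CatalogEntry("churn_model_v2", "model",
                              ["sales", "ml"], {"ml_engineer", "admin"}))
print(catalog.search(tag="sales", role="analyst"))  # ['sales.orders']
```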
Given their key capabilities, they will only grow in importance as the use of AI tools increases. However, because many enterprises use open table formats such as Apache Iceberg and Delta Lake to federate data across multiple systems, the catalogs will need to be open as well, according to James Malone, Snowflake's head of data storage and engineering.
"It's already clear that an open table format can't truly exist without an open catalog," he said. "In the coming year, I expect all open catalog solutions to prioritize federation between each other because customers simply don't have the time or resources to constantly switch and migrate between catalogs. Every catalog provider will need to offer seamless federation to win in the market."
Storage and development
One of the keys to developing successful AI applications is the amount of data used to train them.
Models that aren't trained with enough data are prone to hallucinations -- incorrect and sometimes even bizarre outputs -- whereas those trained with an appropriate amount of data are more likely to be accurate.
How much data is needed for proper training depends on the use case. Narrow use cases naturally require less data than broader ones. But even applications developed for hyper-specific use cases need to be able to draw on enough data to prevent them from making up responses when there's not enough data to inform a proper response to a query.
Enter unstructured data.
Historically, analytics has focused largely on structured data, such as financial records and point-of-sale transactions. However, structured data now makes up less than 20% of all data. Unstructured data such as text, audio files and images makes up the rest. For AI models and applications to be accurate, enterprises need to access their unstructured data and use it to inform their AI tools.
Many data management vendors have added capabilities such as vector search and retrieval-augmented generation to make unstructured data discoverable and actionable.
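Conceptually, vector search reduces to embedding documents and queries into the same numeric space and retrieving the nearest matches, which retrieval-augmented generation then passes to the model as grounding context. The self-contained sketch below uses a deliberately simple hashing embedder as a stand-in for a real embedding model, and the documents are invented.

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy hashing embedder standing in for a real learned embedding model."""
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]  # unit-length, so dot product = cosine similarity

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# Index unstructured documents as vectors.
docs = ["refund policy for enterprise contracts",
        "office parking instructions",
        "contract renewal and refund terms"]
index = [(d, embed(d)) for d in docs]

# Retrieval step of RAG: fetch the documents closest to the query,
# then hand them to the language model as context for its answer.
query = embed("how do refunds work?")
top = sorted(index, key=lambda pair: cosine(query, pair[1]), reverse=True)[:2]
context = "\n".join(d for d, _ in top)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: how do refunds work?"
print(prompt)
```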
But to truly get value from unstructured data, enterprises need to start treating it with the same care as they do their structured data, according to Mohan.
"All the best practices for structured data need to be applied to unstructured data like modeling, security and access governance," he said. "Unstructured data needs to become a first-class citizen, just like its structured data brethren."
Meanwhile, as unstructured data gains increased importance, so will the ways in which it is stored.
Traditional data warehouses store mainly structured data. As unstructured data proliferated, data lakes were developed to provide organizations with a repository for text and other forms of data that lack a traditional structure.
However, with structured data in one location and unstructured data in another, the two were isolated from each other. Organizations either had to go through the painstaking labor of unifying them or leave them separate and accept that any analysis would draw on only some of their pertinent information.
Data lakehouses, first developed about 10 years ago, are a hybrid of data lakes and warehouses, enabling structured and unstructured data to be stored together. Engineers still need to use methods such as vectorization -- the algorithmic conversion of unstructured content into numerical values -- to make unstructured data compatible with structured data, but at least the two aren't isolated from each other in lakehouses.
With unstructured data essential to AI development, Mohan said he expects lakehouses to continue gaining popularity. But not just any lakehouses.
Just as increasing use of open table formats will result in increased use of open catalogs, open table format-based lakehouses will gain popularity, according to Mohan. AWS' recent introduction of S3 Tables and S3 Metadata will help fuel the trend.
"Open table format-based lakehouses will become the de facto analytical approach," he said.
In addition, the preferred open table format will become Apache Iceberg, Mohan continued.
"Apache Iceberg will increase in its prominence at the cost of Delta format," he said.
Open table formats won't be the only open source capabilities to gain popularity in 2025 and spur adoption of the tools that support them, according to JB Onofre, principal software engineer at Dremio and a member of the Apache Software Foundation's board of directors.
Rather, an increased emphasis on interoperability between systems and a corresponding fear of vendor lock-in will drive widespread open source adoption.
"Projects that support hybrid architectures and are extensible across diverse environments will thrive," Onofre said. "In particular, we'll see open source communities focusing on AI-ready data, developing tools that not only democratize access but also ensure data governance and security meet enterprise-grade standards."
Read the full story from Eric Avidon, via TechTarget.