How data governance must evolve to meet the generative AI challenge

Data governance was on my mind recently, so I decided to query ChatGPT by entering the prompt: “What is data governance?” The AI responded with: “Data governance is a set of processes, policies, standards, and guidelines that ensure data is properly managed, protected, and utilized within an organization.” That’s a good start, and there is a lot more to say about data governance and its meaning at this moment.

Data governance in the age of generative AI

Data governance covers a range of disciplines, including data security, management, quality, and cataloging. The practice requires defining usage policies, creating master data sources, profiling data sets, documenting dictionaries, and overseeing data lifecycles. An organizational model often defines roles for the chief data officer facilitating a strategy, data owners who set policies on data sets, and data stewards responsible for improving data quality.

“Data governance is a critical element of data integrity, allowing organizations to easily find, understand, and leverage critical data—leading to accurate reporting and informed decisions,” says Tendü Yogurtçu, PhD, chief technology officer at Precisely. “It provides an understanding of data’s meaning, lineage, and impact, so businesses can stay compliant and ensure that AI models are fueled with trustworthy data for reliable outcomes.”

Yogurtçu says that data governance was once a technical undertaking focusing on compliance. “With increased adoption of AI, data has become the most vital corporate asset, and data governance should be an enterprise-wide priority,” she says.

For many organizations experimenting with genAI or building applications with large language models (LLMs), there are greater data governance responsibilities, more risks from how employees use AI tools, and new scope from unstructured data. I consulted with several experts on how data governance must evolve to meet the opportunities and risks inherent in generative AI tools and capabilities.

4 ways to evolve data governance for genAI

Review data policies for use in genAI tools and LLMs
Accelerate data quality initiatives
Review data management and pipeline architectures
Extend data governance to genAI workflows

Review data policies for use in genAI tools and LLMs

Data governance departments oversee data catalogs and communicate data usage policies to help employees tap into centralized data sets and use them for building machine learning models, dashboards, and other analytics tools. These departments are now updating policies to include whether and how to use enterprise data sources in LLMs and open genAI tools. Developers and data scientists must review these policies and consult with data owners on any questions about using data sets to support genAI experimentation.

“With generative AI bringing more data complexity, organizations must have good data governance and privacy policies in place to manage and secure the content used to train these models,” says Kris Lahiri, co-founder and chief security officer of Egnyte. “Organizations must pay extra attention to what data is used with these AI tools, whether third parties like OpenAI, PaLM, or an internal LLM that the company may use in-house.”

Review genAI policies around privacy, data protection, and acceptable use. Many organizations require submitting requests and approvals from data owners before using data sets for genAI use cases. Consult with risk, compliance, and legal functions before using data sets that must meet GDPR, CCPA, PCI, HIPAA, or other data compliance standards.
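As a rough illustration of the approval flow described above, the following Python sketch lists the steps still outstanding before a data set can be used in a genAI experiment. The function name, the `compliance_tags` field, and the two boolean flags are hypothetical; a real process would live in a catalog or ticketing system rather than a script.

```python
# Standards named in the article that trigger a risk, compliance, and legal review.
REGULATED_TAGS = {"GDPR", "CCPA", "PCI", "HIPAA"}


def outstanding_steps(data_set, owner_approved, compliance_reviewed):
    """Return the steps still required before genAI use (hypothetical policy)."""
    steps = []
    if not owner_approved:
        steps.append("obtain data owner approval")
    regulated = sorted(REGULATED_TAGS & set(data_set["compliance_tags"]))
    if regulated and not compliance_reviewed:
        steps.append("consult risk, compliance, and legal: " + ", ".join(regulated))
    return steps


# Illustrative data set record; the name and tags are made up.
patient_notes = {"name": "patient_notes", "compliance_tags": ["HIPAA", "GDPR"]}
todo = outstanding_steps(patient_notes, owner_approved=True, compliance_reviewed=False)
```

An empty list means nothing in this simplified policy blocks the request; it does not replace the organization's actual review.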

Data policies must also consider the data supply chain and responsibilities when working with third-party data sources. “Should a security incident occur involving data that is protected within a certain region, vendors need to be clear on both theirs and their customers’ responsibilities to properly mitigate it, especially if this data is meant to be used in AI/ML platforms,” says Jozef de Vries, chief product engineering officer of EDB.

Those excited about genAI opportunities should adopt a first-things-first mindset and start by understanding their organization’s data privacy, security, and compliance policies.

Accelerate data quality initiatives

Many companies offer data quality solutions, including Ataccama, Collibra, Experian, IBM, Informatica, Precisely, SAP, SAS, and Talend. The global data quality tools market size was valued at over USD 4 billion in 2022 and is expected to grow 17.7% annually. I expect higher growth now that many companies are experimenting with AI tools and LLMs.

“Since artificial intelligence is only as good as the data that fuels it, the many challenges of working with AI are connected to data quality,” says Mateusz Krempa, COO at Piwik Pro. “Poor data quality can lead to misleading or erroneous insights, seriously affecting the outcomes.”

Krempa says that data quality challenges stem from the volume, velocity, and variety of big data, especially since LLMs now tap into the organization’s unstructured data sources. Companies looking to develop internal LLMs will need to extend data quality initiatives to include information extracted from documents, collaboration tools, code repositories, and other tools storing enterprise knowledge and intellectual property.

“Data governance is shifting gears not just to feed LLM systems with tons of data, but to do it wisely and safely,” says Karen Meppen, data governance lead at Hakkoda. “The focus is on ensuring the data is not just big, but smart—accurate, understandable, privacy aware, secure, and respectful of the risks and impacts of intellectual property and fairness.”

Data quality can be improved using different tools, depending on the business goals and data types.

Traditional data quality tools can deduplicate, normalize data fields, validate data against business rules, detect anomalies, and compute quality metrics.
Master data management (MDM) tools help organizations connect multiple data sources and create a source of truth around business entities such as customers and products.
Customer data platforms (CDPs) are specialized tools for centralizing customer information and enabling marketing, sales, customer service, and other customer interactions.
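To make the first item in the list concrete, here is a minimal Python sketch of normalization, deduplication, rule validation, and quality metrics. The records, field names, and the simplistic email rule are all assumptions for illustration; commercial data quality tools do far more.

```python
import re

# Hypothetical sample records; field names are illustrative only.
records = [
    {"id": 1, "email": " Ada@Example.com ", "country": "us"},
    {"id": 2, "email": "ada@example.com", "country": "US"},
    {"id": 3, "email": "not-an-email", "country": "DE"},
    {"id": 4, "email": "", "country": "FR"},
]

# Deliberately simple business rule: something@something.something
EMAIL_RULE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def normalize(record):
    """Trim and case-fold fields so equivalent values compare equal."""
    return {
        "id": record["id"],
        "email": record["email"].strip().lower(),
        "country": record["country"].strip().upper(),
    }


def deduplicate(rows, key):
    """Keep the first row seen for each non-empty key value."""
    seen, unique = set(), []
    for row in rows:
        value = row[key]
        if value and value in seen:
            continue  # duplicate of an earlier row
        seen.add(value)
        unique.append(row)
    return unique


def quality_metrics(rows):
    """Compute completeness and validity ratios for the email field."""
    total = len(rows)
    complete = sum(1 for r in rows if r["email"])
    valid = sum(1 for r in rows if EMAIL_RULE.match(r["email"]))
    return {"completeness": complete / total, "validity": valid / total}


clean = deduplicate([normalize(r) for r in records], key="email")
metrics = quality_metrics(clean)
```

Normalizing before deduplicating matters here: records 1 and 2 only match once whitespace and letter case are removed from the comparison.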

Expect upgrades and new data quality tools to improve support for unstructured data sources and increase data quality capabilities for genAI use cases.

Another recommendation from Graeme Cantu-Park, CISO of Matillion, focuses on the importance of data lineage. “AI will require a completely different way of looking at governance priorities and practices to have better visibility into the data pipelines and data lineage that feeds AI applications and models.”

Data lineage helps expose the data’s lifecycle and answer questions about who, when, where, why, and how data changes. Because AI expands the scope of data and its use cases, understanding data lineage becomes more important to more people in the organization, including people in security and other risk management functions.
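One way to picture lineage is as a log of events that each record the who, when, where, why, and how of a change. The Python sketch below is an assumption-laden illustration, not any vendor's lineage model: the `LineageEvent` fields, the data set names, and the `upstream` helper are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LineageEvent:
    dataset: str         # what was produced
    sources: tuple       # where the data came from
    actor: str           # who made the change
    operation: str       # how the data changed
    reason: str          # why the change was made
    timestamp: datetime  # when it happened


def upstream(dataset, events):
    """Return every data set that feeds `dataset`, directly or indirectly."""
    # Simplification: assumes one producing event per data set.
    by_output = {e.dataset: e for e in events}
    found, stack = set(), [dataset]
    while stack:
        event = by_output.get(stack.pop())
        if event is None:
            continue  # a raw source with no recorded producer
        for source in event.sources:
            if source not in found:
                found.add(source)
                stack.append(source)
    return found


now = datetime(2024, 1, 1, tzinfo=timezone.utc)  # fixed timestamp for the example
events = [
    LineageEvent("customers_clean", ("crm_export",), "data-steward",
                 "deduplicate", "quality initiative", now),
    LineageEvent("support_embeddings", ("customers_clean", "tickets_raw"),
                 "ml-engineer", "embed", "genAI pilot", now),
]
feeds = upstream("support_embeddings", events)
```

A security reviewer could use a query like `upstream` to see that an AI-facing data set ultimately depends on a CRM export, even though the pipeline never references it directly.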

Review data management and pipeline architectures

Looking beyond policies and data quality, data governance leaders must extend their influence into data management and architecture functions. Proactive data governance enables a set of capabilities so that more employees can leverage data, analytics—and now AI—to do their jobs and make smarter decisions. How data is stored, accessed, productized, cataloged, and documented are all factors in how quickly, easily, and securely organizations will be able to extend their data into genAI use cases.

Hillary Ashton, chief product officer of Teradata, suggests the following ways to make the most exciting AI use cases a reality:

Create reusable data products, or curated sets of known good data, to help the organization better control and instill trust in its data.
Respect data gravity to make information accessible to more people within the workforce without moving data across different environments.
Pilot AI initiatives with scalability in mind, including AI/ML data pipelines with robust governance that also enables an open and connected ecosystem.

A key for data teams is to identify frameworks and platforms that are easy to use and support multiple use cases. Sean Mahoney, general manager and VP at Ensono, says, “Governance frameworks are starting to look more agile to allow teams to respond more quickly to the pace of tech advancements.” He suggests that data governance leaders also review and get involved in these tools:

Data meshes for delegating the management of the data to those creating it.
Vector databases to handle scalability and complexity inherent in generative AI and LLMs.
Real-time monitoring tools to expand data governance across more systems.

Another consideration is how data governance, management, and architecture require understanding global regulations on data storage. EDB’s de Vries recommends, “Enterprises should implement globally distributed databases to elevate their data governance practices by keeping highly regulated data within its region while distributing less restrictive data globally for agility when feeding into AI platforms.”
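The residency idea in that recommendation can be sketched as a simple rule: regulated data stays in its region of origin, while less restrictive data may be distributed. The region names, classification labels, and function names below are hypothetical, and this is not a description of any database product's behavior.

```python
ALL_REGIONS = ("eu", "us", "apac")  # hypothetical deployment regions


def allowed_regions(classification, origin_region):
    """Regions where a record may be stored under this simplified policy."""
    if classification == "unrestricted":
        return list(ALL_REGIONS)
    # Regulated or unknown classifications fail closed: stay in the origin region.
    return [origin_region]


def can_feed_ai_platform(classification, origin_region, platform_region):
    """True if an AI platform hosted in `platform_region` may ingest the record."""
    return platform_region in allowed_regions(classification, origin_region)
```

Treating unknown classifications as regulated is a design choice in this sketch; it errs toward keeping data in place until someone has classified it.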

Extend data governance to genAI workflows

Data governance functions must also define policies and best practices for how employees use genAI tools and LLMs. For example, at the beginning of this article, I explicitly quoted ChatGPT so that readers knew the response came from a genAI source. Good data governance calls for educating employees on procedures to increase transparency, the tools they are permitted to use, and practices that minimize data privacy issues.

“The biggest thing I am seeing is the rise of ways to accurately leverage, share, and learn from data while maintaining privacy and authenticity,” says Deon Nicholas, CEO of Forethought. “For example, LLM-based search engines like Perplexity always cite their sources, or data redaction technologies like Private AI that enable you to scrub and redact PII before ingesting or sending data to LLMs.”
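The redaction step can be pictured with a small regular-expression sketch in Python. This is not how Private AI or any other product works; the three patterns (email, US-style Social Security number, US-style phone number) are assumptions chosen for illustration, and pattern matching alone will miss names, addresses, and many other forms of PII.

```python
import re

# Illustrative patterns only; real redaction needs far broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact(text):
    """Replace each matched PII span with a bracketed label."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


prompt = "Contact jane.doe@example.com or 555-123-4567, SSN 123-45-6789."
safe_prompt = redact(prompt)
```

In this sketch, `safe_prompt` is the only text that would be passed on to an LLM; the original string stays inside the organization.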

One new, proactive measure data governance leaders should consider is creating prompt libraries where employees can record their prompt use cases and share them across the organization. This discipline extends the knowledge management practices that many data governance teams already do around maintaining data catalogs and data dictionaries.
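A prompt library can start as little more than a catalog of named templates with ownership and tagging metadata, much like a data dictionary entry. The Python sketch below is one hypothetical shape for it; the `PromptEntry` fields, the `PromptLibrary` class, and the sample entry are assumptions, not an established standard.

```python
from dataclasses import dataclass, field


@dataclass
class PromptEntry:
    name: str
    template: str   # uses str.format placeholders, e.g. {ticket}
    use_case: str
    owner: str
    tags: set = field(default_factory=set)


class PromptLibrary:
    """In-memory catalog of shared prompts (hypothetical design)."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        if entry.name in self._entries:
            raise ValueError(f"prompt already registered: {entry.name}")
        self._entries[entry.name] = entry

    def search(self, tag):
        """Return the names of prompts carrying `tag`, sorted alphabetically."""
        return sorted(e.name for e in self._entries.values() if tag in e.tags)

    def render(self, name, **values):
        """Fill a stored template with caller-supplied values."""
        return self._entries[name].template.format(**values)


library = PromptLibrary()
library.register(PromptEntry(
    name="summarize_ticket",
    template="Summarize this support ticket in two sentences: {ticket}",
    use_case="customer service triage",
    owner="support-analytics",
    tags={"support", "summarization"},
))
rendered = library.render("summarize_ticket", ticket="Login fails after reset.")
```

Recording an owner and use case with each prompt gives governance teams the same hooks they already rely on for data sets: someone to ask, and a stated purpose to review.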

Nikolaos Vasiloglou, VP of Research ML at RelationalAI, says, “The fuel of LLMs consists of a mix of clean and well-curated content stored usually in a knowledge graph along with expert knowledge that is typically in the form of prompt libraries. While we have good governance practices for knowledge graphs, how to govern the latter is not obvious.”

I love the quote popularized in the Spider-Man movie, “With great power comes great responsibility.” We’re seeing a rapid evolution of genAI capabilities, and the question is whether data governance teams will step up with their side of the equation.
