
What is Data Lineage?

For organizations that depend on data, understanding where data comes from, evaluating its quality and determining its accuracy are essential to supporting the business. Data lineage refers to the data’s “line of descent”: a record of how data got to a specific location and the intermediate steps and transformations that took place as it traveled through business systems. Data lineage essentially provides a map of the data journey that includes all steps along the way, as illustrated below.

“Data lineage is a description of the pathway from the data source to their current location and the alterations made to the data along the pathway.”

Data Management Association (DAMA)

Related to data lineage is the concept of data provenance. Data provenance refers to the source of the data. From the provenance, you can make assumptions about that data's trustworthiness and quality.

Both data warehouse and data lake administrators need to be concerned about tracking data provenance and data lineage. Understanding when and where data originated, who touched it, and how it was modified is a critical aspect of metadata management.
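The distinction between provenance (the origin) and lineage (the steps since) can be sketched as a small record type. This is a minimal illustration rather than any product's data model; the class names, field names and sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class LineageStep:
    """One change applied to a dataset on the way to its current location."""
    actor: str       # who touched the data
    operation: str   # how the data was modified
    at: datetime     # when the change happened


@dataclass
class LineageRecord:
    """Provenance (the origin) plus lineage (the ordered steps since)."""
    dataset: str
    origin: str                                   # provenance: where the data came from
    steps: List[LineageStep] = field(default_factory=list)

    def record(self, actor: str, operation: str) -> None:
        """Append a step, stamped with the current UTC time."""
        self.steps.append(LineageStep(actor, operation, datetime.now(timezone.utc)))

    def describe(self) -> str:
        """Render the pathway from origin to current dataset as one line."""
        parts = [self.origin] + [s.operation for s in self.steps] + [self.dataset]
        return " -> ".join(parts)


# Hypothetical example: point-of-sale data cleaned and aggregated into a report.
record = LineageRecord(dataset="sales_report", origin="pos_system")
record.record("etl_job", "deduplicate")
record.record("analyst_a", "aggregate_by_region")
print(record.describe())
# pos_system -> deduplicate -> aggregate_by_region -> sales_report
```

Keeping the steps in an ordered list is what makes the record a lineage rather than just a provenance tag: the origin alone says where the data came from, while the list says what happened to it afterward.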

Why Does Data Lineage Matter?

Understanding the provenance and lineage of data sources is valuable for several reasons:

  • Evaluating the trustworthiness of data based on its provenance
  • Understanding and correcting sources of error
  • Identifying incorrect assumptions about data that may skew analysis
  • Providing audit trails for data governance and regulatory purposes
  • Ensuring data flows are protected and not subject to tampering
  • Identifying and avoiding data duplication to simplify operations and reduce cost

Organizations need visibility into how information moves through various workflow steps to ensure the quality of query results, business reports, business intelligence (BI) dashboards and training sets. Data quality is enhanced when data engineers can track who made a change and why, how something was updated, and which process was used.

Data Lineage Delivers Significant Business Value

While data lineage may seem like an abstract concept, having full visibility into data throughout its lifecycle can bring value to the business across multiple areas:

  • Improve business performance: better quality data means better analysis and business results
  • Manage regulatory compliance: reduce the cost of compliance with existing and future regulation
  • Handle evolving data sources: build an agile data infrastructure that is responsive to change
  • Reduce IT cost and risk: reduce the cost of application maintenance and new development

Improve business performance – Modern enterprises rely on BI and decision support systems (DSS) for almost every decision they make. Examples include which features to prioritize in new product designs, where to place advertisements, and what sales and marketing strategies to employ to maximize revenue, profitability and customer loyalty. The phrase "garbage in, garbage out" applies to all aspects of analytics. Getting the data wrong can seriously skew the results and undermine business performance.

Manage regulatory compliance and risk – Organizations across all industries need to deal with a variety of regulatory requirements. Some regulatory requirements affect only specific industries. Examples are HIPAA, designed to protect patient information in the healthcare sector, and the Basel Accord meant to mitigate risk in international banking. Other regulations, such as the European Union's General Data Protection Regulation (GDPR), affect all industries. Having metadata that tracks data lineage for data governance purposes reduces risk and compliance-related costs for the business. It also makes it easier and more cost-effective to comply with new regulations that may be introduced in the future.

Handle evolving data sources – As business conditions evolve, systems and data sources change constantly. For example, an analytic application that evaluates customer behavior by looking only at traditional point-of-sale data is almost sure to be wrong. This method of analysis will miss e-commerce orders, in-app purchases, and customers across a variety of other sales channels and demographics. While this seems obvious, data bias and undetected problems with data sources are traps that even the most sophisticated organizations can easily fall into.

Reduce IT cost and risk – What all of the examples above have in common is their dependence on information technology (IT). Organizations that have visibility into datasets and how they are being used can more easily build new applications, and address issues with existing applications more quickly and cost-effectively. Modifying or adding to an analytic application is far easier and more cost-efficient if the sources of data are clear from their metadata.

Managing Data Lineage

In data lake environments, managing data lineage is especially critical. Data lakes contain diverse datasets in different formats that come from a wide variety of sources. For example, data lakes may contain images, video files, log files, documents, raw text or files in formats such as JSON, CSV, Apache Parquet or Optimized Row Columnar (ORC). Also, datasets in the data lake are continually being added to, often at a rapid pace, and a variety of tools may access and process this raw data, resulting in additional derivative datasets.

When these issues of variety and velocity are coupled with the sheer volume of data, it becomes impossible for individuals to manually track the origins and details about every data item. In data lake environments, metadata management must be automated.

Metadata management is a particular concern for data lakes. Metadata is distinct from the data stored in the data lake: it is “the data about the data.” Metadata can take a variety of forms. For example, technical metadata may contain supplementary information about the data type, format and structure (schema). Business metadata may contain information about business objects and descriptions. Operational metadata typically contains information about data processing critical to tracking data lineage.
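As a rough illustration of automated metadata capture, the sketch below derives technical metadata (format, row count and an inferred schema) from a CSV sample and stores it alongside business and operational metadata in one catalog entry. The function names, the entry layout and the sample values are assumptions made for this example, not the structure of any particular catalog.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest type name that fits every non-empty value."""
    for name, cast in (("int", int), ("float", float)):
        try:
            for value in values:
                if value != "":
                    cast(value)
            return name
        except ValueError:
            continue
    return "string"


def technical_metadata(csv_text):
    """Derive format, row count and a column -> type schema from CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    schema = {col: infer_type([row[col] for row in rows]) for col in columns}
    return {"format": "csv", "row_count": len(rows), "schema": schema}


# Hypothetical sample file and catalog entry.
sample = "order_id,amount,channel\n1,19.99,web\n2,5,store\n"
entry = {
    "technical": technical_metadata(sample),
    "business": {"description": "Customer orders across sales channels"},
    "operational": {"loaded_by": "ingest_job", "source": "orders.csv"},
}
print(entry["technical"]["schema"])
# {'order_id': 'int', 'amount': 'float', 'channel': 'string'}
```

Only the technical portion is inferred here. Business metadata usually has to be supplied by people, and operational metadata by the jobs that load and transform the data.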


Tools for Metadata Management

Tracking and managing the vast and diverse datasets in a data lake requires a data catalog and metadata solution more powerful than the basic cataloging features built into most data lake platforms.

First-generation, on-premises data lake environments such as Apache Hadoop have basic features for cataloging data schemas (the Hive Metastore, for example) but lack broader tools for metadata management that can be used to track data lineage.

Cloud data lake stores such as Amazon S3, Azure Blob Storage and Azure Data Lake Storage (ADLS) suffer from similar limitations. Solutions such as the AWS Glue Data Catalog can track data in the context of specific application sources, but most applications involve data from many sources.

Next-Generation Metadata Management

To manage data lakes effectively, data engineers and architects need metadata management tools that enable them to create a semantic layer on top of physical data. Furthermore, users and applications should be able to access data through this abstraction layer. Ideally, a metadata management solution should provide the following capabilities:

  • A standard interface (ODBC, JDBC, etc.) through which BI tools and notebooks can directly query data in the data lake
  • Role-based access controls (RBAC) integrated with enterprise directories to control who can view and modify data
  • Row- and column-based access controls governing data from any source
  • Data masking to hide sensitive information in specific fields
  • Encryption to ensure that data cannot be intercepted or tampered with by unauthorized individuals
  • Auditing to track the integrity of data assets and track lineage through the data lifecycle
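Several of these capabilities can be sketched together in a few lines. The example below applies a role-based column policy, masks a sensitive field and writes an audit entry for every access attempt. The policy structure, role names and `read_row` function are hypothetical and are not drawn from any specific product.

```python
MASKED = "****"

# Hypothetical policy: per role, which columns are visible and which are masked.
POLICY = {
    "analyst": {"visible": {"customer_id", "region", "email"}, "masked": {"email"}},
    "admin": {"visible": {"customer_id", "region", "email", "ssn"}, "masked": set()},
}

# Every access attempt is appended here as (user, role, outcome).
AUDIT_LOG = []


def read_row(row, role, user):
    """Return the row as this role may see it, and audit the attempt."""
    rules = POLICY.get(role)
    if rules is None:
        AUDIT_LOG.append((user, role, "denied"))
        raise PermissionError(f"role {role!r} has no access")
    AUDIT_LOG.append((user, role, "read"))
    return {
        col: (MASKED if col in rules["masked"] else value)
        for col, value in row.items()
        if col in rules["visible"]   # column-based access control
    }


row = {"customer_id": 7, "region": "EMEA", "email": "a@example.com", "ssn": "123"}
print(read_row(row, "analyst", "dana"))
# {'customer_id': 7, 'region': 'EMEA', 'email': '****'}
```

In a real deployment the roles would come from the enterprise directory and the audit log would go to durable storage. Both are kept in memory here to keep the sketch self-contained.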

Data Lake Engines and Data Lineage

Enterprise customers need their data lake engine to support the same metadata management capabilities for tracking data lineage as the enterprise data warehouse (EDW). They need to be able to establish logical views into data via a semantic layer on top of physical data. Data engineers and data architects also need an intuitive way to manage, curate and share data securely without the overhead of physically copying data. For modern data lake engines, all of these capabilities are essential to meeting enterprise requirements related to compliance and data governance.

In traditional data warehouse environments, data scientists and business analysts often extract data from the warehouse to create intermediate files such as data extracts or OLAP cubes. This data may be moved into ungoverned systems such as BI tools, spreadsheets or external databases. By providing fast, direct access to data through logical views, a modern data lake engine can avoid the need to create these ungoverned data extracts, thus improving overall data security and keeping data where it can be properly managed.
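The difference between a physical extract and a logical view can be shown with a small example. SQLite is used here purely as a stand-in for a query engine, and the table and view names are made up. The extract is a copy frozen at the time it was taken, while the view is evaluated against the governed source on every query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EMEA", 10.0), (2, "APAC", 20.0)],
)

# Logical view: stores only the query definition, not a copy of the rows.
conn.execute(
    "CREATE VIEW emea_orders AS SELECT id, amount FROM orders WHERE region = 'EMEA'"
)

# Physical extract: a snapshot that now lives outside the governed table.
extract = conn.execute(
    "SELECT id, amount FROM orders WHERE region = 'EMEA' ORDER BY id"
).fetchall()

# The source changes after the extract was taken.
conn.execute("INSERT INTO orders VALUES (3, 'EMEA', 30.0)")

view_rows = conn.execute("SELECT id, amount FROM emea_orders ORDER BY id").fetchall()
print(extract)    # [(1, 10.0)]
print(view_rows)  # [(1, 10.0), (3, 30.0)]
```

The extract is already stale after a single insert, and nothing in the source system records that it exists. The view stays current and remains subject to whatever access controls and auditing apply to the source.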

To be effective, the provenance and lineage of every dataset handled by the data lake engine need to be carefully tracked. This includes all relationships, virtual datasets, transformations and queries so that users have full visibility into where each virtual dataset came from. Role-based access controls are needed to ensure that only authorized users have access to relevant datasets, and all data access performed through the data lake engine needs to be fully auditable.
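One way to picture this kind of tracking is as a graph in which each virtual dataset points to the datasets it was derived from. The sketch below walks that graph upstream to find the physical sources behind a virtual dataset. The catalog contents and the `upstream_sources` function are hypothetical.

```python
# Hypothetical catalog: each dataset maps to the datasets it was derived from.
# Datasets with no parents are treated as physical sources.
PARENTS = {
    "revenue_dashboard": ["orders_clean", "customers"],
    "orders_clean": ["orders_raw"],
    "orders_raw": [],
    "customers": [],
}


def upstream_sources(dataset, parents):
    """Return the physical datasets that a dataset ultimately depends on."""
    seen, stack, sources = set(), [dataset], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue              # already visited via another path
        seen.add(current)
        deps = parents.get(current, [])
        if not deps:
            sources.add(current)  # no parents: this is a physical source
        stack.extend(deps)
    return sources


print(sorted(upstream_sources("revenue_dashboard", PARENTS)))
# ['customers', 'orders_raw']
```

The same traversal run in the opposite direction (from a source to everything derived from it) answers the impact question: which views and reports are affected if a source changes.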
