
What Is AWS Lake Formation?

Data within the Data Lake

As data volumes and varieties continue to grow, data lakes have become the preferred storage option and a key component of the evolving data architecture. Data in cloud data lakes such as Amazon Simple Storage Service (Amazon S3) is stored in open formats, which gives organizations the flexibility to use best-of-breed services for each analytics or machine learning use case. Using a data lake as the central data repository simplifies data management, because all applications and services access the same datasets for analytics and reporting. AWS Lake Formation makes it easier to create and manage a data lake in two ways: it automates the ingestion and transformation of data in S3, and it centralizes the management of data access controls, which are then applied across the Amazon services used to derive insights from the data.

AWS Lake Formation

AWS Lake Formation is a managed service that makes it easier to set up, secure and manage data lakes. The following are the main features of Lake Formation:

  • Help users discover S3 buckets
  • Create automated workflows to ingest, cleanse and transform raw data
  • Create and manage a data catalog for the data in S3
  • Ensure compliance with a centrally defined permissions model that provides fine-grained access control and is enforced across AWS services

AWS Lake Formation architecture with Dremio

Although these features are available through other Amazon services, Lake Formation acts as an end-to-end orchestration layer for creating and managing a data lake and securely providing access to data.

AWS Glue for Ingestion, Cleansing, Transforming and Cataloging Data

Lake Formation uses the AWS Glue service to execute the ingestion, cleansing, transforming and cataloging of data.

Lake Formation discovers existing S3 buckets and allows users to register data sources that run on Amazon RDS, are hosted in Amazon EC2, or are on-premises databases reached through JDBC. Lake Formation provides blueprints, which are templates for predefined sources that allow users to easily ingest data into S3. A user creates a workflow, which consists of AWS Glue crawlers, jobs and triggers, by selecting a blueprint and supplying three input parameters: the data source (such as a relational database), the S3 bucket to use as the target location, and the frequency at which the data should be synced. Blueprints automatically discover the source’s schema, convert the data to an open format, and maintain a record of data that has already been imported and processed. The data is read by crawlers and imported into S3.
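Because a blueprint produces an ordinary AWS Glue workflow, the resulting crawlers, jobs and triggers can be inspected through the Glue API. The following is a minimal sketch, assuming boto3 is installed and credentials are configured; the helper names and the workflow name passed in are hypothetical, and only the `get_workflow` call belongs to boto3.

```python
from collections import Counter


def summarize_workflow(response: dict) -> dict:
    """Count the nodes of each type (CRAWLER, JOB, TRIGGER) in a GetWorkflow response."""
    nodes = response.get("Workflow", {}).get("Graph", {}).get("Nodes", [])
    return dict(Counter(node["Type"] for node in nodes))


def fetch_workflow_summary(workflow_name: str) -> dict:
    """Fetch a workflow graph from AWS Glue and summarize it.

    boto3 is imported here so the pure helper above can be used without it.
    """
    import boto3

    glue = boto3.client("glue")
    # IncludeGraph=True asks Glue to return the nodes and edges of the workflow.
    response = glue.get_workflow(Name=workflow_name, IncludeGraph=True)
    return summarize_workflow(response)
```

Calling `fetch_workflow_summary` with the name of a workflow created from a blueprint returns a small dictionary of node counts, which is a quick way to confirm what the blueprint generated.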

The metadata of the datasets that are ingested using Lake Formation is stored in the same data catalog that AWS Glue uses. Along with making datasets easily discoverable, Lake Formation allows users to create labels at the table and column level to identify attributes of the data. This is particularly useful for marking sensitive data such as credit card or social security numbers. Lake Formation also relies on AWS Glue for data cleansing and transformation. For example, users can trigger the AWS Glue FindMatches transformation to identify and link records that are conceptually the same but do not share an identifier.
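Marking sensitive columns can be sketched with the LF-tag calls in boto3. Treating LF-tags as the labeling feature described above is an assumption, and the helper names, tag key and tag value are hypothetical; only `add_lf_tags_to_resource` and `create_lf_tag` are boto3 calls.

```python
def build_column_tag_request(database, table, columns, tag_key, tag_value):
    """Build keyword arguments for Lake Formation's AddLFTagsToResource call."""
    return {
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "LFTags": [{"TagKey": tag_key, "TagValues": [tag_value]}],
    }


def tag_columns(database, table, columns, tag_key, tag_value):
    """Attach an existing LF-tag to specific columns of a catalog table."""
    import boto3  # imported lazily so the request builder works without boto3

    client = boto3.client("lakeformation")
    # The tag itself must already exist, for example via
    # client.create_lf_tag(TagKey=tag_key, TagValues=[tag_value]).
    request = build_column_tag_request(database, table, columns, tag_key, tag_value)
    return client.add_lf_tags_to_resource(**request)
```

Keeping the request builder separate from the network call makes it easy to review exactly which columns are being marked before anything is sent to AWS.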

Security Management

A key data management challenge when it comes to providing data access is maintaining security and governance. As mentioned before, using a data lake as the main data repository provides the flexibility to leverage a variety of analytics and machine learning services depending on the use case. This also introduces the challenge of making sure that users consistently see only the data they have access to regardless of the service that they’re using.

Lake Formation addresses this for data in S3 by centralizing the management of data access and security policies across Amazon analytics and machine learning services. Its permissions model is based on a simple grant/revoke mechanism and augments the access controls provided by AWS Identity and Access Management (IAM). Organizations can define rules for users and applications by role, have those rules enforced at the table and column levels, and enforce encryption for data at rest. Because Lake Formation integrates with IAM, authenticated users and roles can be automatically mapped to the policies created in Lake Formation.
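The grant/revoke mechanism can be sketched with boto3 as follows. This is an illustration under stated assumptions rather than a complete setup: the helper names and the principal ARN, database, table and column values are hypothetical, while `grant_permissions` and `revoke_permissions` are the boto3 Lake Formation calls.

```python
def build_select_request(principal_arn, database, table, columns):
    """Build keyword arguments for a column-level SELECT grant or revoke."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


def grant_select(principal_arn, database, table, columns):
    """Grant SELECT on the listed columns to an IAM user or role."""
    import boto3  # imported lazily so the request builder works without boto3

    client = boto3.client("lakeformation")
    return client.grant_permissions(
        **build_select_request(principal_arn, database, table, columns)
    )


def revoke_select(principal_arn, database, table, columns):
    """Revoke the same permission; the request has the same shape as the grant."""
    import boto3

    client = boto3.client("lakeformation")
    return client.revoke_permissions(
        **build_select_request(principal_arn, database, table, columns)
    )
```

A grant made this way is recorded once in Lake Formation instead of being repeated in each analytics service.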

Lake Formation also provides a comprehensive view of AWS CloudTrail audit logs across the Amazon analytics and machine learning services that are integrated with Lake Formation. This gives administrators a holistic view of what data is being accessed and by which users.
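The same audit trail can also be read programmatically. The sketch below assumes the CloudTrail `lookup_events` call filtered on the Lake Formation event source; the helper names are hypothetical, and the summary only counts API calls per user rather than reproducing the console view.

```python
from collections import Counter

LAKE_FORMATION_SOURCE = "lakeformation.amazonaws.com"


def build_lookup_request(max_results=50):
    """Build keyword arguments that restrict CloudTrail events to Lake Formation."""
    return {
        "LookupAttributes": [
            {"AttributeKey": "EventSource", "AttributeValue": LAKE_FORMATION_SOURCE}
        ],
        "MaxResults": max_results,
    }


def count_calls_by_user(events):
    """Count events per (username, event name) pair."""
    return dict(
        Counter((e.get("Username", "unknown"), e["EventName"]) for e in events)
    )


def fetch_recent_events(max_results=50):
    """Return one page of recent Lake Formation events from CloudTrail."""
    import boto3  # imported lazily so the helpers above work without boto3

    client = boto3.client("cloudtrail")
    return client.lookup_events(**build_lookup_request(max_results))["Events"]
```

Passing the result of `fetch_recent_events()` to `count_calls_by_user` gives a compact picture of who called which Lake Formation operation.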

Consistent Data Access Across AWS Services

AWS services that access data consult Lake Formation to determine what data a user is allowed to access, so users see a consistent set of data across all services. Lake Formation currently integrates with AWS Glue, Amazon Athena, Amazon Redshift Spectrum, Amazon QuickSight Enterprise Edition and Amazon EMR.

Since Lake Formation and AWS Glue share the same data catalog, AWS Glue users can access only the databases and tables for which they have permissions in Lake Formation. Amazon Athena and Amazon Redshift Spectrum both reference the AWS Glue catalog, so their users can query only the databases, tables and columns on which they have Lake Formation permissions. Similarly, users who have their AWS Glue catalog registered as a data source in Dremio can query only the objects for which they have Lake Formation permissions. Amazon EMR and Amazon QuickSight both enforce permissions directly from Lake Formation.
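The effect of a single, shared permission store can be illustrated with a toy model. This is plain Python rather than any AWS API: the grants table, principal, table name and service labels are all hypothetical, and real services perform this check internally.

```python
# Toy central permission store: (principal, table) -> columns the principal may read.
GRANTS = {
    ("analyst", "sales.orders"): {"order_id", "order_date", "amount"},
}


def visible_columns(principal, table, grants):
    """Return the set of columns the principal may read from the table."""
    return grants.get((principal, table), set())


def run_query(service, principal, table, requested, grants):
    """Simulate a service that checks the central store before querying."""
    allowed = visible_columns(principal, table, grants)
    denied = [column for column in requested if column not in allowed]
    if denied:
        raise PermissionError(f"{principal} may not read {denied} from {table}")
    return f"{service}: SELECT {', '.join(requested)} FROM {table}"


# Every simulated service consults the same store, so the outcome never
# depends on which service the user happens to query through.
for service in ("athena", "redshift-spectrum", "dremio"):
    print(run_query(service, "analyst", "sales.orders", ["order_id", "amount"], GRANTS))
```

Requesting a column outside the grant raises `PermissionError` for every service, which is the consistency property the paragraph above describes.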

Benefits of Using Lake Formation

Lake Formation consolidates the creation and management of data lakes, relying on AWS Glue for several capabilities such as cataloging data and deduplicating data with FindMatches. What Lake Formation adds on top of these are its two main advantages: blueprints that simplify data ingestion into S3, and the centralization of access controls.

Lake Formation provides blueprints, which are templates that provide flows to ingest data from various data sources. Instead of starting from scratch when developing ingestion workflows for different data sources, users can speed up this process by leveraging a blueprint.

The second benefit of Lake Formation is central permission management, which greatly simplifies security management because administrators no longer need to grant access to data separately in each service. Instead, they grant permissions to a user through Lake Formation, which gives that user a consistent view of data across the Amazon services that are integrated with Lake Formation, as well as in third-party applications such as Dremio (through AWS Glue) or Tableau (through Amazon Redshift or Athena).

With these two capabilities, Lake Formation enables organizations to quickly get their data into S3 and centrally maintain data access controls, which are then transparently reflected when users work with services such as Dremio for their analytics use cases.