
5 minute read · January 14, 2021

Building a Better Data Lake on Amazon S3 with Dremio

Louise Westoby · Head of Product & Partner Marketing, Dremio

When it comes to storing large datasets, cloud-based data lakes are where the action is. The Amazon Simple Storage Service (S3) has emerged as a preferred data lake platform for good reasons. S3 is secure, scalable, flexible and offers excellent performance. It is also attractive because users pay only for the amount and class of storage they need. S3-based object stores can hold virtually any data. With optimized file formats such as the column-oriented Apache Parquet and the row-oriented Apache Avro, improved metadata management and SQL-oriented query tools, it is increasingly practical to run queries directly against S3-resident data.

Traditional Data Lake Solutions Fall Short

Unfortunately, even with SQL query tools such as Hive and Presto, data lakes still fall short for many applications. This is especially true for business intelligence (BI) and decision support systems (DSS). There are two key limitations:

  1. Data lake query performance is far too slow to support popular reporting and analysis tools such as Tableau and Power BI
  2. Traditional data lake solutions often lack necessary data governance and security controls

To work around these limitations, users often find themselves extracting data subsets from the data lake and replicating them in a data warehouse. Extracting, transforming and loading datasets into a format where they can be queried efficiently usually requires corporate IT assistance. This slows time to insight, adds costs and frequently undermines the benefit that the data lake was meant to achieve. It also complicates data governance because sensitive data is replicated into ungoverned data extracts, cubes and aggregation tables.

Business analysts and data scientists struggle to find the right balance between investments in the data lake and the data warehouse. Data lakes are scalable and cost-effective but lack the query performance and data governance features of a data warehouse. Ideally, enterprises would like tools that can query and analyze data in S3 directly at interactive speed, without having to copy data into other systems or compromise on performance or security.



Cloud Data Lake Engines Offer a Better Alternative

Fortunately, a new breed of cloud data lake engine can help organizations avoid this trade-off for both transactional and non-transactional workloads. Enabled by new open source technologies, the Dremio cloud data lake engine delivers lightning-fast queries. It provides a 100x improvement in BI query speed and a 4x improvement in ad hoc query speed running against S3 object storage and metadata solutions such as AWS Glue or the Hive metastore.

The cloud data lake engine also provides a self-service semantic layer enabling data analysts and engineers to easily manage, curate and share virtual datasets. Datasets are exposed via standard interfaces, and access is managed via centralized data governance and security policies. The semantic layer implements data governance features similar to a full-featured data warehouse, including granular row- and column-based access controls, data masking, encryption, auditing and more. A cloud data lake engine can access data from multiple sources, including other data lakes, file storage and various relational and non-relational data stores, providing a unified view of data assets to data scientists and business analysts.
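As a rough illustration of what such a governance layer does, the toy Python sketch below applies a centralized policy of row-level filtering and column masking before a "virtual dataset" is exposed. All names here (`POLICY`, `virtual_dataset`, `mask`) are hypothetical and are not Dremio's API:

```python
def mask(value):
    """Mask all but the last 4 characters, e.g. of an account number."""
    return "*" * (len(value) - 4) + value[-4:]

# Hypothetical centralized policy: which columns to mask, which rows
# a given user is allowed to see.
POLICY = {
    "masked_columns": {"account"},
    "row_filter": lambda row: row["region"] == "us-east",
}

def virtual_dataset(rows, policy):
    """Apply the governance policy, then yield the curated view."""
    for row in rows:
        if not policy["row_filter"](row):
            continue  # row-level access control
        yield {
            col: mask(val) if col in policy["masked_columns"] else val
            for col, val in row.items()
        }

raw = [
    {"account": "1234567890", "region": "us-east", "sales": 120.0},
    {"account": "9876543210", "region": "eu-west", "sales": 98.2},
]
curated = list(virtual_dataset(raw, POLICY))
print(curated)  # only us-east rows, with account numbers masked
```

The point of the semantic layer is that this policy lives in one place: every BI tool or notebook querying the virtual dataset gets the same governed view, rather than each data extract re-implementing (or skipping) the controls.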

By using a cloud data lake engine, organizations can realize multiple benefits:

  1. Reduced cost by avoiding the need to extract data into separate data warehouses or aggregation tables to meet BI and data science application requirements
  2. Faster time to insight by avoiding reliance on corporate IT to implement ETL workflows and provide suitable data extracts
  3. Improved data security and governance with centralized data access controls regardless of the underlying data source
  4. Improved productivity and collaboration between BI and data science users with a common view of enterprise data

Dremio – Built for Cloud Data Lakes

With a flexible multi-engine architecture scalable from one to thousands of nodes, the Dremio cloud data lake engine takes advantage of the AWS cloud’s underlying elasticity. It maximizes concurrency and performance and dramatically reduces infrastructure costs by scaling engines based on workload. The cloud data lake engine is easily deployable on Amazon S3. It also works seamlessly with other AWS data management solutions such as Amazon RDS, Amazon Redshift and other data sources. For enterprise users who want to ensure flexibility and portability, Dremio runs on premises and across multiple clouds. It can be deployed with AWS CloudFormation, or in Kubernetes, Docker or Apache Hadoop environments.

Learn More

Dremio’s cloud data lake engine can help organizations strike the right balance between data lake and data warehouse investments while simultaneously reducing cost and complexity. It does this while helping enterprises avoid vendor lock-in and data duplication, and enabling users to keep full control of their data. Download our free white paper Building a Modern Architecture for Interactive Analytics on Amazon S3 Using Dremio to learn how to get started, or visit Dremio.com.
