Dipankar Mazumdar

Developer Advocate, Dremio

Dipankar is currently a Developer Advocate at Dremio, where his primary focus is educating data practitioners such as engineers, architects, and scientists on Dremio’s lakehouse platform and on open-source projects such as Apache Iceberg and Apache Arrow that help data teams apply and scale analytics. In his past roles, he worked at the intersection of machine learning and data visualization. Dipankar holds a Master’s in Computer Science, and his research area was Explainable AI.

Dipankar Mazumdar's Articles and Resources

Guides

What Is a Data Lakehouse?

As the name suggests, a data lakehouse architecture combines a data lake and a data warehouse. It is not merely an integration of the two; the idea is to bring together the best of both architectures: the reliable transactions of a data warehouse and the scalability and low cost of a data […]

Read more ->

Blog Post

Getting Started with Flink SQL and Apache Iceberg

Apache Flink is an open-source data processing framework for handling batch and real-time data. While it supports building diverse applications, including event-driven and batch analytical workloads, Flink stands out particularly for streaming analytical applications. What gives it a solid edge with real-time data are features such as event-time processing, exactly-once semantics, high throughput, […]

Read more ->

Gnarly Data Waves Episode

Gnarly Data Waves: Apache Iceberg Office Hours

Get all your Apache Iceberg questions answered at Apache Iceberg office hours. Questions on architecture, migration, and anything else are welcome!
Read more ->

Blog Post

Streamlining Data Quality in Apache Iceberg with write-audit-publish & branching

Data quality is a pivotal aspect of any data engineering workflow, as it directly impacts downstream analytical workloads such as business intelligence and machine learning. For instance, you may have an ETL job that extracts customer data from an operational source and loads it into your warehouse. What if the source contains inconsistent […]

Read more ->
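The write-audit-publish (WAP) flow named in the title above can be sketched in a few lines. This is a hypothetical, in-memory illustration of the pattern only, not Iceberg's actual branching API: stage new rows on a branch, audit them, and only then publish to the main table. The function and field names (`audit`, `write_audit_publish`, `customer_id`) are invented for this sketch.

```python
def audit(rows):
    """Audit step: reject batches where any row is missing a customer id."""
    return all(r.get("customer_id") is not None for r in rows)

def write_audit_publish(main_table, staged_rows):
    branch = list(staged_rows)     # "write": stage rows on a branch
    if not audit(branch):          # "audit": validate before anyone reads it
        return main_table          # bad data never reaches consumers
    return main_table + branch     # "publish": fast-forward main to the branch

table = [{"customer_id": 1}]
table = write_audit_publish(table, [{"customer_id": 2}])     # passes audit
table = write_audit_publish(table, [{"customer_id": None}])  # rejected
```

The point of the pattern is that readers of `main_table` only ever see audited data; the failed batch is dropped (or, in practice, quarantined for inspection) without any cleanup of the main table.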

Gnarly Data Waves Episode

Data as Code with Dremio Arctic: ML Experimentation & Reproducibility on the Lakehouse

In this episode of Gnarly Data Waves, we will discuss how Dremio Arctic and data as code enable data science use cases like machine learning experimentation and reproducibility on a consistent view of your data in a no-copy architecture.
Read more ->

Gnarly Data Waves Episode

What’s New in the Apache Iceberg Project: Version 1.2.0 Updates, PyIceberg, Compute Engines

In this episode of Gnarly Data Waves, Dremio’s Developer Advocate, Dipankar, will highlight some of the key new capabilities added to the Apache Iceberg project in version 1.2.0, along with discussions around compute engines & the…
Read more ->

Blog Post

Introducing the Apache Iceberg Catalog Migration Tool

In the Apache Iceberg world, a catalog is a logical namespace that contains the information needed to fetch metadata about tables. A catalog acts as a centralized repository for managing tables and their versions, facilitating operations such as creating, updating, and deleting tables. Most importantly, the catalog holds the reference to […]

Read more ->
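The core idea in the teaser above, a catalog as a mapping from table name to the location of the table's current metadata file, can be modeled in a few lines. This is a toy, in-memory sketch for intuition only; real catalogs (Hive Metastore, Nessie, AWS Glue, etc.) persist this mapping durably and swap the pointer atomically, and the class and method names here are invented:

```python
class ToyCatalog:
    """Toy model of an Iceberg-style catalog: table name -> current
    metadata file location, updated with an optimistic compare-and-swap."""

    def __init__(self):
        self._tables = {}  # name -> metadata file location

    def create_table(self, name, metadata_location):
        if name in self._tables:
            raise ValueError(f"table {name} already exists")
        self._tables[name] = metadata_location

    def commit(self, name, expected_location, new_location):
        # The swap succeeds only if the caller saw the latest pointer,
        # which is how concurrent writers are detected.
        if self._tables.get(name) != expected_location:
            raise RuntimeError("concurrent update detected")
        self._tables[name] = new_location

    def load_table(self, name):
        return self._tables[name]

cat = ToyCatalog()
cat.create_table("db.orders", "s3://bucket/orders/metadata/v1.json")
cat.commit("db.orders",
           "s3://bucket/orders/metadata/v1.json",
           "s3://bucket/orders/metadata/v2.json")
```

Because every table is just a named pointer, migrating between catalogs amounts to copying these pointers, which is what a catalog migration tool automates.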

Blog Post

Exploring Branch & Tags in Apache Iceberg using Spark

The Apache Iceberg 1.2.0 release brings a range of exciting new features and bug fixes. The release centers on changes to the core Iceberg library and compute engines, together with a couple of vendor integrations, making the ecosystem of tools and technologies around the ‘open’ table format extremely robust. Among the noteworthy features is […]

Read more ->

Gnarly Data Waves Episode

Apache Iceberg Office Hours: Gnarly Data Waves

Get all your Apache Iceberg questions answered at Apache Iceberg office hours. Questions on architecture, migration, and anything else are welcome!
Read more ->

Blog Post

Dealing with Data Incidents Using the Rollback Feature in Apache Iceberg

Imagine you are a data engineer on the platform engineering team of your company’s analytics organization. Your responsibilities include building data pipelines and infrastructure to make data available and to support analytical workflows such as business intelligence (BI) and machine learning (ML) across the organization. In the past, your analytical workloads used to run on […]

Read more ->
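The rollback idea behind the post above is easy to picture with a toy model: every commit appends an immutable snapshot, and rolling back simply moves the "current" pointer to an earlier snapshot. This is an illustrative sketch only, with invented names; Iceberg itself exposes rollback through table operations and engine procedures rather than anything like this class:

```python
class ToyTable:
    """Toy model of snapshot-based table history with rollback."""

    def __init__(self):
        self.snapshots = []   # immutable history of table states
        self.current = -1     # index of the live snapshot

    def commit(self, state):
        # Discard any "future" snapshots if we previously rolled back,
        # then append the new state and point at it.
        self.snapshots = self.snapshots[: self.current + 1]
        self.snapshots.append(state)
        self.current = len(self.snapshots) - 1

    def rollback_to(self, snapshot_index):
        # No data is rewritten: only the pointer moves.
        self.current = snapshot_index

    def read(self):
        return self.snapshots[self.current]

t = ToyTable()
t.commit({"rows": 100})   # good load
t.commit({"rows": -5})    # bad load from a data incident
t.rollback_to(0)          # readers instantly see the good state again
```

The useful property is that rollback is a metadata-only operation: recovering from a bad load takes effect immediately, without rewriting any data files.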

Subsurface Session

Apache Arrow Flight SQL: High Performance, Simplicity, and Interoperability for Data Transfers

Network protocols for transferring data generally have one of two problems: they’re slow for large data transfers but have simple APIs (e.g. JDBC) or they’re fast for large data transfers but have complex APIs specific to the system. Apache Arrow…
Read more ->

Gnarly Data Waves Episode

Optimizing Data Files in Apache Iceberg: Performance strategies

Optimized query speed is a must when processing hundreds of petabytes of data on the data lake, especially as data grows over time. Join Dremio’s Developer Advocate, Dipankar Mazumdar, as he walks through the various performance strategies available in Apache…
Read more ->

Page

Data Lakehouse

What Is a Data Lakehouse? A data lakehouse combines the performance, functionality, and governance of a data warehouse with the scalability and cost advantages of a data lake. With a data lakehouse, engines can access and manipulate data directly from data lake storage without copying data into expensive proprietary systems via ETL pipelines. Learn more […]

Read more ->

Gnarly Data Waves Episode

Dive into Data Waves with Apache Iceberg Office Hours

Get all your Apache Iceberg questions answered at Apache Iceberg office hours. Questions on architecture, migration, and anything else are welcome!
Read more ->

Gnarly Data Waves Episode

Migrating a BI Dashboard to your Data Lakehouse with Apache Superset and Dremio

Dashboards are the backbone of an organization’s decision-making process. Join Dremio Developer Advocate Dipankar Mazumdar to learn how to easily migrate a BI dashboard (Apache Superset) to your data lakehouse for faster insights.
Read more ->

Blog Post

5 Easy Steps to Migrate an Apache Superset Dashboard to Your Lakehouse

Every organization considers dashboards a key asset in its decision-making process. Now, as organizations invest more and more in their data strategy, they increasingly focus on making dashboards self-service. The idea is to let users at any level, irrespective of their technical expertise, access these reports and answer critical […]

Read more ->

Blog Post

Managing Data as Code with Dremio Arctic: Support Machine Learning Experimentation in Your Data Lakehouse

Unlike software engineering, which is usually backed by established theoretical concepts, the world of machine learning (ML) takes a slightly different approach when it comes to productionizing a data product (a model). As with any new scientific discipline, machine learning leans a bit more toward the empirical to determine […]

Read more ->

Blog Post

A Notebook for getting started with Project Nessie, Apache Iceberg, and Apache Spark

Trying out any new project with dependencies, and integrating a couple of technologies, can be a bit daunting at first. However, it doesn’t have to be that way. Developer experience is critical to everything we do on the Dremio Tech Advocacy team. So, through this notebook, the idea is to simplify configurations, etc., […]

Read more ->

Blog Post

Apache Arrow’s Rapid Growth Over the Years

As co-creators of Apache Arrow, here at Dremio it’s been really exciting over the past several years to see its tremendous growth, bringing more usage, ecosystem adoption, capabilities, and users to the project. Today Apache Arrow is the de facto standard for efficient in-memory columnar analytics that provides high performance when processing and transporting large […]

Read more ->

Blog Post

Puffins and Icebergs: Additional Stats for Apache Iceberg Tables

The Apache Iceberg community recently introduced a new file format called Puffin. Hold on: we already have Parquet and ORC. Do we really need another file format, and does it give us additional benefits? The short answer is yes! Until now, we had two ways of gathering statistics for efficient query […]

Read more ->

Blog Post

How Z-Ordering in Apache Iceberg Helps Improve Performance

This tutorial introduces the Z-order clustering algorithm in Apache Iceberg and explains how it adds value to the file optimization strategy.

Read more ->
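The Z-order idea from the tutorial above is just bit interleaving: records sorted by an interleaved key stay close together on disk when they are close in *every* clustered column, not only the first one. Here is a minimal sketch of a 2-D Z-order (Morton) key; real engines apply this per file during compaction, but the keying itself is exactly this:

```python
def z_order_key(x: int, y: int, bits: int = 8) -> int:
    """Interleave the bits of x and y into a single Z-order (Morton) key.

    Sorting rows by this key preserves locality in both dimensions,
    which is the intuition behind Z-order clustering of data files.
    """
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # even bit positions <- x
        key |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions  <- y
    return key

# Points near each other in 2-D end up with nearby keys:
points = [(7, 7), (0, 1), (1, 1), (0, 0), (1, 0)]
ordered = sorted(points, key=lambda p: z_order_key(*p))
```

Contrast this with a plain sort on `x` then `y`: there, rows with similar `y` but different `x` land far apart, so filters on `y` alone prune few files; the interleaved key keeps both columns useful for pruning.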

Blog Post

Building a Tableau Dashboard Directly on the Data Lake with Dremio

A hands-on tutorial for building a Tableau dashboard directly on the data lake using Dremio.

Read more ->

Blog Post

A Hands-On Look at the Structure of an Apache Iceberg Table

This tutorial provides a practical deep dive into the internals of Apache Iceberg using Dremio Sonar as the engine.

Read more ->

Blog Post

Problems with Monolithic Data Architectures & Why Data Mesh Is a Solution

Over the past few years, more and more enterprises have wanted to democratize their data to make it more accessible and usable for critical business decision-making throughout the entire organization. This created a significant focus on making data centrally available and led to the popularization of monolithic data architectures. In theory, with monolithic data architectures […]

Read more ->

Guides

What Is a Data Mesh?

Data mesh is a decentralized approach to data management that focuses on domain-driven design (DDD). It aims to bring data closer to business units or domains, where the people responsible for generating and governing the data treat it as a product. A data mesh is an architectural approach to designing data-driven applications. It provides a way […]

Read more ->

Blog Post

The Origins of Apache Arrow & Its Fit in Today’s Data Landscape

This blog post features the history behind Apache Arrow and how it addresses modern challenges in today’s data landscape.

Read more ->