8 minute read · April 26, 2022

How eMAG Created BI Infrastructure in Days with Dremio’s Open Lakehouse Platform

Heather Mackey · Marketing Director, Dremio

The Choice: A Traditional Data Warehouse or a Modern Data Stack on AWS?

Determining Sources and Targets

Extracting and Ingesting Data

Orchestration with Apache Airflow for Customizable Flexibility

Dremio for Lightning-Fast Queries on S3 Using Apache Iceberg

First Dashboards in Production in Days with Tableau and Dremio

Best Practices and Lessons Learned

To get new analytics initiatives off the ground quickly, many data teams today face a defining question. Should they go with the traditional data warehouse systems that they know? Or should they take a modern approach with an open lakehouse? To support a new business initiative, online retailer eMAG seized the opportunity to rapidly create a BI infrastructure on AWS using Dremio’s open lakehouse platform.

Fast-growing eMAG is an e-commerce company based in Romania. It is part of the multinational Naspers Group, one of the largest technology investors in the world. Freshful is the newest addition to the eMAG group, the first exclusive online retailer in Romania, delivering fresh groceries from the company’s warehouse in under two hours.

The Choice: A Traditional Data Warehouse or a Modern Data Stack on AWS?

To be successful, the new business needed analytics infrastructure that would make multi-source data available for analysts. The team considered taking a familiar approach: using a Microsoft SQL data warehouse (which they had substantial experience with), SQL Server Integration Services (SSIS) as the ETL tool, and QlikView as a reporting tool. The drawbacks to this approach were several:

Very high costs
Inability to harvest cluster processing power
Low scalability (data was growing so fast, traditional tools weren’t able to keep up)

Challenged to think outside the box, data and analytics lead Marius Costin and cloud data engineer Bogdan Miclaus decided to go with a modern data stack in the cloud, even though it would mean a learning curve for the team. The Freshful app and website were hosted on Amazon Web Services, making Amazon S3 the natural choice for cloud data storage. For the team, the benefits to this approach included:

The opportunity to learn and use new and exciting technologies
Using open source tools and open standards
Low cost at the beginning of the startup
High scalability
Eliminating infrastructure maintenance

The team knew that it wanted a customizable solution with a lot of flexibility. It wanted to get up and running quickly. Here is the process it went through and the choices that Costin and team made along the way. Watch the story presented by Marius Costin of eMAG during Subsurface LIVE Winter 2022.

Determining Sources and Targets

The first step was to determine Freshful’s data sources, how it would be extracted, and where it would be stored. The list included:

Traffic data generated by the Freshful app and web site. The source was Google BigQuery and the target was Apache Parquet files in S3 buckets.
Master and financial data generated by the ERP system. The source was SAP and the target was Parquet files in S3 buckets. This data was exported into RabbitMQ queues.
Order data generated by the app and web site. The source was Amazon RDS, the target was Parquet files in S3 buckets. This was also queried directly using Dremio for real-time purposes.

Extracting and Ingesting Data

After the team determined the data sources and where the data should be stored, they had to figure out how they could extract the sources, load them, and transform them.

To bring the raw data from the sources to the staging S3 bucket, the eMAG team used Python scripts with Pandas. They executed Python custom scripts on an Apache Airflow cluster to import data into the staging layer. They chose this option because it offered flexibility - although they may revisit the choice as data volumes grow.

For transformation, the team split workloads into big and small. For the small workloads, they use Python scripts with Pandas to take the raw data that was ingested into the staging S3 bucket before processing it and moving it to the reporting S3 bucket. Scripts read files from the staging bucket, deduplicate the data and write the data into S3 partitioned files.

For big workloads, they went with Spark on Amazon EMR. They launch an EMR cluster from Apache Airflow then submit a Spark cluster that reads the files from the staging S3 bucket, calculates and aggregates the data and writes the result in a Parquet file using Apache Iceberg tables. They also have an Airflow Sensor to check the status of the job to see if it finished successfully

Orchestration with Apache Airflow for Customizable Flexibility

Next, the team needed something to schedule and orchestrate their workflows so that they would run at a given time and in a given order.

They decided to go with Amazon Managed Workflows for Apache Airflow (MWAA), using an Airflow cluster on Amazon and creating Airflow DAGs (directed acyclic graphs) that run scripts for importing tables and processing them.

Dremio for Lightning-Fast Queries on S3 Using Apache Iceberg

The team needed to read the orders in real time. In order to read the data from S3, provide fast queries and enable data discovery for power users, the eMAG team needed a query engine.

The team chose Dremio to go over the S3 reporting layer. Dremio provides eMAG with lightning-fast queries directly over S3 using the Iceberg table format. The team considered various other solutions and technologies - including Amazon Athena, Redshift Spectrum, Hive, and Impala - but the team preferred Dremio. Its focus on SQL made it easy for the team to use and implement business logic.

First Dashboards in Production in Days with Tableau and Dremio

With all the data processed and loaded, the team needed a way to show it to the world. For visualization, they chose Tableau. It connects directly to the Dremio cluster to run queries, has great visualizations, is customizable, user-friendly, and highly dynamic.

Once everything was set up, Costin and team were able to decrease the time they spent on ETL and processing the data so they could spend more time creating dashboards. The team got Dremio into production within 28 days. The timeline from creating the infrastructure with Dremio to dashboards in production took less than a month.

Best Practices and Lessons Learned

As Costin put it in his remarks at Subsurface, “We like Dremio because we know a lot of SQL.” With their deep familiarity with SQL, the team found Dremio simple to use. Simplicity ties in to best practices and lessons learned nearly a year after the project’s start date. “Don’t overcomplicate things,” Costin advised.

To learn more, read the eMAG case study and get started with Dremio’s open lakehouse platform.

Article Topics

Dremio Blog: Open Data Insights

How eMAG Created BI Infrastructure in Days with Dremio’s Open Lakehouse Platform

Table of Contents

The Choice: A Traditional Data Warehouse or a Modern Data Stack on AWS?

Determining Sources and Targets

Extracting and Ingesting Data

Orchestration with Apache Airflow for Customizable Flexibility

Dremio for Lightning-Fast Queries on S3 Using Apache Iceberg

First Dashboards in Production in Days with Tableau and Dremio

Best Practices and Lessons Learned

Ready to Get Started?

Table of Contents

The Choice: A Traditional Data Warehouse or a Modern Data Stack on AWS?

Determining Sources and Targets

Extracting and Ingesting Data

Orchestration with Apache Airflow for Customizable Flexibility

Dremio for Lightning-Fast Queries on S3 Using Apache Iceberg

First Dashboards in Production in Days with Tableau and Dremio

Best Practices and Lessons Learned

Additional Resources

The Why and How of Using Apache Iceberg on Databricks

Intro to Dremio, Nessie, and Apache Iceberg on Your Laptop

5 Use Cases for the Dremio Lakehouse

Ready to Get Started?