Get Started Free
No time limit - totally free - just the way you like it.Sign Up Now
To get new analytics initiatives off the ground quickly, many data teams today face a defining question. Should they go with the traditional data warehouse systems that they know? Or should they take a modern approach with an open lakehouse? To support a new business initiative, online retailer eMAG seized the opportunity to rapidly create a BI infrastructure on AWS using Dremio’s open lakehouse platform.
Fast-growing eMAG is an e-commerce company based in Romania. It is part of the multinational Naspers Group, one of the largest technology investors in the world. Freshful is the newest addition to the eMAG group, the first exclusive online retailer in Romania, delivering fresh groceries from the company’s warehouse in under two hours.
To be successful, the new business needed analytics infrastructure that would make multi-source data available for analysts. The team considered taking a familiar approach: using a Microsoft SQL data warehouse (which they had substantial experience with), SQL Server Integration Services (SSIS) as the ETL tool, and QlikView as a reporting tool. The drawbacks to this approach were several:
Challenged to think outside the box, data and analytics lead Marius Costin and cloud data engineer Bogdan Miclaus decided to go with a modern data stack in the cloud, even though it would mean a learning curve for the team. The Freshful app and website were hosted on Amazon Web Services, making Amazon S3 the natural choice for cloud data storage. For the team, the benefits to this approach included:
The team knew that it wanted a customizable solution with a lot of flexibility. It wanted to get up and running quickly. Here is the process it went through and the choices that Costin and team made along the way. Watch the story presented by Marius Costin of eMAG during Subsurface LIVE Winter 2022.
The first step was to determine Freshful’s data sources, how it would be extracted, and where it would be stored. The list included:
After the team determined the data sources and where the data should be stored, they had to figure out how they could extract the sources, load them, and transform them.
To bring the raw data from the sources to the staging S3 bucket, the eMAG team used Python scripts with Pandas. They executed Python custom scripts on an Apache Airflow cluster to import data into the staging layer. They chose this option because it offered flexibility – although they may revisit the choice as data volumes grow.
For transformation, the team split workloads into big and small. For the small workloads, they use Python scripts with Pandas to take the raw data that was ingested into the staging S3 bucket before processing it and moving it to the reporting S3 bucket. Scripts read files from the staging bucket, deduplicate the data and write the data into S3 partitioned files.
For big workloads, they went with Spark on Amazon EMR. They launch an EMR cluster from Apache Airflow then submit a Spark cluster that reads the files from the staging S3 bucket, calculates and aggregates the data and writes the result in a Parquet file using Apache Iceberg tables. They also have an Airflow Sensor to check the status of the job to see if it finished successfully
Next, the team needed something to schedule and orchestrate their workflows so that they would run at a given time and in a given order.
They decided to go with Amazon Managed Workflows for Apache Airflow (MWAA), using an Airflow cluster on Amazon and creating Airflow DAGs (directed acyclic graphs) that run scripts for importing tables and processing them.
The team needed to read the orders in real time. In order to read the data from S3, provide fast queries and enable data discovery for power users, the eMAG team needed a query engine.
The team chose Dremio to go over the S3 reporting layer. Dremio provides eMAG with lightning-fast queries directly over S3 using the Iceberg table format. The team considered various other solutions and technologies – including Amazon Athena, Redshift Spectrum, Hive, and Impala – but the team preferred Dremio. Its focus on SQL made it easy for the team to use and implement business logic.
With all the data processed and loaded, the team needed a way to show it to the world. For visualization, they chose Tableau. It connects directly to the Dremio cluster to run queries, has great visualizations, is customizable, user-friendly, and highly dynamic.
Once everything was set up, Costin and team were able to decrease the time they spent on ETL and processing the data so they could spend more time creating dashboards. The team got Dremio into production within 28 days. The timeline from creating the infrastructure with Dremio to dashboards in production took less than a month.
As Costin put it in his remarks at Subsurface, “We like Dremio because we know a lot of SQL.” With their deep familiarity with SQL, the team found Dremio simple to use. Simplicity ties in to best practices and lessons learned nearly a year after the project’s start date. “Don’t overcomplicate things,” Costin advised.