Implementing a Self-Hosted Data Lake on AWS

Viktor Kessler

In my previous blog post, Next-Gen Data Analytics - Open Data Architecture, I discussed how the data architecture landscape has changed lately. The vast amount of data and the speed at which it is being generated make it almost impossible to handle using traditional data storage and processing approaches.

To meet the performance and efficiency demands placed on it today, a successful data lake must satisfy the following criteria:

  • Fast data access without complex ETL processes or cubes
  • Self-service data access without data movement or replication
  • Security and governance
  • An easily searchable semantic layer

In this tutorial, I’ll show you how to create a self-hosted data lake on AWS and use Dremio’s Data Lake Engine to work with it.

Let’s put an engine on your open data lake!

Real-life implementation for self-hosted data lake

To create this implementation, I will use the following components:

  • Zenko Scality (S3 Interface)
  • Dremio
  • Docker
  • Python 3

On a node with Docker pre-installed, run the Scality container:

docker run -d --name cloudserver -p 8000:8000 -e SCALITY_ACCESS_KEY_ID=AccessKey -e SCALITY_SECRET_ACCESS_KEY=SecretKey -e REMOTE_MANAGEMENT_DISABLE=1 zenko/cloudserver

Now let’s import some data into Scality via the S3 interface and Python. We will upload airport data (CSV) and arrival delay data (Parquet).

import boto3
import os

path = '/Users/viktor/Dremio/data/'
bucket = 'flights'

client = boto3.client(
    's3',
    aws_access_key_id='AccessKey',
    aws_secret_access_key='SecretKey',
    endpoint_url='http://127.0.0.1:8000'
)


def loadData():
    client.create_bucket(Bucket=bucket)

    with open(path + 'airport.csv', "rb") as fb:
        client.upload_fileobj(fb, bucket, 'airport.csv')

    files = os.listdir(path + 'AirportDelays/')
    for f in files:
        with open(path + 'AirportDelays/' + f, "rb") as fb:
            client.upload_fileobj(fb, bucket, 'AirportDelays/' + f)


if __name__ == "__main__":
    loadData()

In this next step, we will run Dremio:

docker run --quiet -p 9047:9047 -p 31010:31010 -p 45678:45678 dremio/dremio-oss

Once Dremio is up and running, we need to allow the two containers to communicate with each other, since for now both are deployed on the same node. To do this, we will create a Docker network and attach both containers to it.

docker network create scality-dremio
docker network connect scality-dremio [Dremio’s container ID]
docker network connect scality-dremio [Scality’s container ID]
docker network inspect scality-dremio

In the output, locate the entry under “Containers” whose “Name” is “cloudserver” and copy its IPv4Address.

"Containers": {
          "9a90436bf9...": {
                "Name": "cloudserver",
                "EndpointID": "46f50cab08726d4989...",
                "MacAddress": "02:42:ac:13:00:03",
                "IPv4Address": "172.19.0.3/16",
                "IPv6Address": ""
            },
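If you prefer to grab the address programmatically rather than eyeballing the JSON, a small sketch like the following can parse the `docker network inspect` output (the sample JSON below is a truncated stand-in for real inspect output; your container ID and addresses will differ):

```python
import json

# Truncated sample of `docker network inspect scality-dremio` output
inspect_output = """
[
  {
    "Name": "scality-dremio",
    "Containers": {
      "9a90436bf9": {
        "Name": "cloudserver",
        "IPv4Address": "172.19.0.3/16"
      }
    }
  }
]
"""

def cloudserver_ip(inspect_json: str) -> str:
    """Return the IPv4 address (without the subnet mask) of the
    container named 'cloudserver'."""
    networks = json.loads(inspect_json)
    for network in networks:
        for container in network["Containers"].values():
            if container["Name"] == "cloudserver":
                return container["IPv4Address"].split("/")[0]
    raise LookupError("no container named 'cloudserver' found")

print(cloudserver_ip(inspect_output))  # → 172.19.0.3
```

In practice you would pipe the real inspect output into this, e.g. `docker network inspect scality-dremio | python extract_ip.py`.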

Now we are ready to switch to the Dremio UI. Open your browser and go to localhost:9047. Initially, Dremio will ask you to create a new user. Fill in all of the required fields to create one.

After you have registered, the Dremio UI will appear.

Let’s add our Scality source to Dremio to be able to analyze data. Click “Add Source”. A new window will pop up, where we will choose Amazon S3.

Now, type in the name of the source, along with the AccessKey and SecretKey that you defined when running Scality. Do not save yet. On the left menu bar, choose Advanced Options.

Check “Enable compatibility mode (experimental)” and add a new connection property, fs.s3a.endpoint, set to [your Scality IP address]:8000.

Finally, save the source settings. In the Dremio UI, you will see that a new source has been added. Click the source and you will see the bucket that we have previously created.

Now select the bucket. Inside it, we see one CSV file and one folder (AirportDelays) that contains Parquet files.

We want to make the dataset more understandable to our business users. To import the metadata, click on the symbol below Actions.

A new window will appear in which Dremio recognizes the Parquet format. Click the Save button. Dremio will format this file as a dataset, which is only a metadata representation of the data; Dremio does not copy the data.

Dremio will open a preview window of the AirportDelays data.

Now repeat the same action with airport.csv. Go to Datasets, choose the scality-s3 source, open the flights bucket, and format the CSV file as a dataset.

As a result, we will have two physical datasets (PDS) in our bucket.

Now we can start analyzing this data. First, click on AirportDelays. In the SQL Editor, we can write different statements. Let’s count the number of rows. We have approximately 48 million rows.

Then run a “select * from AirportDelays” statement. We only have the airport codes; the real city names are missing. The city name information resides in the airport.csv file. We can easily join the information from the CSV to our Parquet-based AirportDelays data.

For this join, we can either write a SQL statement directly or use Dremio’s UI wizard. Click Join, just above the OP_CARRIER column header. In the new window, search for airport.csv.

Now define the type of join (Inner Join) and drag and drop the join condition: ORIGIN on one side and field E (which contains the airport code) on the other.

The result is a join across two different sources. To make the joined dataset more readable, we can rename column C to City in the UI.

Next, we will analyze the information about delays by date and city. Click the Group By button, located to the left of the Join button that we clicked previously.

Drag and drop FL_DATE and City into the Dimensions area, and add DEP_DELAY as a metric, computing both count and sum.
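Conceptually, this is a group-by with two dimensions and two aggregates on the same measure. A pandas sketch of the equivalent aggregation on toy data (same illustrative columns as before):

```python
import pandas as pd

# Toy joined dataset (in Dremio this is the ~48M-row join built above)
joined = pd.DataFrame({
    "FL_DATE": ["2019-01-01", "2019-01-01", "2019-01-02"],
    "City": ["New York", "San Francisco", "New York"],
    "DEP_DELAY": [12.0, 3.0, 45.0],
})

# Group by the two dimensions and compute both metrics on DEP_DELAY
delays_by_city = (
    joined.groupby(["FL_DATE", "City"])["DEP_DELAY"]
    .agg(["count", "sum"])
    .reset_index()
)
print(delays_by_city)
```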

Now that we are happy with our dataset, we can save the result as a virtual data set (VDS). Still, no data has actually been copied from our data lake.

Go to Datasets at the top of the page and in your space you will see the new “delays” dataset.

Select the dataset. Now we want to analyze it directly in Tableau.

A small file (.tds) with connection information has been downloaded. Open that file. Tableau will launch and prompt you for your login credentials. These are the credentials that we set up earlier.

After successfully logging in, we can see our dataset. Tableau can now run directly on the data lake storage.

Reading roughly 50 million rows from S3-compatible storage took about 12 seconds.

If the response time is still not sufficient and needs to be faster, we can enable acceleration and build a Dremio Data Reflection.

If you rerun your Tableau query, you will see that it now takes less than a second.

Conclusion

We have built a data lake on an S3-compatible object store, Scality, and we enabled Data Engineers to easily wrangle and curate data. By following these steps, data consumers can query their data directly from the tools that they are familiar with without having to wait for complex data pipelines to be built to satisfy their data demands.

To learn more about Dremio, visit our tutorials and resources as well as Dremio University, our free online learning platform, where you can deploy your own Dremio virtual lab. If you have any questions, visit our community forums, where we are all eager to help.