In my previous blog post, Next-Gen Data Analytics - Open Data Architecture, I discussed how the data architecture landscape has changed lately. The vast amount of data and the speed at which it is being generated make it almost impossible to handle using traditional data storage and processing approaches.
For a successful data lake to deliver the performance and efficiency demanded today, it must meet the following criteria:
- Fast data access without complex ETL processes or cubes
- Self-service data access without data movement or replication
- Security and governance
- An easily searchable semantic layer
In this tutorial, I’ll show you how to create a self-hosted data lake on AWS and use Dremio’s Data Lake Engine to work with it.
Let’s put an engine on your open data lake!
Real-life implementation of a self-hosted data lake
To create this implementation, I will use the following components:
- Scality CloudServer as the S3-compatible object storage
- Dremio Data Lake Engine (the dremio-oss Docker image)
- Docker to run both containers on a single node
- Python with boto3 to load the sample data
- Tableau as the BI client
On a node with Docker pre-installed, run Scality:
docker run -d --name cloudserver -p 8000:8000 -e SCALITY_ACCESS_KEY_ID=AccessKey -e SCALITY_SECRET_ACCESS_KEY=SecretKey -e REMOTE_MANAGEMENT_DISABLE=1 zenko/cloudserver
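Before loading any data, you can quickly verify that CloudServer is reachable over the S3 API. Here is a minimal sketch, assuming the container runs on the same machine and uses the AccessKey/SecretKey values from the command above:
import boto3
# Point an S3 client at the local Scality CloudServer endpoint
client = boto3.client(
    's3',
    aws_access_key_id='AccessKey',
    aws_secret_access_key='SecretKey',
    endpoint_url='http://127.0.0.1:8000'
)
# An empty bucket list simply confirms that the endpoint answers S3 calls
print(client.list_buckets().get('Buckets', []))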
Now let’s import some data into Scality via the S3 interface and Python. We will upload airport (CSV) and arrival delay (Parquet) data.
import boto3
import os

path = '/Users/viktor/Dremio/data/'
bucket = 'flights'

# S3 client pointed at the local Scality CloudServer endpoint
client = boto3.client(
    's3',
    aws_access_key_id='AccessKey',
    aws_secret_access_key='SecretKey',
    endpoint_url='http://127.0.0.1:8000'
)

def loadData():
    # Create the target bucket and upload the airport lookup CSV
    client.create_bucket(Bucket=bucket)
    with open(path + 'airport.csv', "rb") as fb:
        client.upload_fileobj(fb, bucket, 'airport.csv')
    # Upload every Parquet file from the AirportDelays folder
    files = os.listdir(path + 'AirportDelays/')
    for f in files:
        with open(path + 'AirportDelays/' + f, "rb") as fb:
            client.upload_fileobj(fb, bucket, 'AirportDelays/' + f)

if __name__ == "__main__":
    loadData()
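To confirm the upload succeeded, you can list the objects in the bucket. A small check, assuming the same credentials, endpoint, and bucket as above:
import boto3
client = boto3.client(
    's3',
    aws_access_key_id='AccessKey',
    aws_secret_access_key='SecretKey',
    endpoint_url='http://127.0.0.1:8000'
)
# Print the first page of keys in the flights bucket: airport.csv plus the AirportDelays/ Parquet files
resp = client.list_objects_v2(Bucket='flights')
for obj in resp.get('Contents', []):
    print(obj['Key'], obj['Size'])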
In the next step, we will run Dremio:
docker run --quiet -p 9047:9047 -p 31010:31010 -p 45678:45678 dremio/dremio-oss
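Dremio takes a little while to bootstrap. Rather than refreshing the browser, you can poll the UI port from Python until it answers; a small sketch, assuming the default port 9047 on localhost:
import time
import requests
# Keep probing the Dremio UI port until it responds (or give up after roughly two minutes)
for _ in range(60):
    try:
        requests.get('http://localhost:9047', timeout=2)
        print('Dremio UI is up')
        break
    except requests.exceptions.RequestException:
        time.sleep(2)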
Once Dremio is up and running, we need to allow the two containers to communicate with each other; for now, both are deployed on the same node. Therefore, we will create a Docker network and connect both containers to it.
docker network create scality-dremio
docker network connect scality-dremio [Dremio's container ID]
docker network connect scality-dremio [Scality's container ID]
docker network inspect scality-dremio
In the output, under “Containers”, find the entry whose “Name” is “cloudserver” and copy its IPv4Address.
"Containers": {
"9a90436bf9...": {
"Name": "cloudserver",
"EndpointID": "46f50cab08726d4989...",
"MacAddress": "02:42:ac:13:00:03",
"IPv4Address": "172.19.0.3/16",
"IPv6Address": ""
},
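If you prefer not to copy the address by hand, the same value can be pulled out of the inspect output with a few lines of Python. A sketch, assuming the network name created above and that the docker CLI is available on your PATH:
import json
import subprocess
# 'docker network inspect' prints a JSON array with one entry per network
out = subprocess.check_output(['docker', 'network', 'inspect', 'scality-dremio'])
network = json.loads(out)[0]
# Find the cloudserver container and strip the /16 suffix from its address
for container in network['Containers'].values():
    if container['Name'] == 'cloudserver':
        print(container['IPv4Address'].split('/')[0])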
Now we are ready to switch to the Dremio UI. Open your browser and go to localhost:9047. Initially, Dremio will ask you to create a new user. Fill in all of the required fields to create one.
After you have registered, the Dremio UI will appear.
Let’s add our Scality source to Dremio so we can analyze the data. Click “Add Source”. A new window will pop up, where we will choose Amazon S3.
Now type in a name for the source, along with the AccessKey and SecretKey that you defined when running Scality. Do not save yet. In the menu bar on the left, choose Advanced Options.
Check “Enable compatibility mode (experimental)” and add a new connection property: fs.s3a.endpoint : [your Scality IP address]:8000
Finally, save the source settings. In the Dremio UI, you will see that a new source has been added. Click the source and you will see the bucket that we have previously created.
Now select the bucket. Inside the bucket, we see one CSV file and one folder (AirportDelays) that contains Parquet files.
We want to make the dataset more understandable to our business users. To import the metadata, click on the symbol below Actions.
A new window will appear, which will recognize the Parquet format. Click the Save button. Dremio will format this file as a dataset, which is only a metadata representation of data. Dremio is not copying the data.
Dremio will open a preview window of the AirportDelays data.
Now repeat the same action with airport.csv. Go to Datasets, choose the scality-s3 source, open the flights bucket, and format the CSV file as a dataset.
As a result, we will have two physical datasets (PDS) in our bucket.
Now we can start analyzing this data. First, click on AirportDelays. In the SQL Editor, we can write different statements. Let’s count the number of rows. We have approximately 48 million rows.
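These checks can also be scripted. Dremio exposes a REST API on the same port as the UI; the sketch below submits the row count programmatically. It assumes the default port 9047, a source named scality-s3 (use whatever name you gave it), and placeholder credentials for the user you created:
import time
import requests
DREMIO = 'http://localhost:9047'
# Log in with the user created in the Dremio UI (placeholder credentials)
token = requests.post(DREMIO + '/apiv2/login',
                      json={'userName': 'viktor', 'password': 'your-password'}).json()['token']
headers = {'Authorization': '_dremio' + token}
# Submit the row count against the Parquet-backed physical dataset
sql = 'SELECT COUNT(*) AS row_cnt FROM "scality-s3".flights.AirportDelays'
job_id = requests.post(DREMIO + '/api/v3/sql', headers=headers, json={'sql': sql}).json()['id']
# Poll until the job finishes, then fetch the result rows
while True:
    state = requests.get(DREMIO + '/api/v3/job/' + job_id, headers=headers).json()['jobState']
    if state in ('COMPLETED', 'FAILED', 'CANCELED'):
        break
    time.sleep(1)
print(requests.get(DREMIO + '/api/v3/job/' + job_id + '/results', headers=headers).json()['rows'])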
Then run a “select * from AirportDelays” statement. We only have airport codes and are missing the real city names. The city name information resides in the airport.csv file. We can easily join the information from the CSV with our Parquet-based AirportDelays data.
For this join we can either write a SQL statement directly or use Dremio’s UI wizard. Click Join, just above the OP_CARRIER column header. In the new window, search for airport.csv.
Now define the type of join (Inner Join) and drag and drop the join condition: ORIGIN and field E (which contains the airport code).
The result is a join across two different sources. To make a field of the joined dataset more readable, we can rename column C to City in the UI.
Next, we will analyze the delay information by date and city. Click the Group By button, which is to the left of the Join button that we clicked previously.
Drag and drop FL_DATE and City into the Dimensions area, and add DEP_DELAY twice as a metric: once as Count and once as Sum.
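The statement the UI wizard builds is roughly equivalent to a single SQL query that joins the two datasets and aggregates the delays. A sketch of that query, assuming the source is named scality-s3 and using the column letters Dremio assigned to the header-less CSV (E for the airport code and C for the city, as above); you can paste the statement into the SQL Editor, or submit it with the same REST call used earlier:
import requests
DREMIO = 'http://localhost:9047'
# Placeholder credentials for the user created in the Dremio UI
token = requests.post(DREMIO + '/apiv2/login',
                      json={'userName': 'viktor', 'password': 'your-password'}).json()['token']
# Join the delay facts to the airport lookup, then aggregate departure delays by date and city
sql = '''
SELECT d.FL_DATE,
       a.C AS City,
       COUNT(d.DEP_DELAY) AS delay_count,
       SUM(d.DEP_DELAY)   AS delay_sum
FROM "scality-s3".flights.AirportDelays AS d
INNER JOIN "scality-s3".flights."airport.csv" AS a
  ON d.ORIGIN = a.E
GROUP BY d.FL_DATE, a.C
'''
resp = requests.post(DREMIO + '/api/v3/sql',
                     headers={'Authorization': '_dremio' + token},
                     json={'sql': sql})
print(resp.json())  # contains the job id; results can be fetched as in the earlier sketch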
Now that we are happy with our dataset, we can save the result as a virtual dataset (VDS). Still, no data has actually been copied from our data lake.
Go to Datasets at the top of the page and in your space you will see the new “delays” dataset.
Select the dataset; we now want to analyze it directly in Tableau. Click the Tableau button, and a small file (.tds) with connection information will be downloaded. Open that file. Tableau will launch and prompt you for login credentials; these are the Dremio credentials that we set up earlier.
After successfully logging in, we can see our dataset. Tableau can now run directly on the data lake storage.
We can see that reading nearly 50 million rows from S3-compatible storage took about 12 seconds.
If the response time is still not sufficient, we can enable acceleration and build a Dremio Data Reflection.
If you rerun your Tableau query, you will see that it now takes less than a second.
Conclusion
We have built a data lake on an S3-compatible object store, Scality, and we enabled Data Engineers to easily wrangle and curate data. By following these steps, data consumers can query their data directly from the tools that they are familiar with without having to wait for complex data pipelines to be built to satisfy their data demands.
To learn more about Dremio, visit our tutorials and resources as well as Dremio University, our free online learning platform, where you can deploy your own Dremio virtual lab. If you have any questions, visit our community forums, where we are all eager to help.