Building a Cloud Data Lake on Azure with Dremio and ADLS
ADLS Gen2 is a second-generation blob storage service provided by Azure, bringing together the features of ADLS Gen1 and Azure Blob Storage. ADLS Gen2 is the preferred way to store datasets on Azure for data processing and analytics, enabling companies to store large volumes of data at a low cost with very little administration.
Dremio is a Data-as-a-Service platform that enables business analysts and data scientists to easily analyze data in ADLS and other Azure and on-premise data sources using standard SQL and BI tools. With Dremio, companies can create next-generation cloud data lakes on ADLS, making data easy to consume while providing the interactive performance that users demand. Dremio executes the queries directly against ADLS (and other sources) while leveraging patent-pending technology to accelerate query execution so that there’s no need to load the data into other systems, such as data marts, cubes, aggregation tables and BI extracts. The data in ADLS can reside in a variety of file formats, including Parquet, ORC, JSON and text-delimited (e.g., CSV).
Dremio provides business analysts and data scientists with a OneDrive- or Office 365-like experience for the company’s data, making it easy to discover and explore datasets, curate new virtual datasets, and collaborate with other users within the company. The system’s deep integration with Azure Storage and other sources, combined with its distributed SQL query engine, make it possible for users to interact with any data at any time. The engine provides ANSI SQL capabilities, including complex joins, large aggregations, common table expressions, sub-selects, window functions, statistical functions and more. With Power BI’s built-in Dremio connector, as well as Dremio’s ODBC, JDBC, REST and Arrow Flight (beta) interfaces, it is easy to use any client application to query the data.
In this tutorial, we will show how you can build a cloud data lake on Azure using Dremio. We will walk you through the steps of creating an ADLS Gen2 account, deploying a Dremio cluster using our newly available deployment templates, followed by how to ingest sample data to then consume it using Power BI.
Create an ADLS Gen2 account and file system
To create an ADLS Gen2 file system, start by selecting Storage accounts in the Azure portal and then clicking the Add button:
Choose the subscription and resource group to which this storage account should belong, and then provide a unique name and choose the Location. In order to ensure that this storage account will support ADLS Gen2, set the Account kind to StorageV2:
You can also change Replication to achieve the desired redundancy. In our case, Zone-redundant storage (ZRS) is sufficient because we want to tolerate a datacenter outage, but we are not looking to tolerate a regional outage. Note that you can change the replication strategy at a later time, so for production data lakes we recommend leveraging multiple storage accounts so that you can apply different replication strategies for different datasets.
Once you’re happy with the settings, click Next : Advanced > and enable Hierarchical namespace. This is the parameter that distinguishes an ADLS Gen2 account from a vanilla Azure Blob Storage V2 account. Hierarchical namespace enables atomic operations on directories within the Azure Storage account, thereby accelerating a variety of operations. In addition, it enables file ACLs.
Click the Review + create button at the bottom, and make sure the settings look good:
Click Create, and once the deployment is complete, you’ll see a notification in the Azure portal. Click on the Go to resource button:
Alternatively, use the Azure Portal search bar to find the new ADLS Gen2 account (we named it dremiodatalake):
Once you select the dremiodatalake resource (i.e., storage account), you’ll see a dedicated page for managing your Azure Storage account:
Notice the Data Lake Gen2 file systems link. Before you can load data into ADLS Gen2, you need to create a file system. An Azure Storage account with ADLS Gen2 (i.e., hierarchical namespace) enabled can have one or more file systems at the top level of the account. Files and directories reside within a file system. Note that a file system in an ADLS Gen2 account is analogous to a blob in an Azure Storage account that does not utilize hierarchical namespace. Click + File system to add a file system.
At this point you can select the new file system, and you should see something like this:
All these objects can be confusing, so let’s recap what they are:
|dremiodatalake||An ADLS Gen2 account, also known as an Azure Storage account with hierarchical namespace enabled|
|datalake||The resource group containing the dremiodatalake Azure Storage account.|
|datalake-filesystem||A file system within the dremiodatalake Azure Storage account.|
Deploy a Dremio cluster
Dremio is a cloud-native data lake solution that can run on ADLS Gen2. There are two ways to run a Dremio environment on Azure:
- Azure Resource Manager (ARM) is the recommended approach for evaluations and ephemeral clusters.
- Azure Kubernetes Service (AKS) is the recommended approach for production long-lived clusters.
In this section we use ARM to deploy Dremio on Azure. Locate and click the Deploy to Azure button on the Dremio website:
This button leverages an ARM template to launch a Dremio cluster:
Create a new resource group for this compute cluster (that way it’s easier to tear down or replace this compute cluster later). It’s a good practice to use the same name for Resource group and Cluster Name. Note that the cluster name is simply used as a prefix for the names of all the resources that this template creates inside the resource group.
Choose the desired cluster size from the list of t-shirt sized options. When you’re done, click Purchase. Note that you are not actually purchasing anything from Dremio, but we are spinning up some Azure VMs).
Once you click Purchase, Azure will deploy a Dremio cluster in the specified resource group. This usually take 5-7 minutes. When it’s done, locate the resource group and click on it:
Assuming the deployment was successful, you’ll see numerous Azure resources inside the resource group:
Choose the Deployments tab, and select Microsoft.Template (this represents the ARM template we used to deploy Dremio). Then choose the Outputs tab, and copy the URL:
Paste this URL into a browser to access the Dremio UI, and then create your first user:
To see how many nodes are in the cluster, check out the Admin > Node Activity page.
Notice that this page indicates there is 1 coordinator and 5 executors, which is consistent with our selection of the Small (5 executors) option when deploying the cluster. The coordinator is the edge node of the cluster, and is responsible for serving the UI and planning SQL queries, while the executors are responsible for executing the query. You can also list the executors by running this SQL command:
1 select * from sys.nodes
To run this query in the Dremio UI, simply click the New Query button at the top, type the query in the box, and click the Run button on the top-right of the screen (you’ll want to change the mode from Preview to Run before you do this):
If you ever need to access the coordinator node for troubleshooting purposes, simply ssh to the same domain name specified in the URL:
1 2 3 4 5 6 7 **$ ssh email@example.com** The authenticity of host 'dremio-master-spebbxbjntlvy.westus2.cloudapp.azure.com (22.214.171.124)' can't be established. ECDSA key fingerprint is SHA256:MQ7oOZGvvO9h2cpsJ5L5leAtP1s4oVJ7kyroZ/Lv/lI. Are you sure you want to continue connecting (yes/no)? yes Warning: Permanently added 'dremio-master-spebbxbjntlvy.westus2.cloudapp.azure.com,126.96.36.199' (ECDSA) to the list of known hosts. Password: m [jdoe@datalake-cluster ~]$
Open the datalake-cluster resource group in the Azure Portal, and click on the datalake-clusters-executors virtual machine scale set:
Here you’ll be able to access the Scaling tab to scale the cluster (i.e., executors) up and down:
Connect the Dremio cluster to the ADLS Gen2 file system
On the main Dremio UI screen, click the Add Source button. Because ADLS Gen2 is a type of Azure Storage account, choose Azure Storage as the data source type:
Choose a business-friendly name for this source, and enter the connection details including the Azure Storage account name (dremiodatalake). Make sure StorageV2 is selected (indicating a Gen2 account), and enter one of your access keys for the ADLS Gen2 storage account, which you can find in the Azure Portal:
The data source dialog in the Dremio UI should now look like this:
Click Save to add this source. You should now be able to browse the file systems, directories and files in your ADLS Gen2 account by clicking on the new source in the Dremio UI:
Ingest sample data into the data lake
There are two tools provided by Microsoft that make it easy to upload data from your laptop (or a server) into an ADLS Gen2 file system: Azure Data Explorer (GUI) and AzCopy (CLI). Both tools work on a variety of platforms, including Windows, Mac and Linux. In this tutorial we’ll use Azure Data Explorer, which you can download from Microsoft. Once you’ve downloaded the application, launch it and enter your Azure credentials:
Notice that Azure Data Explorer presents numerous types of accounts under the Azure subscription. For example, in the preceding screenshot you can see Cosmos DB accounts, ADLS Gen1 accounts and Azure Storage Accounts. Notice that our ADLS Gen2 account is listed under Storage Accounts - this is because ADLS Gen2, unlike ADLS Gen1, is a type of Azure Storage account, rather than a completely different account type.
Click dremiodatalake (ADLS Gen2) to access the ADLS Gen2 account we created earlier for our data lake. Then expand Blob Containers, and click on the file system we created earlier:
This file system (datalake-filesytem) is the root of our data lake storage. We can add directories and files underneath. Click the New Folder button at the top and create a folder called crime. Then download the San Francisco Police Department’s public dataset “Police Department Incident Reports: 2018 to Present” and upload it into that ADLS folder by clicking the Upload button in Azure Data Explorer:
You should see something like this:
Let’s head back into the Dremio UI and make sure we can see this file. Choose the data source named lake, and then click datalake-filesystem and click crime.
To parse this file you’ll need to change Line Delimiter to LF - Unix/Linux (\n) and enable the checkbox Extract Field Names:
Click Save and notice that there is now a purple dataset icon next to the file name:
The purple icon indicates that this is now considered a physical dataset rather than just a file. The dataset is recognized by Dremio in its data catalog and exposed to any BI tool that connects to Dreimo. Keep in mind that in many real-world scenarios a dataset is not just a single file, but a folder with many files. To promote a folder to physical dataset, click on the dataset promotion button while hovering over the folder:
Run your first SQL query on the sample data
Let’s proceed to run a SQL query on this dataset. Click New Query in the Dremio UI and enter the following SELECT statement:
1 SELECT * FROM lake."datalake-filesystem".crime."Police_Department_Incident_Reports__2018_to_Present.csv"
Click the Preview button to run this query on a sample of the data, or the Run button to run it on the entire dataset:
All datasets in Dremio, regardless of the data source, are represented by a unique dot-separated path. In this example, the path consists of the following parts:
- lake: The data source name (which you entered when creating the Azure Storage source in Dremio)
- datalake-filesystem: The ADLS Gen2 file system name (which you created in the Azure Portal after creating a storage account).
- crime: A folder inside the file system.
- Police_Department_Incident_Reports__2018_to_Present.csv: The name of the physical dataset. In this case it’s a file name. If the dataset is a folder with many files (potentially in a hierarchy of folders), then the dataset name is the folder name.
Note that quotes are required when identifiers contain special characters such as hyphens, or when they are “reserved” words. You can also click the clipboard icon next to a dataset in the Dremio UI to copy the path to the clipboard:
Clicking on a dataset in the Dremio UI is another way to create a “SELECT * FROM” query. Once you enter the dataset explorer screen, simply click the little triangle at the top-left to make the SQL editor visible:
Create new virtual datasets
When clicking on a dataset, you enter the dataset explorer screen which allows you to explore, curate and annotate the dataset:
For example, go ahead and cast the Incident Time column to a time datatype by clicking on the little Abc icon in the column header:
Choose the format HH24:MI (keeping the original column name) and click Apply:
Notice that the purple dataset icon, representing a physical dataset, changed to a green dataset icon, representing a virtual dataset. In addition, the path to the dataset was replaced by the word New Query:
If you’re familiar with SQL, you can expand the SQL Editor see the definition of this new virtual dataset:
1 SELECT "Incident Datetime", "Incident Date", **TO_TIME("Incident Time", 'HH24:MI', 1)** AS "Incident Time", ... FROM lake."datalake-filesystem".crime."Police_Department_Incident_Reports__2018_to_Present.csv"
Notice that the data prep interactions in the Dremio UI simply transform the SQL definition of the virtual dataset, so we’re not actually creating a copy of the data.
Now that we’ve converted the Incident Time field to an actual time type, let’s convert the Incident Date field to a date type. Choose a custom format (YYYY/MM/DD) to match the existing format of this column in the CSV file:
Let’s further modify this dataset to only include weekend crimes. If you’re familiar with SQL, go ahead and add a WHERE clause to the SELECT statement. Let’s see how we can do this graphically as well. We can use the column selector to narrow in on a subset of the columns. Type day and notice how the dataset grid adjusts to show only the columns that include the term day in them:
Click the drop-down arrow on the column header and select Keep Only from the menu of actions:
You can now select Saturday and Sunday from the presented histogram:
Click Apply to apply the filter in this virtual dataset. You’ll notice that the SELECT statement now has this additional clause:
1 WHERE "Incident Day of Week" IN ('Saturday', 'Sunday')
We can save this new virtual dataset for future use, but first let’s create a shared space in which to save it. To create a shared space, open the main screen in a new tab by right-clicking the Dremio logo at the top, and click the Add Space button to create a new space called demo.
Back in the previous tab, click Save As at the top of the screen save the newly created virtual dataset in the demo space. Call it weekend crimes and click Save:
The virtual dataset now has a new name, demo.”weekend crimes”:
And you can, of course, run any SQL query on this virtual dataset like any other dataset in Dremio:
1 SELECT "Incident Category", COUNT( * ) AS Count_Star FROM demo."weekend crimes" GROUP BY "Incident Category" ORDER BY Count_Star DESC
Create a Power BI visualization on the sample data
Dremio supports a wide variety of BI and data science tools, such as Power BI, Tableau and Jupyter Notebooks. In this section we’ll use Power BI Desktop, a free download from Microsoft, to connect to our Dremio-powered data lake and visualize the San Francisco crime data from the previous sections.
Make sure you’ve installed the Dremio ODBC driver which you can download for free from the Dremio website (https://www.dremio.com/drivers/). Once installed, launch Power BI Desktop and click the Get Data button. Choose the Dremio data source and click Connect:
In the next dialog, enter the address of the Dremio coordinator (this is the same address you’ve been using in your browser to access the Dremio UI). In addition, change Data Connectivity mode to DirectQuery:
Enter your Dremio username and password and click Connect:
In the Navigator dialog, notice that any folder that contains datasets - whether physical datasets within a source system, or virtual datasets within a space - is shown in the tree on the left. Expand the demo space to reveal the weekend crimes virtual dataset. Click the checkbox and click Load (note that Power BI is not actually loading the data, because we are in DirectQuery mode):
Once you’ve clicked Load, Power BI will retrieve the metadata from Dremio, and present the fields in the weekend crimes dataset in the Fields panel on the right. To create a bar chart showing the most prevalent incident categories, drag the bar chart icon (in the Visualizations panel) onto the canvas. Then drag the Incident Category field into both the Axis and Value shelves (the latter because we’re just counting):
Within a few seconds, you will see the chart populated. Dremio is enabling Power BI to run queries directly on files in ADLS Gen2.
Join data across disparate object stores, databases and locations
The cloud has made it increasingly common to have related data stored in different systems and locations. For example, when migrating from an on-premise environment to Azure, or from another cloud provider to Azure, you will likely find yourself having to combine data from multiple sources (e.g., Big Data on ADLS and reference data in an on-premise SQL Server). Dremio makes it easy to join data from different sources. In addition, its abstraction layer, powered by virtual datasets, makes it possible to move data over time without having to change any queries, reports and applications.
For the purpose of this tutorial we’ll use a very simple example, leveraging a file-based dataset in Amazon S3. Click the Add Source button in the Dremio UI, choose Amazon S3, and connect to the public bucket samples.dremio.com without authentication:
Once you’ve added the source, located and click the dataset remotelake.”samples.dremio.com”.”SF weather 2018-2019.csv”. Format the dataset by choosing the LF - Unix/Linux (\n) as the line delimiter, and selecting the Extract Field Names option:
Convert the DATE column to an actual date type by clicking the datatype icon in the column header:
Keep the default format selection (YYYY-MM-DD), and click Apply. Save this as a new virtual dataset in our demo space under the name weather:
We should now have two datasets in the demo space: weekend crimes and weather:
Let’s join the two datasets and save this join as a new virtual dataset that you can then query from Power BI or any other client. We can do this graphically, or we can type a SQL statement. Let’s try the latter this time. Click New Query at the top. To make your life easier, set the context (similar to a USE statement in SQL) by clicking the little pencil icon at the bottom right-corner of the SQL editor, and then picking the demo space:
Enter the following SQL statement:
1 SELECT * FROM weather w, "weekend crimes" c WHERE w."DATE" = c."Incident Date"
Click Preview and then Save As to save this as a virtual dataset called weekend crimes with weather:
Any time this virtual dataset is queried, Dremio performs a distributed join between the crime dataset in ADLS Gen2, and the weather dataset in S3. It’s as simple as that!
In this tutorial we discussed what is ADLS Gen2 and how Dremio, the Data-as-a-Service platform enables business analysts and data scientists to easily analyze data in ADLS and other Azure and on-premise data sources using standard SQL and BI tools.
Additionally, we walked through the steps of using templates to easily deploy a Dremio cluster on Azure, we showed you how to perform all the necessary steps to take advantage of this deployment not only to analyze data stored in ADLS but also we shared the steps to join data across multiple object stores, data bases and locations to finally consume using Power BI.
We hope you enjoyed this tutorial, stay tuned for more tutorials and resources to learn how you can use Dremio to start gaining insights from your data, faster.