13 minute read · April 22, 2019
How to Analyze ADLS Data Using R and The Data Lake Engine
· Dremio Team
Intro
Azure Data Lake is a scalable data storage and analytics service. It makes it easy to store data of any format and size and to run processing and analytics across platforms and languages. Azure Data Lake also provides a variety of functions for data management and governance, and you can integrate ADLS with operational stores and data warehouses to make your applications more powerful.
In this tutorial, we will analyze data stored in Azure Data Lake Store with R. To combine these two tools, we will use Dremio, which will also be helpful for performing data preprocessing. Then we will connect to Dremio from R and build a simple regression model.
Assumptions
We assume that you have Dremio and the ODBC driver installed; if not, go to Dremio’s deploy page, pick the installation for your operating system, and visit the Dremio Docs to read about installation and deployment. You should also have an Azure account and a storage account. In addition, we will be working with R and RStudio, so you need to have them installed and configured as well.
Importing data into Azure Data Lake Store
This time we will analyze the house pricing dataset available on Kaggle. This dataset contains house sale prices for King County for homes sold between May 2014 and May 2015.
To import data into Azure Data Lake Storage, go to the Azure All services page, open Storage, and choose Data Lake. Then add a container and upload the data as shown in the screenshots below.
Connecting to Dremio
When the data is uploaded, log in to Dremio and choose the option to add a new source.
Before connecting, we need to do some preparation:
1. Go to the Azure Portal and select Azure Active Directory from the left navigation bar. Then select App Registrations.
2. Click Endpoints and copy the OAUTH 2.0 TOKEN ENDPOINT into the Dremio source form.
3. Click New application registration. Enter a Name, choose Web app / API as the type, and enter a Sign-on URL. Any valid URL will work here.
4. Click on the registered app and copy the Application ID into the Dremio source form.
5. While in the registered app, select Keys under API Access. Enter a Description and select an expiration. Click Save and copy the Value into the Dremio source form.
6. Finally, in Data Explorer -> Access, give your app permission to read, write, and execute data.
Data curation in Dremio
Now, we can begin data preprocessing. First, let’s drop the id and date columns. These columns are often useful, especially date when the analysis involves change over time, but in our case we can drop them.
We can see that all columns are typed as text. We need to convert them to numeric data types. Click on “Abc” next to a column name or choose “Convert Data Type” from the list.
Dremio also provides handy and powerful filtering, with three different methods: you can write a custom condition yourself or use functions from a list. The Preview option helps you understand what you will get as a result. Let’s exclude houses that have fewer than 2 bathrooms:
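Conceptually, the custom condition for this filter can be written as plain SQL. Here is a hand-written sketch; in Dremio’s generated script below, the same logic appears as ( "bathrooms" < 2 ) IS FALSE, applied after the type conversions:

-- Keep only houses with at least two bathrooms
SELECT *
FROM ADLS.house
WHERE bathrooms >= 2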
After all the preparations, Dremio will automatically generate the final SQL script:
SELECT
  CONVERT_TO_FLOAT(price, 1, 1, 0) AS price,
  CONVERT_TO_INTEGER(bedrooms, 1, 1, 0) AS bedrooms,
  CONVERT_TO_FLOAT(bathrooms, 1, 1, 0) AS bathrooms,
  CONVERT_TO_INTEGER(sqft_living, 1, 1, 0) AS sqft_living,
  CONVERT_TO_INTEGER(sqft_lot, 1, 1, 0) AS sqft_lot,
  CONVERT_TO_FLOAT(floors, 1, 1, 0) AS floors,
  CONVERT_TO_INTEGER(waterfront, 1, 1, 0) AS waterfront,
  CONVERT_TO_INTEGER(view, 1, 1, 0) AS view,
  CONVERT_TO_INTEGER(house."condition", 1, 1, 0) AS "condition",
  CONVERT_TO_INTEGER(grade, 1, 1, 0) AS grade,
  CONVERT_TO_INTEGER(sqft_above, 1, 1, 0) AS sqft_above,
  CONVERT_TO_INTEGER(sqft_basement, 1, 1, 0) AS sqft_basement,
  CONVERT_TO_INTEGER(yr_built, 1, 1, 0) AS yr_built,
  CONVERT_TO_INTEGER(yr_renovated, 1, 1, 0) AS yr_renovated,
  CONVERT_TO_INTEGER(zipcode, 1, 1, 0) AS zipcode,
  CONVERT_TO_FLOAT(lat, 1, 1, 0) AS lat,
  CONVERT_TO_FLOAT(long, 1, 1, 0) AS long,
  CONVERT_TO_INTEGER(sqft_living15, 1, 1, 0) AS sqft_living15,
  CONVERT_TO_INTEGER(sqft_lot15, 1, 1, 0) AS sqft_lot15
FROM (
  SELECT price, bedrooms, bathrooms, sqft_living, sqft_lot, floors,
         waterfront, view, data_adls."condition" AS "condition", grade,
         sqft_above, sqft_basement, yr_built, yr_renovated, zipcode,
         lat, long, sqft_living15, sqft_lot15
  FROM ADLS.house
) nested_0
WHERE ( "bathrooms" < 2 ) IS FALSE
We have created a new space for our data and will save it there.
Creating a model in R
Before we start, we need to connect R to Dremio. We can do it using the following code:
install.packages("RODBC") library(RODBC) dremio_host <- "localhost" dremio_port <- "31010" dremio_uid <- "dremio_user" dremio_pwd <- "dremio_pass" channel <- odbcDriverConnect(sprintf("DRIVER=Dremio Connector;HOST=%s;PORT=%s;UID=%s;PWD=%s;AUTHENTICATIONTYPE=Basic Authentication;CONNECTIONTYPE=Direct", dremio_host, dremio_port, dremio_uid, dremio_pwd)) df <- sqlQuery(channel, "SELECT * FROM azure.data_adls") if (is.character(df)) { close(channel); stop(paste(df, collapse = "\n")) }
When the data is loaded, we can begin building a linear model (regression). Regression is used for predictive analysis: it answers the question “What is going to happen next?” and helps us understand which variables affect the final result. Let’s look at the internal structure of our data.
library(corrplot)
library(caret)
str(house)
We also display the first rows of the data. The result is shown in the screenshot below.
head(house)
Now, we will compute the correlations among all the features, including price, and build a correlation plot.
correlation <- cor(house)
corrplot(correlation, type = "full", method = "circle", main = "Correlation")
To get a view of multiple associations at once, we can use a correlation heatmap. Revealing the relationships in the data is an important step for achieving better accuracy, so we can’t skip it. A correlation heatmap can be visualized in different ways; here is an example of a lower heatmap.
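For reference, here is a minimal sketch of how such a lower-triangle heatmap can be produced with corrplot (the exact styling of the screenshot may differ):

# Lower-triangle correlation heatmap; color encodes the strength
# and sign of each pairwise correlation
corrplot(correlation, type = "lower", method = "color",
         tl.col = "black", tl.srt = 45)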
We will use this chart a little bit later. Now, let’s select 80% of the total ‘n’ rows of the data as a training sample.
set.seed(100)
sample <- sample.int(n = nrow(house), size = floor(0.80 * nrow(house)), replace = FALSE)
We split the data into train and test sets and create a linear model, choosing significant parameters based on our correlation heatmap.
train <- house[sample, ]
test <- house[-sample, ]

# Fit a linear model for price on the selected features
mod <- lm(price ~ sqft_living + grade + waterfront + view + condition +
            zipcode + bathrooms + bedrooms, data = train)
summary(mod)
R-squared is 0.834 and adjusted R-squared is 0.8322, which is a good result.
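As a reminder (a note added to the original tutorial), adjusted R-squared penalizes extra predictors: adjusted R-squared = 1 - (1 - R^2)(n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors. This makes it the better metric to watch when we remove features below.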
Looking at the summary, it seems that the contribution of bedrooms is not significant alongside the other features in the model. Let’s see if we can get better results: we will take bedrooms out and create another model, mod1.
mod1 <- lm(price ~ sqft_living + grade + waterfront + view + condition +
             zipcode + bathrooms, data = train)
summary(mod1)
As we can see, there wasn’t a significant change after taking out bedrooms. You can also try out different models based on the correlation heatmap.
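One way to compare candidate models more systematically (an optional addition, not shown in the original tutorial) is an information criterion such as AIC, where lower values are better:

# Compare the two fitted models; lower AIC indicates a better trade-off
# between goodness of fit and model complexity
AIC(mod, mod1)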
Now, we will generate predictions on the test data using mod1.
# Predict prices for the held-out test set
predP <- predict(mod1, test)

# Compare observed and predicted values; defaultSummary() from caret
# reports RMSE, R-squared, and MAE
act_pred <- data.frame(obs = test$price, pred = predP)
defaultSummary(act_pred)
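As an optional visual check (using the act_pred data frame created above), we can plot predicted against observed prices; points close to the diagonal line indicate accurate predictions:

# Predicted vs. observed prices; the 45-degree line marks perfect predictions
plot(act_pred$obs, act_pred$pred,
     xlab = "Observed price", ylab = "Predicted price")
abline(0, 1, col = "red")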
Finally, don’t forget to close the channel.
close(channel)
Conclusions
In this tutorial, we showed how to build a regression model in R using data stored in Azure Data Lake Store. We used Dremio to connect to the data in ADLS and perform data preprocessing, so we could then easily connect to Dremio from R and work with clean data.
We hope you enjoyed this tutorial. Stay tuned for more tutorials and resources to learn how you can use Dremio to start gaining insights from your data, faster.