
Using Data Mesh to Advance Distributed Data Access, Agility and Governance
Join this live fireside chat to learn about using Data Mesh to Advance Distributed Data Access, Agility and Governance.
read moreAzure Data Lake is a scalable data storage and analytics service. It reduces efforts for everyone to store data of any format and type and do processing and analytics across platforms and languages. Azure Data Lake provides a variety of functions and solutions for data management and governance. For making your application more powerful, you can integrate ADLS with operational stores and data warehouses.
In this tutorial, we will analyze data stored in Azure Data Lake Store with R. To combine these two tools, we will use Dremio which will also be helpful for performing data preprocessing. Then, We will connect from R to Dremio and build a simple regression model.
We assume that you have Dremio and ODBC driver installed; if not, go to Dremio’s deploy page, pick the installations for your operating system, and visit Dremio Docs to read about installations and deployments. You should have an Azure Account and storage account; also, we will be working with R and RStudio, so you need to have them installed and configured as well.
This time we will analyze the house pricing dataset available in Kaggle. This dataset contains house sale prices for King County, namely homes sold between May 2014 and May 2015.
To import data into Azure Azure Data Lake Storage, go to Azure All services page, open Storage and choose Data Lake. Then, add a container and upload data as shown on the screenshots below.
When data is uploaded, log in to Dremio and choose an option to add a new source.
Before connecting we need to do some preparations:
Now, we can begin data preprocessing. First, let’s drop the id and date columns. Usually, they are useful, especially date while providing analysis based on time changing. However, in our case, we can drop them.
We can see that all columns are seen as char. We need to transform them to numeric data types. Click on “Abc” or choose “Convert Data Type” from the list.
Also, Dremio provides handy and powerful filtering. There are three different methods. You can write the custom condition by yourself or use functions from a list. Using the Preview option helps to understand what you get as a result. Let’s exclude houses that have fewer than 2 bathrooms:
After all the preparations, Dremio will automatically generate the final SQL script:
SELECT CONVERT_TO_FLOAT(price, 1, 1, 0) AS price, CONVERT_TO_INTEGER(bedrooms, 1, 1, 0) AS bedrooms, CONVERT_TO_FLOAT(bathrooms, 1, 1, 0) AS bathrooms, CONVERT_TO_INTEGER(sqft_living, 1, 1, 0) AS sqft_living, CONVERT_TO_INTEGER(sqft_lot, 1, 1, 0) AS sqft_lot, CONVERT_TO_FLOAT(floors, 1, 1, 0) AS floors, CONVERT_TO_INTEGER(waterfront, 1, 1, 0) AS waterfront, CONVERT_TO_INTEGER(view, 1, 1, 0) AS view, CONVERT_TO_INTEGER(house."condition", 1, 1, 0) AS "condition", CONVERT_TO_INTEGER(grade, 1, 1, 0) AS grade, CONVERT_TO_INTEGER(sqft_above, 1, 1, 0) AS sqft_above, CONVERT_TO_INTEGER(sqft_basement, 1, 1, 0) AS sqft_basement, CONVERT_TO_INTEGER(yr_built, 1, 1, 0) AS yr_built, CONVERT_TO_INTEGER(yr_renovated, 1, 1, 0) AS yr_renovated, CONVERT_TO_INTEGER(zipcode, 1, 1, 0) AS zipcode, CONVERT_TO_FLOAT(lat, 1, 1, 0) AS lat, CONVERT_TO_FLOAT(long, 1, 1, 0) AS long, CONVERT_TO_INTEGER(sqft_living15, 1, 1, 0) AS sqft_living15, CONVERT_TO_INTEGER(sqft_lot15, 1, 1, 0) AS sqft_lot15 FROM ( SELECT price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view, data_adls."condition" AS "condition", grade, sqft_above, sqft_basement, yr_built, yr_renovated, zipcode, lat, long, sqft_living15, sqft_lot15 FROM ADLS.house ) nested_0 WHERE ( "bathrooms" < 2 ) IS FALSE
We have created a new space for our data and will save it there.
Before we start, we need to connect R to Dremio. We can do it using the following code:
install.packages("RODBC") library(RODBC) dremio_host <- "localhost" dremio_port <- "31010" dremio_uid <- "dremio_user" dremio_pwd <- "dremio_pass" channel <- odbcDriverConnect(sprintf("DRIVER=Dremio Connector;HOST=%s;PORT=%s;UID=%s;PWD=%s;AUTHENTICATIONTYPE=Basic Authentication;CONNECTIONTYPE=Direct", dremio_host, dremio_port, dremio_uid, dremio_pwd)) df <- sqlQuery(channel, "SELECT * FROM azure.data_adls") if (is.character(df)) { close(channel); stop(paste(df, collapse = "\n")) }
When data is uploaded, we can begin building a linear model (regression). Regression is used for predictive analysis. It gives us an answer to the question “What is going to be next?”. Also, it helps to understand which values affect the final result. Let’s see the internal structure of our data.
library(corrplot) library(caret) str(house)
Also, we display the first rows of data. The result is shown on the screenshot below.
head(house)
Now, we will find a correlation among all the features and price and build a correlation plot.
correlation <- cor(house) corrplot(correlation, type="full", method = "circle", main="Correlation")
In order to have a view of multiple associations, we should use correlation heatmap. As it is an urgent task to reveal relations between data in order to receive better accuracy, we can’t skip this step. Correlation heatmap can be visualized in different ways. Here is an example of a lower heatmap.
We will use this chart a little bit later. Now, select 80% of data as a sample from total ‘n’ rows of the data.
set.seed(100) sample <- sample.int(n=nrow(house), size = floor(0.80*nrow(house)), replace = F)
We split train and test data and create a linear model. Let’s choose significant parameters from our correlation heatmap.
train <- house[sample, ] test <- house[-sample, ] mod <- lm(price ~ sqft_living + grade + waterfront + view + condition +zipcode+bathrooms+bedrooms, data = train) summary(mod)
R-Squared is 0.834, adjusted R-Squared is 0.8322 which is a good result.
Looking at the summary, it seems that bedrooms contribution is not significant with other considered features in the model. Let’s see if we can find better results. We will take the bedroom out and create another model mod1.
mod1 <- lm(price ~ sqft_living +grade + waterfront + view + condition +zipcode+bathrooms, data = train) summary(mod1)
As we can see, there wasn’t a significant change after taking out bathrooms. You can also try out different models based on correlation heatmap.
Now, we will predict test data on mod1.
predP <- predict(mod1, test) act_pred <- data.frame(obs=test$price, pred=predP) defaultSummary(act_pred)
Finally, don’t forget to close the channel.
close(channel)
In this tutorial, we showed how to build a regression model in R using data stored in Azure Data Lake Store. We used Dremio to connect to the data in ADSL and perform data preprocessing, so then we could easily connect to Dremio from R and work with clean data.
We hope you enjoyed this tutorial. stay tuned for more tutorials and resources to learn how you can use Dremio to start gaining insights from your data, faster.
Join this live fireside chat to learn about using Data Mesh to Advance Distributed Data Access, Agility and Governance.
read moreThe data lakehouse is a new architecture that combines the best parts of data lakes and data warehouses. Learn more about the data lakehouse and its key advantages.
read more