Data Science on The Data Lake: A Quick Guide to Feature Engineering Using Dremio
Dremio is a data lake engine that allows you to work directly on the sources of data. This means that you don’t need to build complex extract-transform-load (ETL) pipelines, intermediate aggregation tables, or layers. You can take Dremio and issue fast SQL queries to your data even if data sources don’t support SQL. Dremio has built-in connectors for a lot of different sources; not only for data lakes, but also for tools such as ElasticSearch, MongoDB, Hive, PostgreSQL, and many others. In addition, you can create your custom connectors if you need to.
Data analysts and data scientists are among the main users of Dremio. They often need to perform feature engineering. The essence of feature engineering is the creation of new features based on the existing features in the dataset.
For data scientists, the main goal of this process is to increase the predictive power of machine learning models. By creating new features, we can fine tune models and enhance their accuracy. Sometimes even machine learning algorithms are unable to see some patterns in data. Thanks to feature engineering, they can use those patterns to generate predictions. For example, think about dates in the string data type format. Using feature engineering, we can take months and years as separate explicit features. As a result, models will be able to use them.
Data analysts often need to add new features in the dataset to perform analysis and prepare reports that include hidden information and insights.
In a relational table view, features are equal to columns. Dremio provides a convenient way to create new columns in the data curation step. In this tutorial, we will show how to do this using a subset of the Beijing Multi-Site Air-Quality Data Data Set.
Also, we assume that the dataset is already in Dremio. This means that you have uploaded the dataset into one of the data sources that Dremio supports out-of-the-box and then connected this source to Dremio. For example, you can use Amazon Web Services S3 storage. We have many tutorials where we describe the full pipeline of uploading data to the data source and connecting it with Dremio.
The selected dataset is big. We don’t need to use its full version for the purposes of this tutorial; we’ll only need to work with part of it. This part includes data from the Dongsi air-quality monitoring site.
Example(s) of creating a new column using the ADD CALCULATED FIELD button
Initially, we have the dataset with columns for the date (year, month, day) and for the different air quality metrics (like CO, SO2, NO2, etc.). Also, there are columns for weather data (TEMP, RAIN, PRES, etc.). The beginning of the data frame (its left side) looks like this:
Here is the inner part of the table (where the air quality indicators are located):
Suppose that we are going to build a model for sulfur dioxide (SO2) prediction, given information about the weather, time, and concentrations of other substances in the air. It may be useful for the model to use not only absolute values of gases concentrations, but also to have features that represent relations between levels of different gases. We don’t know beforehand whether this will help. But model-test-improve cycles are natural for machine learning.
So let’s say that we want to have a column (feature) where the ratio between ozone (O3) and carbon monoxide (CO) concentrations is represented. In other words, we want to place the results of dividing the concentration of ozone by the concentration of carbon monoxide into the new column. To do this in Dremio:
Click the Add Field button:
In the opened section, enter the formula for the new field calculation:
Note that we put names of the columns in double-quotes. Enter the name for the new column:
Then, click on the Preview button. Check if the calculations were fulfilled as expected:
If it is what you want to do, click the Apply button. If not, think about how to change the formula.
This was a simple example. It engaged only a mathematical operator (the division). But Dremio also supports more sophisticated calculations. A range of defined functions serves this purpose. Let’s look at another example.
We have a column with a temperature for each observation:
Suppose that we are interested not in the exact value of temperature, but in whether it is above zero (positive number) or below zero (negative number) degrees. In other words, we want to extract the sign of the number from the TEMP column. This may be useful because we will provide the model with the precise pattern that can help it to predict the target variable. In other words, by having this feature, the model will not need to analyze temperature feature and search for some pattern. The model will have the column containing information about whether the temperature is below or above zero. Extracting some hidden information from existing features and representing this information in a clear format as a separate feature is one of the main techniques of feature engineering.
Here is how to do it in Dremio:
The first step is the same - click the Add Field button. From the right side of the opened section, the list with available functions is displayed:
You can find the needed function either by scrolling the list or by typing a text in the search field. We have entered “si” and here are the search results:
The SIGN() function is what we need. Press the “+” (plus) button to add it to the formula. Pass the name of the column TEMP as a parameter to the SIGN() function. Specify the name for the new column (let’s call it temp_sign in this example). Then press the Preview button:
Check if everything is correct and confirm with the Apply button:
Using formulas, you can perform complex calculations for features creation based on several columns and with different conditions.
Example of creating a new column using SQL query in Dremio SQL Editor
The described approach to creating new columns in Dremio is not unique. Dremio supports SQL query to the datasets, so we can use SQL to create new fields. In this section, we will show how to do this.
You can open the SQL editor with the corresponding SQL Editor button:
Here is what you can see there:
This field can be used to write custom SQL queries. Among other things, these queries can create new columns.
Let’s create a column which will show which period of the year the current observation belongs to. This can improve the prediction results because it is reasonable to suppose that the concentrations of different gases in the air can vary depending on the season. The model can fail to understand this if it only has the ordinary number of the month. So we will again extract the hidden information and represent it in a clear format. By the way, feature engineering often relies on the so-called subject domain expertise. We can have an ecologist or chemist tell us what factors may affect the level of sulfur dioxide (our target variable). Then we will work on the dataset to represent those factors as new features.
To generate this feature, we will use the SQL CASE statement.
Add the following code into the field in SQL Editor:
1 2 3 4 5 6 7 8 <table> SELECT * , CASE WHEN "month" >= 3 AND "month" <= 5 THEN 'Spring' WHEN "month" >= 6 AND "month" <= 8 THEN 'Summer' WHEN "month" >= 9 AND "month" <= 11 THEN 'Autumn' WHEN "month" = 3 OR "month" <= 2 THEN 'Winter' END AS "year_period" FROM beijing_air_1
Now click the Preview button at the top right corner of the page:
In the result, we should get the new column called “year_period” in the data frame:
Save the data frame using the Save As button.
In a similar way, you can create other columns if you need to. Also, you can use Dremio’s SQL Editor to perform almost all of the operations which you can perform through a DBMS like PostgreSQL, MySQL, etc. But with Dremio, you can write SQL queries to sources which don’t naturally support SQL.
In this tutorial, we have demonstrated how to use Dremio to create new columns based on existing data. We did it in two ways: with the Add Field button and via SQL querying. The first method is relevant when the formula required to calculate the new field is relatively simple. The second method allows you to create columns from complex and long SQL queries. Use the Dremio’s ability to add new columns when you need to create additional inputs to the machine learning model or in your analytical work.
To learn more about Dremio visit our tutorials and resources, also if you would like to experiment with Dremio on your own virtual lab, go ahead and checkout Dremio University, and if you have any questions visit our community forums where we are all eager to help.