Dremio Jekyll

Using Dremio and Python Dash to Visualize Data from Amazon S3

Dremio

Intro

Data in its natural form is not that valuable if you cannot visualize it. There are lots of visualization libraries available in the community, which may make it difficult to select one. In this tutorial, we hope to make your selection process a little bit easier by showing you how to work with Dash.

Dash is a productive Python framework for building web applications. Written on top of Flask, Plotly.js, and React.js, Dash gives you the ability to build highly customizable visualization apps to consume your Python data.

In this article, we will use Dremio to connect to two different datasets located in different data sources, then we will connect Python to Dremio and analyze the data and then use Dash to create the visualization app.

I will walk you through the steps of making sure we have everything we need, including installing Dash on your environment.

Assumptions

We recommend that you have the following installed in your environment before continuing:

Installing Dash

This is a very simple process, first we need to install the core dash backend by running the following command:

pip install dash==1.1.1

Then we need to install the DAQ components:

pip install dash-daq==0.1.0

Testing Dash

Now let’s make sure our Dash instance is working. Using your text editor of choice, create a file named “dremio_dash.py” and add the following code to it.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
import dash
import dash_core_components as dcc
import dash_html_components as html
external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']
app = dash.Dash(__name__, external_stylesheets=external_stylesheets)
app.layout = html.Div(children=[
    html.H1(children='Visualizing Data with Dash - Explained by Dremio'),
    html.Div(children='''
        Dash: A web application framework for Python.
    '''),
    dcc.Graph(
        id='example-graph',
        figure={
            'data': [
                {'x': [1, 2, 3], 'y': [4, 1, 2], 'type': 'bar', 'name': 'SF'},
                {'x': [1, 2, 3], 'y': [2, 4, 5], 'type': 'bar', 'name': u'Montréal'},
            ],
            'layout': {
                'title': 'Dash Data Visualization'
            }
        }
    )
])
if __name__ == '__main__':

    app.run_server(debug=True)

Then in your terminal, execute the app by running the following command:

1
python dremio_dash.py

Having trouble? It is possible that you are using the wrong python version, try the same command on python3 instead.

Once you run the app, you should see the following message on your terminal:

image alt text

Now, in your browser, navigate to http://127.0.0.1:8050 this will open the app that we just created. If everything went as expected, you should see the following rendering on your screen:

image alt text

All good? Now we are going to get some data ready to visualize.

Preparing the data

Now we are going to create and curate a virtual dataset in Dremio for us to visualize in Dash. First, launch Dremio and log in.

image alt text

Then click on the plus sign on the sources section to add a new source

image alt text

Then we will add a new Sample Source, then select the SF_Incidents2016.json set. If you want to join data from multiple data sources, this is the right moment to do so, for more information check out our tutorial: Joining Multiple Data Sources

image alt text

Tidying up the data

There are a few things we need to do to this dataset before it is fully ready for consumption. You will notice that the date, and time fields are ‘string’ fields indicated by the ABC icon next to them.

image alt text

We need to convert these fields to their correct format, starting with the date field, select the “ABC” icon next to it and then from the menu select ‘Date’

image alt text

Then select the format that you want to use, in this case we will use the default YYYY-MM-DD format, then select the option to drop the original field and then click “apply”

image alt text

Now, repeat the same steps for the ‘Time’ field.

image alt text

Since the field contains the hours in 24HR format (indicating 13 instead of 1 PM), we will select the second option: HH24:MI and then apply the changes

image alt text

Save your work!

Since we have made several changes to the dataset now we can save it as a different VDS which will we use later in the visualization. Click on “Save as” and give it a name: sf_incidents_curated and click “save”

image alt text

And that’s it! Now, we are ready to connect to Python and start building our visualization app.

Setting up Python

Now we need to make sure that we have all the components ready to connect Python to Dremio.

If you haven’t done so, please install the latest available version of Pandas

1
>pip install pandas

Similarly, PyODBC, a Python package to allow us to use ODBC drivers will be required. To see if PyODBC is already installed, try importing it inside Python

1
2
3
4
Import pyodbc
Pyodbc.version

>>’4.0.24’

If importing the package fails, it can be installed using pip

1
> pip install pyodbc

Setting up the Dremio Connector

Now that everything has been set up, we can test is everything is up and running by executing the following piece of code.

Note: I will use a Jupyter notebook and Python3 to run the following test, however, you can just create a .py file with the following code and run the app once the code is complete.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
import pyodbc
import pandas as pd
#Setup the ODBC connection to the Dremio instance**
host = 'localhost'
port = 31010
uid = 'lucio'
pwd = 'dremio123'
driver = "/Library/Dremio/ODBC/lib/libdrillodbc_sbu.dylib"
cnxn=pyodbc.connect("Driver={};ConnectionType=Direct;HOST={};PORT={};AuthenticationType=Plain;UID={};PWD={}".format(driver,host,port,uid,pwd),autocommit=True)
# SQL style query to retrive the required data from Dremio**
cursor = cnxn.cursor()
sql = '''SELECT IncidntNum, Category, Descript FROM "@lucio".sf_incidents_curated limit 20'''
# Executing the query using the alredy open ODBC connection
test_data = pd.read_sql(sql, cnxn)
# Disaplying the top 5 rows from the retrieved dataset**
test_data.head()

Notice that I’m using the credential and connection information for my environment, you would have to edit the host, user name, password and also the path for the dataset to match your environment’s info.

Once the script runs successfully, you should be able to see the following results:

image alt text

Visualizing using Dash

Now that we have everything in place, it is time for us to build our visualization app using Dash.

First, we are going to import all the necessary libraries for the app.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
import dash

import dash_core_components as dcc

import dash_html_components as html

import  pyodbc

import pandas as pd

import plotly.graph_objs as go

from app import app

from dash.dependencies import Input, Output

Then I’m going to set up the queries that I’m going to use to feed the graphs.

1
2
3
4
5
6
7
# SQL style query to retrive the required data from Dremio

cursor = cnxn.cursor()

sql = '''SELECT Fecha, Category, count(*) as incidents FROM "@lucio".IncidentsCurated GROUP BY Category, Fecha order by Fecha'''

sql2= '''SELECT Fecha, Category, PdDistrict, count(*) as incidents FROM "@lucio".IncidentsCurated GROUP BY Category, Fecha, PdDistrict'''

Now, I will set up the data frames that I will use for each graph.

1
2
3
4
5
6
7
8
9
10
11
df = pd.read_sql(sql,cnxn)

df2 = pd.read_sql(sql2,cnxn)

df['Fecha'] = pd.to_datetime(df.Fecha, infer_datetime_format=True)

incidents=len(df.Category)

cat=df['Category'].unique()

res=df2['PdDistrict'].unique()

Then I will build the app layout and bring some external stylesheets

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
app_name = 'dash-timeseriesplot'

external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']

app = dash.Dash(__name__, external_stylesheets=external_stylesheets)

app.title = 'Dremio + Dash'

trace1 = go.Bar(x=df2.PdDistrict, y=df2.incidents, name='PdDistrict')

app.layout = html.Div([html.H1("San Francisco Incidents", style={'textAlign': 'center'}),

    dcc.Dropdown(id='my-dropdown', options=[

        {'label': i, 'value': i} for i in df.Category.unique()

    ], multi=True, value=['ASSAULT'], placeholder='Filter by category...'),

    dcc.Graph(id='Lucio'),

    dcc.Graph(

        id='example-graph',

        figure={

            'data': [trace1],

            'layout':

            go.Layout(title='Number of Incidents per District', barmode='stack')

        })

], className="container")

In the previous block, I’m creating an HTML layout that will contain three different objects:

  1. A text input to include all the incident categories
  2. A line chart
  3. A bar chart

Each one of these elements will be updated by the items selected in the text input using the following function:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
@app.callback(

    Output(component_id='Lucio', component_property='figure'),

    [Input(component_id='my-dropdown', component_property='value')])

def update_graph(selected_dropdown_value):

    dropdown = {"ASSAULT": "ASSAULT","NON-CRIMINAL": "NON-CRIMINAL", "OTHER OFFENSES":"OTHER OFFENSES",

    "LARCENY/THEFT":"LARCENY/THEFT","WARRANTS":"WARRANTS","VEHICLE THEFT":"VEHICLE:THEFT", "SUSPICIOUS OCC":"SUSPICIOUS OCC",

     "BURGLARY":"BURGLARY", "VANDALISM":"VANDALISM","MISSING PERSON":"MISSING PERSON","DRUG/NARCOTIC":"DRUG/NARCOTIC"}

    trace1 = []

    for category in selected_dropdown_value:

        trace1.append(go.Scatter(x=df[df["Category"] == category]["Fecha"],y=df[df["Category"] == category]["incidents"],mode='lines',

            opacity=0.7,name=f' {dropdown[category]}',textposition='bottom center'))

    traces = [trace1]

    data = [val for sublist in traces for val in sublist]

    figure = {'data': data,

        'layout': go.Layout(colorway=["#5E0DAC", '#FF4F00', '#375CB1', '#FF7400', '#FFF400', '#FF0056'],

            height=600,title=f"Incidents per Category {', '.join(str(dropdown[i]) for i in selected_dropdown_value)} Over Time",

            xaxis={"title":"Date",

                   'rangeselector': {'buttons': list([{'count': 1, 'label': '1M', 'step': 'month', 'stepmode': 'backward'},

                                                      {'count': 6, 'label': '6M', 'step': 'month', 'stepmode': 'backward'},

                                                      {'step': 'all'}])},

                   'rangeslider': {'visible': True}, 'type': 'date'},yaxis={"title":"Number of Incidents"})}

    return figure

The final app will look like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
import os

import dash

import dash_core_components as dcc

import dash_html_components as html

import  pyodbc

import pandas as pd

import plotly.graph_objs as go

from app import app

from dash.dependencies import Input, Output

# Setup the ODBC connection to the Dremio instance

host = 'localhost'

port = 31010

uid = 'lucio'

pwd = 'dremio123'

driver = "/Library/Dremio/ODBC/lib/libdrillodbc_sbu.dylib"

cnxn=pyodbc.connect("Driver={};ConnectionType=Direct;HOST={};PORT={};AuthenticationType=Plain;UID={};PWD={}".format(driver,host,port,uid,pwd),autocommit=True)

# SQL style query to retrive the required data from Dremio

cursor = cnxn.cursor()

sql = '''SELECT Fecha, Category, count(*) as incidents FROM "@lucio".IncidentsCurated GROUP BY Category, Fecha order by Fecha'''

sql2= '''SELECT Fecha, Category, PdDistrict, count(*) as incidents FROM "@lucio".IncidentsCurated GROUP BY Category, Fecha, PdDistrict'''

# Executing the query using the alredy open ODBC connection

df = pd.read_sql(sql,cnxn)

df2 = pd.read_sql(sql2,cnxn)

df['Fecha'] = pd.to_datetime(df.Fecha, infer_datetime_format=True)

incidents=len(df.Category)

cat=df['Category'].unique()

res=df2['PdDistrict'].unique()

app_name = 'dash-timeseriesplot'

external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']

app = dash.Dash(__name__, external_stylesheets=external_stylesheets)

app.title = 'Dremio + Dash'

trace1 = go.Bar(x=df2.PdDistrict, y=df2.incidents, name='PdDistrict')

app.layout = html.Div([html.H1("San Francisco Incidents", style={'textAlign': 'center'}),

    dcc.Dropdown(id='my-dropdown', options=[

        {'label': i, 'value': i} for i in df.Category.unique()

    ], multi=True, value=['ASSAULT'], placeholder='Filter by category...'),

    dcc.Graph(id='Lucio'),

    dcc.Graph(

        id='example-graph',

        figure={

            'data': [trace1],

            'layout':

            go.Layout(title='Number of Incidents per District', barmode='stack')

        })

], className="container")

@app.callback(

    Output(component_id='Lucio', component_property='figure'),

    [Input(component_id='my-dropdown', component_property='value')])

def update_graph(selected_dropdown_value):

    dropdown = {"ASSAULT": "ASSAULT","NON-CRIMINAL": "NON-CRIMINAL", "OTHER OFFENSES":"OTHER OFFENSES",

    "LARCENY/THEFT":"LARCENY/THEFT","WARRANTS":"WARRANTS","VEHICLE THEFT":"VEHICLE:THEFT", "SUSPICIOUS OCC":"SUSPICIOUS OCC",

     "BURGLARY":"BURGLARY", "VANDALISM":"VANDALISM","MISSING PERSON":"MISSING PERSON","DRUG/NARCOTIC":"DRUG/NARCOTIC"}

    trace1 = []

    for category in selected_dropdown_value:

        trace1.append(go.Scatter(x=df[df["Category"] == category]["Fecha"],y=df[df["Category"] == category]["incidents"],mode='lines',

            opacity=0.7,name=f' {dropdown[category]}',textposition='bottom center'))

    traces = [trace1]

    data = [val for sublist in traces for val in sublist]

    figure = {'data': data,

        'layout': go.Layout(colorway=["#5E0DAC", '#FF4F00', '#375CB1', '#FF7400', '#FFF400', '#FF0056'],

            height=600,title=f"Incidents per Category {', '.join(str(dropdown[i]) for i in selected_dropdown_value)} Over Time",

            xaxis={"title":"Date",

                   'rangeselector': {'buttons': list([{'count': 1, 'label': '1M', 'step': 'month', 'stepmode': 'backward'},

                                                      {'count': 6, 'label': '6M', 'step': 'month', 'stepmode': 'backward'},

                                                      {'step': 'all'}])},

                   'rangeslider': {'visible': True}, 'type': 'date'},yaxis={"title":"Number of Incidents"})}

    return figure

if __name__ == '__main__':

    app.run_server(debug=True)

Once we save the file and run it in Python, we should see the following results:

 

At this point you can select any of the categories available in the text input to see the interaction between the input and the graphs.

Conclusion

Dash is a very powerful tool that makes data visualization very easy allowing you to put apps together in just a few minutes without the need to write code on other language but python. In this tutorial we walked through the steps to use Dremio to access and curate data located in multiple sources and then we established a connection from Python to Dremio to feed data to our Dash app.

We hope you enjoyed this tutorial, and stay tuned for more.