Data in its natural form is not that valuable if you cannot visualize it. There are lots of visualization libraries available in the community, which may make it difficult to select one. In this tutorial, we hope to make your selection process a little bit easier by showing you how to work with Dash.
Dash is a productive Python framework for building web applications. Written on top of Flask, Plotly.js, and React.js, Dash gives you the ability to build highly customizable visualization apps to consume your Python data.
In this article, we will use Dremio to connect to two different datasets located in different data sources, then we will connect Python to Dremio and analyze the data and then use Dash to create the visualization app.
I will walk you through the steps of making sure we have everything we need, including installing Dash on your environment.
Assumptions
We recommend that you have the following installed in your environment before continuing:
This is a very simple process, first we need to install the core dash backend by running the following command:
pip install dash==1.1.1
Then we need to install the DAQ components:
pip install dash-daq==0.1.0
Testing Dash
Now let’s make sure our Dash instance is working. Using your text editor of choice, create a file named “dremio_dash.py” and add the following code to it.
import dash
import dash_core_components as dcc
import dash_html_components as html
external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']
app = dash.Dash(__name__, external_stylesheets=external_stylesheets)
app.layout = html.Div(children=[
html.H1(children='Visualizing Data with Dash - Explained by Dremio'),
html.Div(children='''
Dash: A web application framework for Python.
'''),
dcc.Graph(
id='example-graph',
figure={
'data': [
{'x': [1, 2, 3], 'y': [4, 1, 2], 'type': 'bar', 'name': 'SF'},
{'x': [1, 2, 3], 'y': [2, 4, 5], 'type': 'bar', 'name': u'Montréal'},
],
'layout': {
'title': 'Dash Data Visualization'
}
}
)
])
if __name__ == '__main__':
app.run_server(debug=True)
Then in your terminal, execute the app by running the following command:
python dremio_dash.py
Having trouble? It is possible that you are using the wrong python version, try the same command on python3 instead.
Once you run the app, you should see the following message on your terminal:
Now, in your browser, navigate to http://127.0.0.1:8050 this will open the app that we just created. If everything went as expected, you should see the following rendering on your screen:
All good? Now we are going to get some data ready to visualize.
Preparing the data
Now we are going to create and curate a virtual dataset in Dremio for us to visualize in Dash. First, launch Dremio and log in.
Then click on the plus sign on the sources section to add a new source
Then we will add a new Sample Source, then select the SF_Incidents2016.json set. If you want to join data from multiple data sources, this is the right moment to do so, for more information check out our tutorial: Joining Multiple Data Sources
Tidying up the data
There are a few things we need to do to this dataset before it is fully ready for consumption. You will notice that the date, and time fields are ‘string’ fields indicated by the ABC icon next to them.
We need to convert these fields to their correct format, starting with the date field, select the “ABC” icon next to it and then from the menu select ‘Date’
Then select the format that you want to use, in this case we will use the default YYYY-MM-DD format, then select the option to drop the original field and then click “apply”
Now, repeat the same steps for the ‘Time’ field.
Since the field contains the hours in 24HR format (indicating 13 instead of 1 PM), we will select the second option: HH24:MI and then apply the changes
Save your work!
Since we have made several changes to the dataset now we can save it as a different VDS which will we use later in the visualization. Click on “Save as” and give it a name: sf_incidents_curated and click “save”
And that’s it! Now, we are ready to connect to Python and start building our visualization app.
Setting up Python
Now we need to make sure that we have all the components ready to connect Python to Dremio.
If you haven’t done so, please install the latest available version of Pandas
>pip install pandas
Similarly, PyODBC, a Python package to allow us to use ODBC drivers will be required. To see if PyODBC is already installed, try importing it inside Python
Import pyodbc
Pyodbc.version
>>'4.0.24'
If importing the package fails, it can be installed using pip
> pip install pyodbc
Setting up the Dremio Connector
Now that everything has been set up, we can test is everything is up and running by executing the following piece of code.
Note: I will use a Jupyter notebook and Python3 to run the following test, however, you can just create a .py file with the following code and run the app once the code is complete.
import pyodbc
import pandas as pd
#Setup the ODBC connection to the Dremio instance**
host = 'localhost'
port = 31010
uid = 'lucio'
pwd = 'dremio123'
driver = "/Library/Dremio/ODBC/lib/libdrillodbc_sbu.dylib"
cnxn=pyodbc.connect("Driver={};ConnectionType=Direct;HOST={};PORT={};AuthenticationType=Plain;UID={};PWD={}".format(driver,host,port,uid,pwd),autocommit=True)
# SQL style query to retrive the required data from Dremio**
cursor = cnxn.cursor()
sql = '''SELECT IncidntNum, Category, Descript FROM "@lucio".sf_incidents_curated limit 20'''
# Executing the query using the alredy open ODBC connection
test_data = pd.read_sql(sql, cnxn)
# Disaplying the top 5 rows from the retrieved dataset**
test_data.head()
Notice that I’m using the credential and connection information for my environment, you would have to edit the host, user name, password and also the path for the dataset to match your environment’s info.
Once the script runs successfully, you should be able to see the following results:
Visualizing using Dash
Now that we have everything in place, it is time for us to build our visualization app using Dash.
First, we are going to import all the necessary libraries for the app.
import dash
import dash_core_components as dcc
import dash_html_components as html
import pyodbc
import pandas as pd
import plotly.graph_objs as go
from app import app
from dash.dependencies import Input, Output
Then I’m going to set up the queries that I’m going to use to feed the graphs.
# SQL style query to retrive the required data from Dremio
cursor = cnxn.cursor()
sql = '''SELECT Fecha, Category, count(*) as incidents FROM "@lucio".IncidentsCurated GROUP BY Category, Fecha order by Fecha'''
sql2= '''SELECT Fecha, Category, PdDistrict, count(*) as incidents FROM "@lucio".IncidentsCurated GROUP BY Category, Fecha, PdDistrict'''
Now, I will set up the data frames that I will use for each graph.
Then I will build the app layout and bring some external stylesheets
app_name = 'dash-timeseriesplot'
external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']
app = dash.Dash(__name__, external_stylesheets=external_stylesheets)
app.title = 'Dremio + Dash'
trace1 = go.Bar(x=df2.PdDistrict, y=df2.incidents, name='PdDistrict')
app.layout = html.Div([html.H1("San Francisco Incidents", style={'textAlign': 'center'}),
dcc.Dropdown(id='my-dropdown', options=[
{'label': i, 'value': i} for i in df.Category.unique()
], multi=True, value=['ASSAULT'], placeholder='Filter by category...'),
dcc.Graph(id='Lucio'),
dcc.Graph(
id='example-graph',
figure={
'data': [trace1],
'layout':
go.Layout(title='Number of Incidents per District', barmode='stack')
})
], className="container")
In the previous block, I’m creating an HTML layout that will contain three different objects:
A text input to include all the incident categories
A line chart
A bar chart
Each one of these elements will be updated by the items selected in the text input using the following function:
@app.callback(
Output(component_id='Lucio', component_property='figure'),
[Input(component_id='my-dropdown', component_property='value')])
def update_graph(selected_dropdown_value):
dropdown = {"ASSAULT": "ASSAULT","NON-CRIMINAL": "NON-CRIMINAL", "OTHER OFFENSES":"OTHER OFFENSES",
"LARCENY/THEFT":"LARCENY/THEFT","WARRANTS":"WARRANTS","VEHICLE THEFT":"VEHICLE:THEFT", "SUSPICIOUS OCC":"SUSPICIOUS OCC",
"BURGLARY":"BURGLARY", "VANDALISM":"VANDALISM","MISSING PERSON":"MISSING PERSON","DRUG/NARCOTIC":"DRUG/NARCOTIC"}
trace1 = []
for category in selected_dropdown_value:
trace1.append(go.Scatter(x=df[df["Category"] == category]["Fecha"],y=df[df["Category"] == category]["incidents"],mode='lines',
opacity=0.7,name=f' {dropdown[category]}',textposition='bottom center'))
traces = [trace1]
data = [val for sublist in traces for val in sublist]
figure = {'data': data,
'layout': go.Layout(colorway=["#5E0DAC", '#FF4F00', '#375CB1', '#FF7400', '#FFF400', '#FF0056'],
height=600,title=f"Incidents per Category {', '.join(str(dropdown[i]) for i in selected_dropdown_value)} Over Time",
xaxis={"title":"Date",
'rangeselector': {'buttons': list([{'count': 1, 'label': '1M', 'step': 'month', 'stepmode': 'backward'},
{'count': 6, 'label': '6M', 'step': 'month', 'stepmode': 'backward'},
{'step': 'all'}])},
'rangeslider': {'visible': True}, 'type': 'date'},yaxis={"title":"Number of Incidents"})}
return figure
The final app will look like this:
import os
import dash
import dash_core_components as dcc
import dash_html_components as html
import pyodbc
import pandas as pd
import plotly.graph_objs as go
from app import app
from dash.dependencies import Input, Output
# Setup the ODBC connection to the Dremio instance
host = 'localhost'
port = 31010
uid = 'lucio'
pwd = 'dremio123'
driver = "/Library/Dremio/ODBC/lib/libdrillodbc_sbu.dylib"
cnxn=pyodbc.connect("Driver={};ConnectionType=Direct;HOST={};PORT={};AuthenticationType=Plain;UID={};PWD={}".format(driver,host,port,uid,pwd),autocommit=True)
# SQL style query to retrive the required data from Dremio
cursor = cnxn.cursor()
sql = '''SELECT Fecha, Category, count(*) as incidents FROM "@lucio".IncidentsCurated GROUP BY Category, Fecha order by Fecha'''
sql2= '''SELECT Fecha, Category, PdDistrict, count(*) as incidents FROM "@lucio".IncidentsCurated GROUP BY Category, Fecha, PdDistrict'''
# Executing the query using the alredy open ODBC connection
df = pd.read_sql(sql,cnxn)
df2 = pd.read_sql(sql2,cnxn)
df['Fecha'] = pd.to_datetime(df.Fecha, infer_datetime_format=True)
incidents=len(df.Category)
cat=df['Category'].unique()
res=df2['PdDistrict'].unique()
app_name = 'dash-timeseriesplot'
external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']
app = dash.Dash(__name__, external_stylesheets=external_stylesheets)
app.title = 'Dremio + Dash'
trace1 = go.Bar(x=df2.PdDistrict, y=df2.incidents, name='PdDistrict')
app.layout = html.Div([html.H1("San Francisco Incidents", style={'textAlign': 'center'}),
dcc.Dropdown(id='my-dropdown', options=[
{'label': i, 'value': i} for i in df.Category.unique()
], multi=True, value=['ASSAULT'], placeholder='Filter by category...'),
dcc.Graph(id='Lucio'),
dcc.Graph(
id='example-graph',
figure={
'data': [trace1],
'layout':
go.Layout(title='Number of Incidents per District', barmode='stack')
})
], className="container")
@app.callback(
Output(component_id='Lucio', component_property='figure'),
[Input(component_id='my-dropdown', component_property='value')])
def update_graph(selected_dropdown_value):
dropdown = {"ASSAULT": "ASSAULT","NON-CRIMINAL": "NON-CRIMINAL", "OTHER OFFENSES":"OTHER OFFENSES",
"LARCENY/THEFT":"LARCENY/THEFT","WARRANTS":"WARRANTS","VEHICLE THEFT":"VEHICLE:THEFT", "SUSPICIOUS OCC":"SUSPICIOUS OCC",
"BURGLARY":"BURGLARY", "VANDALISM":"VANDALISM","MISSING PERSON":"MISSING PERSON","DRUG/NARCOTIC":"DRUG/NARCOTIC"}
trace1 = []
for category in selected_dropdown_value:
trace1.append(go.Scatter(x=df[df["Category"] == category]["Fecha"],y=df[df["Category"] == category]["incidents"],mode='lines',
opacity=0.7,name=f' {dropdown[category]}',textposition='bottom center'))
traces = [trace1]
data = [val for sublist in traces for val in sublist]
figure = {'data': data,
'layout': go.Layout(colorway=["#5E0DAC", '#FF4F00', '#375CB1', '#FF7400', '#FFF400', '#FF0056'],
height=600,title=f"Incidents per Category {', '.join(str(dropdown[i]) for i in selected_dropdown_value)} Over Time",
xaxis={"title":"Date",
'rangeselector': {'buttons': list([{'count': 1, 'label': '1M', 'step': 'month', 'stepmode': 'backward'},
{'count': 6, 'label': '6M', 'step': 'month', 'stepmode': 'backward'},
{'step': 'all'}])},
'rangeslider': {'visible': True}, 'type': 'date'},yaxis={"title":"Number of Incidents"})}
return figure
if __name__ == '__main__':
app.run_server(debug=True)
Once we save the file and run it in Python, we should see the following results:
At this point you can select any of the categories available in the text input to see the interaction between the input and the graphs.
Conclusion
Dash is a very powerful tool that makes data visualization very easy allowing you to put apps together in just a few minutes without the need to write code on other language but python. In this tutorial we walked through the steps to use Dremio to access and curate data located in multiple sources and then we established a connection from Python to Dremio to feed data to our Dash app.
We hope you enjoyed this tutorial, and stay tuned for more.
Ready to Get Started? Here Are Some Resources to Help
Webinars
Cyber Lakehouse for the AI Era, ZTA and Beyond
Many agencies today are struggling not only with managing the scale and complexity of cyber data but also with extracting actionable insights from that data. With new data retention regulations, such as M-21-31, compounding this problem further, agencies need a next-generation solution to address these challenges.
Bring your users closer to the data with organization-wide self-service analytics and lakehouse flexibility, scalability, and performance at a fraction of the cost. Run Dremio anywhere with self-managed software or Dremio Cloud.