Elasticsearch is a popular open source datastore that enables developers to query data using a JSON-style domain-specific language, known as the Query DSL. Elasticsearch’s scale-out architecture, JSON data model, and text search capabilities make it an attractive datastore for many applications.
Dremio enables business analysts and data scientists to utilize Elasticsearch without having to learn the Query DSL. Dremio can translate SQL queries into the Query DSL (and Painless scripts), and can augment Elasticsearch’s execution capabilities so that users can run any SQL query, including arbitrary joins, accurate counts (and count distincts) and more. Furthermore, Dremio Reflections can speed up execution of analytical queries by orders of magnitude.
For this tutorial you’ll need access to Dremio and to an Elasticsearch cluster. This is a more advanced tutorial, so we suggest you first read Getting Oriented to Dremio and Working With Your First Dataset if you haven’t already.
For this tutorial we’ll use the public dataset DOHMH New York City Restaurant Inspection Results. This dataset, provided by the New York City Department of Health and Mental Hygiene, includes the inspection results for restaurants in the city.
We can use the following Python script to create a new index and type in Elasticsearch, and to load the CSV file into the index.
Note that in order to run this script, you’ll need to make sure that you have the elasticsearch and pandas libraries installed:
pip install elasticsearch pip install pandas
To make sure that the script ran successfully, run the following command:
(The jq utility is optional. It simply formats JSON so that it’s easier to read.)
Let’s go ahead and connect your Dremio installation to Elasticsearch. This should only take a few seconds. Access the Dremio UI by connecting your browser to http://localhost:9047 (this assumes that you are running Dremio on your laptop, with the default port, for the purpose of this tutorial):
Notice that Dremio is not currently connected to any data sources (see Sources area in the bottom left). Click on the New Source button to add the Elasticsearch cluster as a source. In the New Source dialog, click on the Elasticsearch option:
Enter the coordinates of your Elasticsearch cluster. Assuming you are running Elasticsearch on your laptop, you should enter the following parameters: Name = production (or any other name that you choose) Host = localhost Authentication = No Authentication
Click Save to return to the main screen in the Dremio UI.
Notice that we now have one data source called production, and inside that data source you can see each of the indexes in your Elasticsearch cluster. In our case, we had previously installed Kibana, so you can see the index called .kibana in addition to the index restaurants which we created with our Python script.
Click on the restaurants index:
You’ll notice that there is one dataset inside production.restaurants. In Dremio, each Elasticsearch document type (i.e., mapping) is exposed as its own dataset. The canonical path to our inspection dataset is production.restaurants.inspections (source.index.type).
Click on the inspections dataset to see the actual data in the Dataset Viewer:
You are now ready to connect any SQL-based tool to the Dremio cluster and start working with your Elasticsearch data. You can also run some SQL queries inside Dremio.
Click on the New Query button at the top and enter the following query:
Rather than clicking on the Preview button, you should change it to Run, so that the results are complete and accurate, rather than based on a sample.
Dremio compiled this SQL query, and the optimizer decided to translate most of the plan into an Elasticsearch Query DSL query that could be pushed down into Elasticsearch:
Let’s look at another example that includes free-text search, leveraging one of Elasticsearch’s strengths over traditional relational databases. To do this, we’ll utilize the SQL contains function, which takes a Lucene expression and pushes it down into Elasticsearch. For example, let’s see which cuisines are most likely to have sanitary violations related to mice or other rodents:
Here we’re aggregating all the records that contain the words mice or vermin in the VIOLATION_DESCRIPTION field:
The query pushed down into Elasticsearch in this case was:
You may notice that according to these results, American restaurants are more likely to feature mice and other rodents in the kitchen, but perhaps this is also influenced by the fact that there are more American restaurants than any other cuisine in NYC. We can answer this question by looking at the ratio of rodent-infested restaurants to all restaurants of each cuisine.
Here you can see that Chilean restaurants, and not American restaurants, are the most likely to feature rodents in the kitchen!
Notice that this is a join query. Elasticsearch doesn’t natively support joins, but that doesn’t prevent Dremio from enabling you to run join queries on Elasticsearch data. Dremio leverages its own distributed execution engine, based on Apache Arrow, to augment the execution capabilities of the underlying data source. In this case, Dremio pushed down two separate queries, in parallel, to Elasticsearch, and joined the results to produce the overall results returned to the client.
It’s worth pointing out that Dremio can take advantage of many different capabilities in the Elasticsearch Query DSL. For example, let’s look at all the businesses that were inspected in the month of January 2017:
Here we are filtering on the inspection date (INSPECTION_DATE) and aggregating on the business name (DBA, or “doing business as”).
You’ll notice that Subway and Dunkin’ Donuts were inspected more than any other restaurant. That’s probably because they have many locations.
The query sent to Elasticsearch in this case was:
In some cases, the Elasticsearch Query DSL is not expressive enough for the query, in which case Dremio also has the ability to utilize Painless, the secure scripting language embedded in Elasticsearch. For example, let’s examine the average score based on the month of the year:
High scores are bad, and low scores are good, so it looks like the summer months are particularly bad when it comes to the sanitary conditions in restaurants.
Note that the query pushed down to Elasticsearch in this case included a Painless script in order to extract the month from the inspection date so that Elasticsearch could aggregate on the month:
In this tutorial we showed powerful SQL queries running on Elasticsearch, opening up a wide range of analytical capabilities on key datasets that have traditionally been impossible to access with SQL-based tools. Dremio rewrites SQL queries using a combination of Elastic’s Query DSL and Painless scripts, all automatically. With the ability to run SQL on Elasticsearch, a number of possibilities open up - using BI tools, R, Python, including Jupyter notebooks, and many more tools.
We also learned a few things about where not to eat in New York City!
In our next tutorial we’ll take a look at visualizing Elasticsearch datasets interactively, using Tableau.