Compiling SQL to Elasticsearch Painless

Dremio Authors: Insights and Perspectives Dremio Team

Intro

The Elasticsearch Query DSL is a powerful and simple way to express queries in Elasticsearch using JSON. Painless is a simple, secure scripting language for inline and stored scripts. When considered together, it is possible to map most SQL queries to Elasticsearch efficiently and with high performance.

In this tutorial we will look at how core SQL concepts for data structures and operations map to equivalent concepts in Elasticsearch. We will also explore how different SQL operators can be automatically mapped to equivalent Elasticsearch expressions, include numeric, string, date, Boolean, and set operators, through the DSL and Painless scripts.

Dremio allows business analysts and data scientists to use Elasticsearch without having to learn Query DSL. Dremio automatically translates SQL queries into Query DSL (and Painless scripts), and can augment Elasticsearch’s execution capabilities so that users can run any SQL query, including arbitrary joins, accurate counts (and count distincts) and more. Also, data reflections can speed up execution of analytical queries by orders of magnitude.

Assumptions

For this tutorial we assume you’re familiar with Elasticsearch and know how to load data into your cluster. We recommend you first read Getting Oriented to Dremio and Working With Your First Dataset, and you might also review Unlocking SQL on Elasticsearch and Unlocking Tableau on Elasticsearch.

Mapping SQL Concepts to Elasticsearch

If you come from the traditional world of relational databases, it can be helpful to think about how the core data model of Elasticsearch is similar to that of SQL, and how it is different. Here are several of the high-level concepts:

Relational	Elasticsearch
Schema	Mapping
Database	Index
Table	Type
Row	Document
Column	Field
SQL	DSL+Painless
Aggregation	Aggregation
Projection	Projection
Boolean	Boolean
Primary Key	_id Field
Join	Does not exist
Foreign Key	Does not exist

One of the most important differences is that Elasticsearch does not support joins. Instead of normalizing data, users are encouraged to de-normalize their records whenever possible. For this reason perhaps there are no foreign keys either. While joins are computationally expensive operations, there are some situations where they are still useful, as we will see in a few of the examples.

Note: at the time of writing, Types are still supported. Elasticsearch has communicated to their community that these are being deprecated in future releases. Presumably the Index will take on the role of the abstraction that is most like a relational table.

Download the Sample Data

To start, let’s download the raw data

You should see five files:

yelp_academic_dataset_business.json
yelp_academic_dataset_checkin.json
yelp_academic_dataset_review.json
yelp_academic_dataset_tip.json
yelp_academic_dataset_user.json

The data inside these files is stored in JSON Lines format, also known as newline-delimited JSON, where each line contains a JSON document. For example, the first line in the business dataset is:

$  head -n 1 yelp_academic_dataset_business.json
{"business_id": "vcNAWiLM4dR7D2nwwJ7nCA", "full_address": "4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018", "hours": {"Tuesday": {"close": "17:00", "open": "08:00"}, "Friday": {"close": "17:00", "open": "08:00"}, "Monday": {"close": "17:00", "open": "08:00"}, "Wednesday": {"close": "17:00", "open": "08:00"}, "Thursday": {"close": "17:00", "open": "08:00"}}, "open": true, "categories": ["Doctors", "Health & Medical"], "city": "Phoenix", "review_count": 7, "name": "Eric Goldberg, MD", "neighborhoods": [], "longitude": -111.98375799999999, "state": "AZ", "stars": 3.5, "latitude": 33.499313000000001, "attributes": {"By Appointment Only": true}, "type": "business"}

Use jq as an easy way to pretty print this one-line JSON document:

$  head -n 1 yelp_academic_dataset_business.json | jq .
{
  "business_id": "vcNAWiLM4dR7D2nwwJ7nCA",
  "full_address": "4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018",
  "hours": {
    "Tuesday": {
      "close": "17:00",
      "open": "08:00"
    },
    "Friday": {
      "close": "17:00",
      "open": "08:00"
    },
    "Monday": {
      "close": "17:00",
      "open": "08:00"
    },
    "Wednesday": {
      "close": "17:00",
      "open": "08:00"
    },
    "Thursday": {
      "close": "17:00",
      "open": "08:00"
    }
  },
  "open": true,
  "categories": [
    "Doctors",
    "Health & Medical"
  ],
  "city": "Phoenix",
  "review_count": 7,
  "name": "Eric Goldberg, MD",
  "neighborhoods": [],
  "longitude": -111.983758,
  "state": "AZ",
  "stars": 3.5,
  "latitude": 33.499313,
  "attributes": {
    "By Appointment Only": true
  },
  "type": "business"
}

Connecting to Elasticsearch

Access the Dremio UI by connecting your browser to http://localhost:9047 (this assumes that you are running Dremio on your laptop machine, with the default port, for the purpose of this tutorial):

In our example, Dremio is not yet connected to any data sources (see Sources area in the bottom left). Click on the New Source button to add the Elasticsearch cluster as a source. In the New Source dialog, click on the Elasticsearch option:

Enter the coordinates of your Elasticsearch cluster. Assuming you are running Elasticsearch on your laptop, you should enter the following parameters: Name = ES5 (or any other name that you choose) Host = localhost Authentication = No Authentication

Click Save to return to Dremio’s main screen.

Notice that we now have one index called Yelp, and inside you can see each of the types for this index, including business, review, user, and checkin, as well as any other types you loaded. Dremio uses a hierarchical namespace model, and the canonical path to each type is source.index.type (eg, ES5.yelp.business).

Click on the yelp index:

Click on the business dataset to see the data in the Dataset Viewer:

Running SQL Queries in Elasticsearch

For the following examples, feel free to use any SQL-based client. There’s a list of supported tools here, but most things should work fine over ODBC/JDBC.

Alternately, you can simply use the SQL window in Dremio to issue queries and see results. Be aware that clicking the “Preview” button will only show you a sample of the results (which is always fast), and if you want to see the full results you should instead click “Run” (sometimes slow, depending on the data source and usually fast when you have Reflections enabled). In this tutorial we’ll assume you’re using Dremio to issue SQL queries.

Basic SQL Push-Downs

From the business dataset viewer window, Click on the New Query button at the top and enter the following query:

SELECT city, COUNT(*) AS c
FROM elastic.yelp.business
GROUP BY city
ORDER BY c DESC

Be sure to click Run. Depending on the precise data you’ve downloaded from Yelp, Las Vegas (13,599), Phoenix (8,410), and Charlotte (4224) are the top 3.

Dremio compiled this SQL query, and the optimizer decided to translate most of the plan into an Elasticsearch Query DSL query that could be pushed down into Elasticsearch:

{
  "size" : 0,
  "query" : {
    "match_all" : { }
  },
  "aggregations" : {
    "city" : {
      "terms" : {
        "field" : "city",
        "missing" : "NULL_STRING_TAG",
        "size" : 2147483647
      }
    }
  }
}

The way to see the expression that is pushed-down into a source in Dremio is to look at the query profile. There’s a tutorial on how to find your query profile. Once you’ve found it, you should see something like the following on the right:

Then click on the the Profile link to see the query profile, then the Planning tab:

There’s tons of useful information here, and it’s easy to be overwhelmed. Scroll down to the section called “Final Physical Transformation” where you should see something like this:

This isn’t as easy to read as it could be, and we’ll make it easier in a future release, but you can still find the query here if you want to.

As we can see this is a simple match all query, with an aggregation. Let’s try adding the state column and a few predicates:

SELECT state, city, COUNT(*) AS c
FROM elastic.yelp.business
WHERE
  state NOT IN ('TX','UT','NM','NJ') AND
  review_count > 100
GROUP BY state, city
ORDER BY c DESC

The resulting DSL is as follows:

{
  "size" : 0,
  "query" : {
    "bool" : {
      "must" : [ {
        "bool" : {
          "must_not" : {
            "match" : {
              "state" : {
                "query" : "TX",
                "type" : "boolean"
              }
            }
          }
        }
      }, {
        "bool" : {
          "must_not" : {
            "match" : {
              "state" : {
                "query" : "UT",
                "type" : "boolean"
              }
            }
          }
        }
      }, {
        "bool" : {
          "must_not" : {
            "match" : {
              "state" : {
                "query" : "NM",
                "type" : "boolean"
              }
            }
          }
        }
      }, {
        "bool" : {
          "must_not" : {
            "match" : {
              "state" : {
                "query" : "NJ",
                "type" : "boolean"
              }
            }
          }
        }
      }, {
        "range" : {
          "review_count" : {
            "from" : 100,
            "to" : null,
            "include_lower" : false,
            "include_upper" : true
          }
        }
      } ]
    }
  },
  "aggregations" : {
    "state" : {
      "terms" : {
        "field" : "state",
        "missing" : "NULL_STRING_TAG",
        "size" : 2147483647
      },
      "aggregations" : {
        "city" : {
          "terms" : {
            "field" : "city",
            "missing" : "NULL_STRING_TAG",
            "size" : 2147483647
          }
        }
      }
    }
  }
}

Now let’s try a query without aggregation that includes projection, sorting, and a few predicates:

SELECT state, city, name, review_count
FROM elastic.yelp.business
WHERE
  state NOT IN ('TX','UT','NM','NJ') AND
  review_count > 100
ORDER BY review_count DESC, state, city
LIMIT 10

This time the DSL is very similar, but the LIMIT is being performed in Dremio:

{
  "from" : 0,
  "size" : 4000,
  "query" : {
    "bool" : {
      "must" : [ {
        "bool" : {
          "must_not" : {
            "match" : {
              "state" : {
                "query" : "TX",
                "type" : "boolean"
              }
            }
          }
        }
      }, {
        "bool" : {
          "must_not" : {
            "match" : {
              "state" : {
                "query" : "UT",
                "type" : "boolean"
              }
            }
          }
        }
      }, {
        "bool" : {
          "must_not" : {
            "match" : {
              "state" : {
                "query" : "NM",
                "type" : "boolean"
              }
            }
          }
        }
      }, {
        "bool" : {
          "must_not" : {
            "match" : {
              "state" : {
                "query" : "NJ",
                "type" : "boolean"
              }
            }
          }
        }
      }, {
        "range" : {
          "review_count" : {
            "from" : 100,
            "to" : null,
            "include_lower" : false,
            "include_upper" : true
          }
        }
      } ]
    }
  }
}

If you’re curious how the fields determined when expressing the query, that is included just above in the query profile. In this case you can see the line just before the DSL:

ElasticScan(resource=[yelp/business], columns=[[city, review_count, name, state]], pushdown

If you want to see exactly what queries are being issued to Elasticsearch by Dremio, you can enable the slowlog.log feature in Elasticsearch to write out all the queries whose processing time exceeds a threshold. Be sure to set it back to a reasonable setting when you’re done. 🙂

Windowing Functions

Taking the last examples a little further, there are common query patterns that benefit from more recent additions to SQL, such as Windowing. If you’re working with time-series data, for example, Windowing Functions are essential as they allow you to make sense of patterns within data that are very difficult or inefficient to otherwise analyze.

For a simple example, consider the restaurant reviews we were looking at before. It is very simple to assess the count of restaurant reviews per restaurant as shown above. But what if we wanted to understand the total number of reviews as a percentage of all reviews? In essence, calculate the total number of reviews for each restaurant, then compare that to the total number of reviews across all restaurants.

This is pretty easy if you’re willing to do multiple selects from your application, but that doesn’t scale as easily as it could, and coordinating predicates across queries can become fairly complex. With Windowing things are easier once you master the syntax. Here’s one example:

SELECT
  city,
  c,
  (c*1.0/total)*100 AS pct
FROM (
  SELECT
    city,
    c,
    SUM(c) OVER () AS total
  FROM (
    SELECT
      city,
      COUNT(*) AS c
    FROM
      elastic.yelp.business
    GROUP BY
      city
  )
ORDER BY
  pct DESC

This is compiled into Elastic DSL as the following:

{
  "size" : 0,
  "query" : {
    "match_all" : { }
  },
  "aggregations" : {
    "city" : {
      "terms" : {
        "field" : "city",
        "missing" : "NULL_STRING_TAG",
        "size" : 2147483647
      }
    }
  }
}

In this case Dremio pushes the basic aggregation calculation for all cities into Elasticsearch, and calculates the percentage of the whole on its own.

Here’s a slightly more complex example where the number of reviews and the average score is compared to the aggregate score for each city and across all cities when businesses have more than 100 reviews. You can see that calculating the same using multiple queries would be much more challenging in this case.

SELECT
  city,
  name,
  bus_review_count,
  bus_avg_stars,
  city_avg_stars,
  all_avg_stars
FROM (
  SELECT
    city,
    name,
    bus_review_count,
    bus_avg_stars,
    AVG(bus_avg_stars) OVER (PARTITION BY city) as city_avg_stars,
    AVG(bus_avg_stars) OVER () as all_avg_stars,
    SUM(bus_review_count) OVER () AS total_reviews
  FROM (
    SELECT
      city,
      name,
      AVG(stars) as bus_avg_stars,
      SUM(review_count) AS bus_review_count
  	FROM
    	elastic.yelp.business
  	GROUP BY
    	city, name
  	)
  )
WHERE
  bus_review_count > 100
ORDER BY
  bus_avg_stars DESC,
  bus_review_count DESC

The results here are a little surprising - it looks like there are a number of restaurants with literally hundreds of 5 star reviews across Las Vegas, Phoenix, Pittsburg, and Scottsdale.

How can this be? It turns out there is one record per business, and the number of reviews and stars are aggregations themselves. To get an accurate sense of the total number of reviews and average stars per business we need to join to the reviews type, an entirely different set of records in our Elasticsearch cluster.

While Elasticsearch doesn’t support joins, Dremio makes this pretty simple. We can modify our query slightly to perform a left outer join and change our aggregates to use fields in the reviews data:

SELECT
  city,
  name,
  bus_review_count,
  bus_avg_stars,
  city_avg_stars,
  all_avg_stars
FROM (
  SELECT
    city,
    name,
    bus_review_count,
    bus_avg_stars,
    AVG(bus_avg_stars) OVER (PARTITION BY city) AS city_avg_stars,
    AVG(bus_avg_stars) OVER () AS all_avg_stars,
    SUM(bus_review_count) OVER () AS total_reviews
  FROM (
    SELECT
      city,
      name,
      AVG(review.stars) AS bus_avg_stars,
      COUNT(review.review_id) AS bus_review_count
    FROM
      elastic.yelp.business AS business
      LEFT OUTER JOIN elastic.yelp.review AS review ON business.business_id = review.business_id
    GROUP BY
      city, name
  )
)
WHERE bus_review_count > 100
ORDER BY bus_avg_stars DESC, bus_review_count DESC

And now our results look a lot more believable:

Top Rated Restaurants Comparison, take 2

Painless Expressions

In some cases, the Elasticsearch Query DSL is not expressive enough for the query, in which case Dremio also has the ability to utilize Painless, the secure scripting language embedded in Elasticsearch. For example, if we query the reviews and want to see if there are months where there are any seasonal differences in terms of number of reviews, we could issue the following query:

SELECT EXTRACT(MONTH FROM "date") as my_month, COUNT( * ) AS c
FROM "elastic-remote".yelp.review
GROUP BY EXTRACT(MONTH FROM "date")
ORDER BY c DESC

Sure enough, there are about 50% more reviews in August than in February:

Each review has a datetime, so we need to extract the month from this value. There are certain SQL expressions that are pushed-down as Painless scripts when used in a filter or aggregation expression. Extracting parts of time from date and datetime values is one example. This SQL produces the following query for Elasticsearch:

{
  "size" : 0,
  "query" : {
    "match_all" : { }
  },
  "aggregations" : {
    "my_month" : {
      "terms" : {
        "script" : {
          "inline": "(def) ((doc[\"date\"].empty) ? null : doc[\"date\"].date.monthOfYear)",          
          "lang" : "painless"
        },
        "missing" : -9223372036854775808,
        "size" : 2147483647
      }
    }
  }
}

In Elasticsearch 2.x you will see something similar, compiled to Groovy:

{
  "size" : 0,
  "query" : {
    "match_all" : { }
  },
  "aggregations" : {
    "my_month" : {
      "terms" : {
        "script" : {
          "inline" : "(doc[\"date\"].empty) ? null : doc[\"date\"].date.monthOfYear",
          "lang" : "groovy"
        },
        "missing" : -9223372036854775808,
        "size" : 2147483647
      }
    }
  }
}

Other types of expressions that are pushed down as Painless include:

Math operators
String operators
Numeric operators
Date operators
Case statements
Existence Functions
Cast Functions

Keep in mind these need to be used in a filter or aggregation to be pushed down, otherwise Dremio will perform these types of expressions in it’s own SQL engine.

The good news is that all of these options are transparent to you as a user - you send your SQL to Dremio, and it decides the best way to optimize the query.

Conclusion

In this tutorial we showed a number of different SQL expressions and how they are compiled into Elasticsearch queries, including DSL and Painless (and Groovy for 2.x). With Dremio users can apply existing skills in SQL to unlock the powerful features of Elasticsearch. They can also use their favorite SQL based tools, such as Tableau, Power BI, and Qlik with Elasticsearch, unlocking the data in these clusters to a new audience, without making copies or using any ETL.

Next Steps

As a next step you should take a look at Dremio’s data reflections as a way to offload these analytical queries from your Elasticsearch cluster, and to speed up analytical queries dramatically.

Also, for more detail, take a look at our docs on the Elasticsearch topic.

Compiling SQL to Elasticsearch Painless

Table of Contents

Table of Contents

Intro

Assumptions

Mapping SQL Concepts to Elasticsearch

Download the Sample Data

Connecting to Elasticsearch

Running SQL Queries in Elasticsearch

Basic SQL Push-Downs

Windowing Functions

Painless Expressions

Conclusion

Next Steps

Ready to Get Started? Here Are Some Resources to Help

Whitepaper

Dremio Upgrade Testing Framework

Whitepaper

Operating Dremio Cloud Runbook

Webinars

Unlock the Power of a Data Lakehouse with Dremio Cloud

Get Started Free

See Dremio in Action

Talk to an Expert

Ready to Get Started?