
Lucene Expression Push-Downs into Elasticsearch via SQL with Dremio

Intro

Elasticsearch is a popular and powerful search engine based on Apache Lucene. We’ve shown in Unlocking SQL on Elasticsearch and Unlocking Tableau on Elasticsearch that Dremio makes it easy to use popular BI and visualization tools with this technology. In this tutorial we’ll focus on Dremio’s ability to push Lucene syntax into Elasticsearch via SQL, giving users the best of both worlds. Using standard SQL, any tool can query the rich data structures stored in Elasticsearch, while also taking advantage of Lucene’s powerful search syntax through the SQL CONTAINS operator.

Downloading our data

To start, let’s download the raw data from Yelp. Head over to https://www.yelp.com/dataset_challenge/dataset and click on the Download Data button to download the dataset. The downloaded tarball should be about 2GB.

Yelp data source

After you untar the file, you should see six files:

$ du -sh *
126M	business.json
 57M	checkin.json
 23M	photos.json
3.6G	review.json
176M	tip.json
1.5G	user.json

The data inside these files is stored in JSON Lines format, also known as newline-delimited JSON (NDJSON), where each line contains a JSON document. For example, the first line in the business dataset is:

$ head -n 1 business.json
{"business_id": "vcNAWiLM4dR7D2nwwJ7nCA", "full_address": "4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018", "hours": {"Tuesday": {"close": "17:00", "open": "08:00"}, "Friday": {"close": "17:00", "open": "08:00"}, "Monday": {"close": "17:00", "open": "08:00"}, "Wednesday": {"close": "17:00", "open": "08:00"}, "Thursday": {"close": "17:00", "open": "08:00"}}, "open": true, "categories": ["Doctors", "Health & Medical"], "city": "Phoenix", "review_count": 7, "name": "Eric Goldberg, MD", "neighborhoods": [], "longitude": -111.98375799999999, "state": "AZ", "stars": 3.5, "latitude": 33.499313000000001, "attributes": {"By Appointment Only": true}, "type": "business"}

Use jq as an easy way to pretty print this one-line JSON document:

$ head -n 1 business.json | jq .
{
  "business_id": "vcNAWiLM4dR7D2nwwJ7nCA",
  "full_address": "4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018",
  "hours": {
    "Tuesday": {
      "close": "17:00",
      "open": "08:00"
    },
    "Friday": {
      "close": "17:00",
      "open": "08:00"
    },
    "Monday": {
      "close": "17:00",
      "open": "08:00"
    },
    "Wednesday": {
      "close": "17:00",
      "open": "08:00"
    },
    "Thursday": {
      "close": "17:00",
      "open": "08:00"
    }
  },
  "open": true,
  "categories": [
    "Doctors",
    "Health & Medical"
  ],
  "city": "Phoenix",
  "review_count": 7,
  "name": "Eric Goldberg, MD",
  "neighborhoods": [],
  "longitude": -111.983758,
  "state": "AZ",
  "stars": 3.5,
  "latitude": 33.499313,
  "attributes": {
    "By Appointment Only": true
  },
  "type": "business"
}

Load these files into your Elasticsearch cluster using your favorite script or tool. For example, you might want to use the Bulk API. This API doesn’t accept the NDJSON files as-is; instead, it expects an action line declaring the index and type before each document.
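
For reference, the request body the Bulk API expects alternates action lines and documents, roughly like this (values abbreviated from the business record shown earlier):

{ "index" : { "_index" : "yelp", "_type" : "business" } }
{ "business_id" : "vcNAWiLM4dR7D2nwwJ7nCA", "name" : "Eric Goldberg, MD", "city" : "Phoenix", "state" : "AZ", "stars" : 3.5 }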

You can use jq to convert the downloaded files into this structure. Here’s an example with the user.json file:

cat user.json | jq -c '{ index: { _index:"yelp", _type: "user" } }, . ' > user_bulk.json

You can also pipe the output of jq into the Bulk API:

cat user.json | jq -c '{ index: { _index:"yelp", _type: "user" } }, . ' | curl -XPOST localhost:9200/_bulk --data-binary @-
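
The review file in particular is several gigabytes, so pushing everything through a single bulk request may be rejected or time out. One possible workaround (the chunk size here is arbitrary, but keep it even so the action/document pairs are never split across files) is to split the converted output and post each chunk separately:

cat review.json | jq -c '{ index: { _index:"yelp", _type: "review" } }, . ' > review_bulk.json
split -l 20000 review_bulk.json review_chunk_
for f in review_chunk_*; do curl -s -XPOST localhost:9200/_bulk --data-binary @"$f"; done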

Another option is to use esbulk, a bulk loader written in Go. This loader accepts NDJSON as input, so no re-formatting of the downloaded files is required:

./esbulk -index yelp -type user /Users/henrystirman/Downloads/dataset/user.json

For this tutorial we will use the business and review datasets, but feel free to load everything.
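
Whichever loader you use, it’s worth confirming that the documents actually made it into the cluster before moving on. Assuming Elasticsearch is running locally on the default port, a quick check looks something like this:

curl 'localhost:9200/_cat/indices?v'
curl 'localhost:9200/yelp/business/_count?pretty'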

Connecting to Elasticsearch

Access the Dremio UI by pointing your browser to http://localhost:9047 (this assumes you are running Dremio on your local machine with the default port):

Dremio with no connected data sources

In our example, Dremio is not yet connected to any data sources (see Sources area in the bottom left). Click on the New Source button to add the Elasticsearch cluster as a source. In the New Source dialog, click on the Elasticsearch option:

List of data source types

Enter the coordinates of your Elasticsearch cluster. Assuming you are running Elasticsearch on your laptop without authentication, enter the following parameters (the SQL examples later in this tutorial assume the source is named elastic):

Name = elastic (or any other name that you choose)
Host = localhost
Authentication = No Authentication

Connect to Elasticsearch

Click Save to return to Dremio’s main screen.

Notice that we now have one index called yelp, and inside you can see each of the types for this index, including business, review, user, and checkin, as well as any other types you loaded. Dremio uses a hierarchical namespace model, and the canonical path to each type is source.index.type (eg, elastic.yelp.business).

Click on the yelp index:

Contents of the Yelp index

Click on the business dataset to see the data in the Dataset Viewer:

Contents of the business type

Running SQL Queries in Elasticsearch

For the following examples, feel free to use any SQL-based client. There’s a list of supported tools in the Dremio documentation, but most things should work fine over ODBC/JDBC.

Alternatively, you can simply use the SQL window in Dremio to issue queries and see results. Be aware that clicking the “Preview” button only shows a sample of the results (which is always fast); if you want to see the full results, click “Run” instead (sometimes slow, depending on the data source, and usually fast when you have Reflections enabled). In this tutorial we’ll assume you’re using Dremio to issue the SQL queries.
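
Before diving into push-downs, a trivial query is a good way to confirm the source is wired up, using the canonical path described earlier (this assumes you named the source elastic in the connection step):

SELECT name, city, state, stars
FROM elastic.yelp.business
LIMIT 10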

Basic SQL Push-Downs

From the business dataset viewer window, click on the New Query button at the top and enter the following query:

SELECT city, COUNT(*) AS c
FROM elastic.yelp.business
GROUP BY city
ORDER BY c DESC

Be sure to click Run. Depending on the exact version of the data you’ve downloaded from Yelp, Las Vegas (13,599), Phoenix (8,410), and Charlotte (4,224) should be the top three.

Results of SQL query on Elasticsearch

Dremio compiled this SQL query, and the optimizer decided to translate most of the plan into an Elasticsearch Query DSL query that could be pushed down into Elasticsearch:

{
  "size" : 0,
  "query" : {
    "match_all" : { }
  },
  "aggregations" : {
    "city" : {
      "terms" : {
        "field" : "city",
        "missing" : "NULL_STRING_TAG",
        "size" : 2147483647
      }
    }
  }
}
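
If you’d like to see what Elasticsearch itself returns for a query like this, you can paste a similar body into a direct _search request against the index. This is a hand-trimmed approximation of the DSL above, assuming a local cluster; depending on your mappings you may need city.keyword instead of city, and a smaller size keeps the response manageable:

curl -XPOST 'localhost:9200/yelp/business/_search?pretty' -d '
{
  "size" : 0,
  "aggregations" : {
    "city" : {
      "terms" : { "field" : "city", "size" : 50 }
    }
  }
}'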

To see the expression that Dremio pushes down into a source, look at the query profile. There’s a tutorial on how to find your query profile; once you’ve found it, you should see something like the following on the right:

Job History

Then click on the Profile link to see the query profile, and then the Planning tab:

Query Profile

There’s tons of useful information here, and it’s easy to be overwhelmed. Scroll down to the section called “Final Physical Transformation” where you should see something like this:

Query Profile

This isn’t as easy to read as it could be, and we’ll make it easier in a future release, but you can still find the query here if you want to.

As we can see, this is a simple match_all query with an aggregation. Let’s try adding the state column and a few predicates:

SELECT state, city, COUNT(*) AS c
FROM elastic.yelp.business
WHERE 
  state NOT IN ('TX','UT','NM','NJ') AND
  review_count > 100
GROUP BY state, city
ORDER BY c DESC

The resulting DSL is as follows. Notice that the NOT IN predicate is translated into a series of must_not match clauses, and the review_count > 100 filter becomes a range query:

{
  "size" : 0,
  "query" : {
    "bool" : {
      "must" : [ {
        "bool" : {
          "must_not" : {
            "match" : {
              "state" : {
                "query" : "TX",
                "type" : "boolean"
              }
            }
          }
        }
      }, {
        "bool" : {
          "must_not" : {
            "match" : {
              "state" : {
                "query" : "UT",
                "type" : "boolean"
              }
            }
          }
        }
      }, {
        "bool" : {
          "must_not" : {
            "match" : {
              "state" : {
                "query" : "NM",
                "type" : "boolean"
              }
            }
          }
        }
      }, {
        "bool" : {
          "must_not" : {
            "match" : {
              "state" : {
                "query" : "NJ",
                "type" : "boolean"
              }
            }
          }
        }
      }, {
        "range" : {
          "review_count" : {
            "from" : 100,
            "to" : null,
            "include_lower" : false,
            "include_upper" : true
          }
        }
      } ]
    }
  },
  "aggregations" : {
    "state" : {
      "terms" : {
        "field" : "state",
        "missing" : "NULL_STRING_TAG",
        "size" : 2147483647
      },
      "aggregations" : {
        "city" : {
          "terms" : {
            "field" : "city",
            "missing" : "NULL_STRING_TAG",
            "size" : 2147483647
          }
        }
      }
    }
  }
}

Using Lucene Expressions

To use a Lucene expression as a filter on a specific column via SQL, pass the expression for that field to the CONTAINS operator. For example, to search for businesses that have reviews containing the word “amazing”, you would issue the following SQL query. Note that in the following examples the column that holds the review text is called “text”:

SELECT 
  business.name, 
  business.full_address, 
  business.city, 
  business.state, 
  review.text, 
  review."date" AS "date", 
  review.user_id AS user_id, 
  review.stars AS stars, 
  review.text AS text
FROM 
  elastic.yelp.business AS business
INNER JOIN 
  elastic.yelp.review AS review ON business.business_id = review.business_id
WHERE CONTAINS(text:'amazing')

This query performs an inner join between the index type with businesses and the index type with reviews (each business has zero or more reviews). As we explored in other Elasticsearch tutorials, there’s a lot you can do with SQL, so here we’ll focus simply on the CONTAINS predicate.
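
As with the earlier queries, the query profile shows exactly what was pushed down. Conceptually, the Lucene expression inside CONTAINS maps to a query_string-style query scoped to the text field; the following is only a rough sketch of that idea for illustration, not something copied from a profile:

{
  "query" : {
    "query_string" : {
      "fields" : [ "text" ],
      "query" : "amazing"
    }
  }
}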

Let’s try looking for businesses whose reviews mention positive terms without any negative ones. Here we combine Boolean operators and grouping to match the positive terms while excluding the negative ones:

SELECT 
  business.name, 
  business.full_address, 
  business.city, 
  business.state, 
  review.text, 
  review."date" AS "date", 
  review.user_id AS user_id, 
  review.stars AS stars, 
  review.text AS text
FROM 
  elastic.yelp.business AS business
INNER JOIN 
  elastic.yelp.review AS review ON business.business_id = review.business_id
WHERE CONTAINS(text:'amazing OR awesome OR terrific OR fantastic NOT (awful OR terrible OR disgusting OR horrible)')

Lucene also allows us to use term positions to require that words appear near one another. For example, the following proximity search (a slop of 4) matches variations like “this place was amazing” and “we loved that place. It is totally amazing.”:

SELECT 
  business.name, 
  business.full_address, 
  business.city, 
  business.state, 
  review.text, 
  review."date" AS "date", 
  review.user_id AS user_id, 
  review.stars AS stars, 
  review.text AS text
FROM 
  elastic.yelp.business AS business
INNER JOIN 
  elastic.yelp.review AS review ON business.business_id = review.business_id
WHERE CONTAINS(text:'"place amazing"~4 NOT (awful OR terrible)') 

Another useful feature is the ability to use regular expressions. For example, to match common spelling variations of Jonathan while excluding reviews that mention John or Michael:

SELECT 
  business.name, 
  business.full_address, 
  business.city, 
  business.state, 
  review.text, 
  review."date" AS "date", 
  review.user_id AS user_id, 
  review.stars AS stars, 
  review.text AS text
FROM 
  elastic.yelp.business AS business
INNER JOIN 
  elastic.yelp.review AS review ON business.business_id = review.business_id
WHERE CONTAINS(text:'/joh?n(ath[oa]n)/ -john -michael')

At the time of writing, a couple of things do not get pushed down: fuzziness and boosting. In addition, Elasticsearch lets you search across all fields instead of a single field (eg, “text”) using a special field called “_all”; currently, queries that use this feature will fail in Dremio (we hope to fix this soon).

One other note: these push-down capabilities are specific to Elasticsearch and are not supported for other sources. With other sources, CONTAINS behaves like standard SQL, even if they have a full-text index (eg, MySQL, MongoDB).

Conclusion

In this tutorial we showed how you can combine SQL and Lucene search expressions to get the best of both worlds: the broad support for SQL in popular tools like Tableau, and the power of search provided by Elasticsearch. While CONTAINS is a standard operator in SQL, each tool is a little different in how it lets you input the Lucene expression string. For Tableau, we show an example of this in the section “Text Search with Custom SQL” of the Unlocking Tableau on Elasticsearch tutorial.