Dremio Jekyll

Running SQL Joins in Elasticsearch With Dremio

Intro

Elasticsearch is a popular search engine built on Lucene. Data is stored as JSON documents, and Elasticsearch provides powerful search and analytical features. Queries are issued in the Elasticsearch DSL, which is expressed in JSON. The DSL supports the notions of filtering, projections, aggregations, and other SQL primitives, but it does not support joins.

In this tutorial we’ll show you how you can use Dremio to issue standard SQL against Elasticsearch, including joins across indexes. With Dremio you can take any SQL-based tool, such as Tableau, Qlik, Power BI, Python, or R and query Elasticsearch as if it were a relational database. We’ll also show you how you can join between Elasticsearch and other systems, like relational databases, HDFS, and MongoDB.

Assumptions

We assume you are familiar with Elasticsearch and have access to a cluster running Elasticsearch (2.x and 5.x are supported). We also assume you have loaded the Yelp dataset. Instructions on how to download this data and load into Elasticsearch are covered in our tutorial on Data Curation.

About The Data

The Yelp dataset is spread across 5 indexes:

  • Businesses: 61,179
  • Reviews: 1,569,264
  • Users: 366,715
  • Checkins: 45,166
  • Tips: 1,028,802

In this dataset, all business have one or more reviews, many have tips, and some have checkins (a visit to the business by a user, created through the Yelp app manually, which explains why there are so few). The reviews are written by users, and clearly some users write many more than others.

Looking through a relational lens, we could visualize these indexes in an ER diagram like the following:

Yelp dataset as ER diagram

With SQL we can easily ask questions about this dataset. Let’s work through a few examples.

Who are the top 10 users in terms of the number of reviews?

Looking at users and their reviews together is probably something that’s useful for a number of different queries, so let’s make this a little easier for everyone by first building a virtual dataset. This is easy to do in Dremio. First, navigate to the user dataset:

Navigate to the user physical dataset

Then click the Join button to build the join to the review dataset:

Build the join between users and reviews

Use the user_id as the join key between the two tables, and be sure to select left outer join as the join type so we get all users even if they don’t have any reviews:

Specify the join key and join type

Click apply, and now you have a virtual dataset that includes all users and their reviews. Let’s save this in a space (here, the space is called yelp_challenge) and call the virtual dataset user+reviews:

Save the virtual dataset as user+review in the yelp_challenge space

Now it’s easy to write the query for the top 10 users based on number of reviews:

1
2
3
4
5
SELECT user_id, name, COUNT(*) as "review_count"
FROM yelp_challenge."user+review"
GROUP BY user_id, name
ORDER BY COUNT(*) DESC
LIMIT 10

Top 10 Users By Review Count

What is the average number of stars given by users with 5 or more reviews?

This query benefits from the virtual dataset we built for the last query. We can easily show the top 10 in terms of number of reviews:

1
2
3
4
5
6
7
8
9
SELECT user_id, name, num_reviews, avg_stars
FROM (
  SELECT user_id, name, COUNT(*) as "num_reviews", AVG(stars) as "avg_stars"
  FROM yelp_challenge."user+review"
  GROUP BY user_id, name
) nested_0
 WHERE 5 <= num_reviews
 ORDER BY num_reviews DESC, avg_stars DESC
 LIMIT 10

Interestingly, there is only one user, Gabi, whose average is above 4. Everyone else is between 3 and 4:

Top 10 Users By Review Count

Which users always give five stars, and which always give only one (ignoring users with only one review)?

Here’s the query for “haters” who never give more than one star:

1
2
3
4
5
6
7
8
SELECT user_id, name, stars, review_count
FROM (
  SELECT user_id, name, MAX(stars) as "stars", COUNT(*) as "review_count"
  FROM yelp_challenge."user+review"
  GROUP BY user_id, name
  ORDER BY review_count DESC
) nested_0
 WHERE stars = 1.0

Hater reviews

And here’s the query for “lovers” who only give 5 star reviews:

1
2
3
4
5
6
7
8
SELECT user_id, name, stars, review_count
FROM (
  SELECT user_id, name, MIN(stars) as "stars", COUNT(*) as "review_count"
  FROM yelp_challenge."user+review"
  GROUP BY user_id, name
  ORDER BY review_count DESC
) nested_0
 WHERE stars = 5.0

Lover reviews

Overall there are 114,023 reviewers who never give less than 5 stars, and 42,004 reviewers who never give more than one star. Fortunately for these businesses, people seem to be more inclined to post a favorable review. This is actually surprising - many people will only take the time to post when they are very disappointed in their experience, yet there are almost 3x as many people who give 5 stars exclusively.

Which city gets the best reviews, and which one gets the worst?

Just as we built a virtual dataset for users and reviews to simplify analysis, we can do the same for businesses and reviews. You can follow a similar process to create this virtual dataset, and in this case it also makes sense to use a left outer join to keep all businesses, even if they don’t have reviews. In our example we called this bus+reviews and saved it in our yelp_challenge space.

The SQL used to define this virtual dataset (which is created automatically, based on the creation process) is:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
SELECT 
  nested_0.hours AS hours, 
  nested_0.neighborhoods AS neighborhoods, 
  nested_0.city AS city, 
  nested_0.latitude AS latitude, 
  nested_0.review_count AS review_count, 
  nested_0.full_address AS full_address, 
  nested_0.stars AS stars, 
  nested_0.type AS type, 
  nested_0.name AS name, 
  nested_0.attributes AS attributes, 
  nested_0.categories AS categories, 
  nested_0.state AS state, 
  nested_0."open" AS "open", 
  nested_0.longitude AS longitude, 
  nested_0."_index" AS "_index", 
  nested_0."_type" AS "_type", 
  nested_0."_uid" AS "_uid", 
  nested_0.business_id AS business_id, 
  join_review.business_id AS business_id0, 
  join_review."date" AS "date", 
  join_review.review_id AS review_id, 
  join_review.user_id AS user_id, 
  join_review.votes AS votes, 
  join_review.stars AS stars0, 
  join_review.text AS text, 
  join_review.type AS type0, 
  join_review."_index" AS "_index0", 
  join_review."_type" AS "_type0", 
  join_review."_uid" AS "_uid0"
FROM (
  SELECT 
    hours, 
    neighborhoods, 
    city, 
    latitude, 
    review_count, 
    full_address, 
    stars, 
    type, 
    name, 
    attributes, 
    categories, 
    state, 
    business_id, 
    business."open" AS "open", 
    longitude, 
    "_index", 
    "_type", 
    "_uid"
  FROM elastic.yelp.business
) nested_0
LEFT JOIN 
  elastic.yelp.review AS join_review 
ON
  nested_0.business_id = join_review.business_id

Looking at the data, lots of small cities have only a few reviews, so let’s limit this analysis to cities with 100 or more reviews. Here are the best cities:

1
2
3
4
5
6
7
8
SELECT city, avg_stars, total_reviews
FROM (
  SELECT city, TRUNCATE(AVG(stars),2) as "avg_stars", COUNT(*) as "total_reviews"
  FROM yelp_challenge."bus+review"
  GROUP BY city
  ORDER BY AVG(stars) DESC
) nested_0
 WHERE 100 <= total_reviews

Best cities

And here are the worst. Sorry North Los Vegas!

Worst cities

Actually, if you remove the 100 review minimum, there are only 2 cities with an average of less than 2 stars, and one of them is San Diego. This is based on only 23 reviews, which suggests the data sample is small and non-representative for this city.

These are just some simple examples, feel free to explore on your own.

Combining Elasticsearch With Other Sources

So far we’ve been working with this data in Elasticsearch. But we can do the same sort of analysis with the data in other sources, such as HDFS, Amazon S3, or your local filesystem. In the environment I’m using here, I happen to have the same data in MongoDB and HDFS.

Here’s an example of a query that shows restaurants with great reviews based on the text using a search in Elasticsearch, filtered by the location of the business as stored in MongoDB, limited by reviews from users that have a minimum of 20 reviews as stored in HDFS.

I broke the work into one virtual dataset per source, then joined them together for a fourth and final dataset. This approach has the advantage of allowing you to work on each “leg” independently. For each, the query was built through the GUI (except the Elasticsearch query, where the search expression had to be entered into the SQL manually).

First, here’s the Elasticsearch query, which pushes down a Lucene expression using the SQL contains operator:

1
2
3
SELECT *
FROM elastic.yelp.review
WHERE CONTAINS(text:'excellent OR awesome OR amazing OR fantastic OR killer NOT (terrible OR awful OR disgusting OR worst)')

Next, here’s the MongoDB query, which un-nests the category array and filters on Restaurant, as well as state, and limits to a minimum of 50 reviews:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
SELECT 
  "_id", 
  business_id, 
  full_address, 
  hours, 
  nested_0."open" AS "open", 
  categories, 
  city, 
  review_count, 
  name, 
  neighborhoods, 
  longitude, 
  state, 
  stars, 
  latitude, 
  attributes, 
  type
FROM (
  SELECT 
    "_id", 
    business_id, 
    full_address, 
    hours, 
    business."open" AS "open", 
    flatten(categories) AS categories, 
    city, 
    review_count, 
    name, 
    neighborhoods, 
    longitude, 
    state, 
    stars, 
    latitude, 
    attributes, 
    type
  FROM mongo.yelp.business
) nested_0
 WHERE 
  (categories = 'Restaurants') AND 
  (state IN ('PA', 'AZ')) AND
  (50 <= review_count)

Then there’s the HDFS query, which limits reviews to those from users with more than 3:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
SELECT 
  business_id, 
  "_id", 
  votes, 
  user_id, 
  review_id, 
  stars0, 
  "reviews.json"."date" AS "date", 
  text, 
  type0
FROM 
  hdfs.data.yelp."reviews.json"
WHERE 
  3 <= stars0

The whole query together, which is created through the join dialog so you (thankfully) don’t need to write yourself:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
SELECT 
  nested_1."_id" AS "_id", 
  nested_1.full_address AS full_address, 
  nested_1.hours AS hours, 
  nested_1."open" AS "open", 
  nested_1.categories AS categories, 
  nested_1.city AS city, 
  nested_1.review_count AS review_count, 
  nested_1.name AS name, 
  nested_1.neighborhoods AS neighborhoods, 
  nested_1.longitude AS longitude, 
  nested_1.state AS state, 
  nested_1.stars AS stars, 
  nested_1.latitude AS latitude, 
  nested_1.attributes AS attributes, 
  nested_1.type AS type, 
  nested_1.business_id AS business_id, 
  nested_1.business_id0 AS business_id0, 
  nested_1."date" AS "date", 
  nested_1.user_id AS user_id, 
  nested_1.votes AS votes, 
  nested_1.stars0 AS stars0, 
  nested_1.text AS text, 
  nested_1.type0 AS type0, 
  nested_1."_index" AS "_index", 
  nested_1."_type" AS "_type", 
  nested_1."_uid" AS "_uid", 
  nested_1.review_id AS review_id, 
  join_pos_reviews.review_id AS review_id0, 
  join_pos_reviews.business_id AS business_id1, 
  join_pos_reviews."_id" AS "_id0", 
  join_pos_reviews.votes AS votes0, 
  join_pos_reviews.user_id AS user_id0, 
  join_pos_reviews.stars0 AS stars00, 
  join_pos_reviews."date" AS date0, 
  join_pos_reviews.text AS text0, 
  join_pos_reviews.type0 AS type00
FROM (
  SELECT 
    nested_0."_id" AS "_id", 
    nested_0.full_address AS full_address, 
    nested_0.hours AS hours, 
    nested_0."open" AS "open", 
    nested_0.categories AS categories, 
    nested_0.city AS city, 
    nested_0.review_count AS review_count, 
    nested_0.name AS name, 
    nested_0.neighborhoods AS neighborhoods, 
    nested_0.longitude AS longitude, 
    nested_0.state AS state, 
    nested_0.stars AS stars, 
    nested_0.latitude AS latitude, 
    nested_0.attributes AS attributes, 
    nested_0.type AS type, 
    nested_0.business_id AS business_id, 
    "join_great-reviews".business_id AS business_id0, 
    "join_great-reviews"."date" AS "date", 
    "join_great-reviews".review_id AS review_id, 
    "join_great-reviews".user_id AS user_id, 
    "join_great-reviews".votes AS votes, 
    "join_great-reviews".stars AS stars0, 
    "join_great-reviews".text AS text, 
    "join_great-reviews".type AS type0, 
    "join_great-reviews"."_index" AS "_index", 
    "join_great-reviews"."_type" AS "_type", 
    "join_great-reviews"."_uid" AS "_uid"
  FROM (
    SELECT 
      "_id", 
      business_id, 
      full_address, 
      hours, 
      "restaurants+PA+AZ"."open" AS "open", 
      categories, 
      city, 
      review_count, 
      name, 
      neighborhoods, 
      longitude, 
      state, 
      stars, 
      latitude, 
      attributes, 
      type
    FROM yelp_challenge."restaurants+PA+AZ"
  ) nested_0
   INNER JOIN 
    yelp_challenge."great-reviews" AS "join_great-reviews" ON 
    nested_0.business_id = "join_great-reviews".business_id
) nested_1
 INNER JOIN 
  yelp_challenge.pos_reviews AS join_pos_reviews ON 
  nested_1.review_id = join_pos_reviews.review_id

You can save this as a virtual dataset called “vetted_restaurants” in your yelp_challenge space then run this simple query:

1
2
3
4
5
6
7
8
9
10
11
12
13
SELECT 
  name, 
  city, 
  review_count, 
  AVG(stars) as stars
FROM 
  yelp_challenge.vetted_restaurants
GROUP BY 
  name, 
  city, 
  review_count
ORDER BY stars DESC
LIMIT 10

And here are the those “vetted” restaurants. Not only do they have positive reviews in terms of stars (only from reviewers with 3 or more reviews), but also qualitative reviews based on the language of the review:

Vetted Restaurants

One nice feature of Dremio Enterprise Edition is the ability to analyze the lineage of this virtual dataset:

Data Lineage - MongoDB, Elasticsearch, HDFS

You can clearly see that the Top Vetted Restaurants is derived from the Vetted Restaurants, which in turn are derived from other virtual datasets. You can also see that these trace back to Elasticsearch, MongoDB, and HDFS. For each step along the way, you can quickly navigate to see all the queries ever issued against any of the virtual or physical datasets, who issued the query, the number of results, etc. It’s a powerful way to make sense of how your data is being analyzed across all your tools and all your users.

Next Steps

If you’d like to know more about how Dremio works with Elasticsearch, we suggest reading the chapter in the docs and asking questions on the community site. We continue to build out the capabilities of Dremio with Elasticsearch, and are interested in your thoughts and feedback.

As a next step, you can look at Reflections as a way to accelerate your queries on Elasticsearch. This has the added advantage of offloading your analytical queries from your Elasticsearch cluster.

You can also look at using Tableau and other BI and visualization tools with Dremio as a front end to your analysis of your Elasticsearch data with Dremio.