h2h2h2

11 minute read · October 11, 2022

Introducing MAP Data Type in Dremio

Albert Vernon

Albert Vernon · Senior Product Manager, Dremio

Introduction

Data types in Dremio fall into two categories: primitive types such as INT and VARCHAR that hold single values, and semi-structured types like LIST and STRUCT that hold complex values.

Dremio Cloud as of late October and Dremio Software version 23.0 now have a semi-structured MAP data type that allows you to query map data from Apache Parquet files, Apache Iceberg, and Delta Lake. The MAP data type is a collection of key-value pairs and is useful for holding sparse data. 

The remainder of this article describes how to query MAP data in Dremio.

Using the MAP Data Type

Getting Started

If you have queried tables with MAP data using an earlier release of Dremio Cloud or Dremio Software, your cached table metadata has STRUCT as the data type instead of MAP. You must run ALTER TABLE table_name FORGET METADATA on those tables to refresh their metadata caches before you query them so that Dremio knows they have MAP rather than STRUCT. Otherwise, Dremio will continue classifying your data as STRUCT and give an error prompting you to reformat your dataset.

This feature is ON by default. If prefer the old behavior of representing MAP data as STRUCT, set dremio.data_types.map.enabled to OFF under ⚙ (Settings) > Support > Support Keys.

For most data sources, you can define internal schemas that override the data types or names of columns that Dremio detected using ALTER TABLE table_name MODIFY COLUMN or ALTER TABLE table_name RENAME COLUMN, respectively. You will need to re-run ALTER TABLE table_name MODIFY COLUMN or ALTER TABLE table_name RENAME COLUMN to restore any internal schemas you defined earlier. This is because ALTER TABLE table_name FORGET METADATA causes Dremio to forget internal schemas in addition to metadata caches.

Tip: The above steps are only necessary for old tables that you have queried at least once; Dremio will automatically detect MAP data in new tables.

Using the MAP Data Type

As mentioned above, the MAP data type is a collection of key-value pairs. MAP keys are case-insensitive strings. All values in a given map have the same type. For example, map<string, int> represents a mapping where the keys are strings and the values are integers.

The section below is an overview of how you can query the MAP data type. Suppose you have a table called "company" with a column named "employees" that is map<string, string>. You can retrieve the entire MAP column, which returns a collection of key-value pairs:

SELECT employees FROM company

You retrieve the value of MAP elements using column['key'] syntax:

SELECT employees['first_name'] FROM company

You can pass values of MAP elements as arguments to both scalar and aggregate functions:

SELECT UPPER(employees['first_name']) FROM company;
SELECT MAX(employees['first_name']) FROM company;

You can use values of MAP elements in filters:

SELECT *
FROM company
WHERE employees['first_name'] = 'Fred'

If your JDBC client program expects JSON, you can convert from MAP to JSON using CAST and CONVERT_TOJSON like this:

SELECT CAST(CONVERT_TOJSON(employee) AS VARCHAR) FROM company

You can ORDER BY and GROUP BY elements of a MAP if they evaluate to primitive types such as INT or VARCHAR:

SELECT employees['first_name'], COUNT(*)
FROM company
GROUP BY employees['first_name']
ORDER BY employees['first_name']

In addition, the following functions are available for MAP expressions:

SignatureDescription
SIZE(M)Returns number of elements in MAP expression M.
MAP_KEYS(M)Returns all keys from MAP expression M.
MAP_VALUES(M)Returns all values from MAP expression M.
TYPEOF(M)Returns "MAP" if M is a MAP expression.

Supported Sources

Dremio can read MAP data from Parquet files from any of the sources and table formats that it can read Parquet, including filesystem sources, metastores, Iceberg, and Delta Lake.

Differences from STRUCT

Both STRUCT and MAP are semi-structured data types that hold collections of key-value pairs, but they differ in the following ways:

  • MAP values can only be primitive types; STRUCT values can be any type. [This restriction was removed in Dremio v24.]
  • All values in a given MAP must be the same type; STRUCT values can be of different types.

Example

An e-commerce store sells a variety of products. Each category has particular properties that need to be tracked. For example, clothing has size and color, but books have publication dates and cover types (hardcover vs. paperback). Instead of having a "products" table with dozens of columns — most of which are NULL since they are not applicable — the table has a single MAP column named "properties," which is map<string, string>.

The Python 3 script below uses the pyarrow library to create a Parquet file with sample data:

#!python

import pyarrow as pa
import pyarrow.parquet as pq

description = ["Dictionary", "Dress", "Wagon"]

properties = [
    [("Cover", "Hardcover"), ("Publication Year", "2002"), ("Color", "Blue")],
    [("Size", "Medium"), ("Color", "Red")],
    [("Material", "Metal"), ("Color", "Red")],
]

assert len(description) == len(properties)
maparr = pa.array(properties, type=pa.map_(pa.string(), pa.string()))
table = pa.table([description, maparr], names=["description", "properties"])
pq.write_table(table, "products.parquet")

Using the parquet-cli package, you can view the raw sample file with parquet cat products.parquet:

{"description": "Dictionary", "properties": {"Publication Year": "2002", "Cover": "Hardcover", "Color": "Blue"}}
{"description": "Dress", "properties": {"Size": "Medium", "Color": "Red"}}
{"description": "Wagon", "properties": {"Color": "Red", "Material": "Metal"}}

To query the sample data, do the following in the Dremio SQL Runner:

  1. Select the "Datasets" icon in the navigation bar.
  2. Select ⊕ then "Upload File" in the upper right.
  3. Browse to products.parquet and select "Next."
  4. Select "Save" to accept the auto-detected format of Parquet.

Now you can query the "products" table in the usual way. For example:

SELECT TYPEOF(properties) FROM products;
-- MAP
-- MAP
-- MAP
SELECT SIZE(properties) FROM products;
-- 3
-- 2
-- 2
SELECT MAP_KEYS(properties) FROM products;
-- ["Cover","Publication Year","Color"]
-- ["Size","Color"]
-- ["Material","Color"]
SELECT MAP_VALUES(properties) FROM products;
-- ["Hardcover","2002","Blue"]
-- ["Medium","Red"]
-- ["Metal","Red"]
SELECT properties['Cover'] FROM products;
-- Hardcover
-- null
-- null
SELECT description FROM products WHERE properties['Color'] = 'Red';
-- Dress
-- Wagon
SELECT properties FROM products;
-- { "Cover": "Hardcover", "Publication Year": "2002", "Color": "Blue" }
-- { "Size": "Medium", "Color": "Red" }
-- { "Material": "Metal", "Color": "Red" }
SELECT properties['Color'], COUNT(*)
FROM products
GROUP BY properties['Color']
ORDER BY COUNT(*) DESC
LIMIT 2
-- Red, 2
-- Blue, 1

Get Started with Dremio Cloud – It’s Free!

Dremio Cloud: The easy and open, fully managed data lakehouse platform.

Sign Up Now

Everything you need to build, automate, and query your data lakehouse in production.

Ready to Get Started?

Bring your users closer to the data with organization-wide self-service analytics and lakehouse flexibility, scalability, and performance at a fraction of the cost. Run Dremio anywhere with self-managed software or Dremio Cloud.