
Enterprise Data Catalog Enhancements

Introduction

One of the many features that define Dremio as a Data-as-a-Service platform is the ability to catalog data as soon as you connect to it. Until now, Dremio’s data cataloging abilities have been basic: you can search for a field name, and Dremio automatically provides a list of datasets (virtual or physical) that contain the search string in either a field name or a table name.

We understand that searching for data in organizations is usually more complicated than it should be. These are some of the answers we hear from our customers when we ask them how they find the data they need:

  • I ask Bob, he knows. Hopefully Bob isn’t on vacation.
  • I send an email out to lots of people and hopefully one of them knows.
  • I ask on Slack.
  • I stand up and shout across the room.

And once they find out where the data is, in the best case they get details on how to connect by setting up an ODBC connection from their BI tools. After that, it’s up to them to continue along their data journey.

At Dremio, we view finding data as a core requirement of building Data-as-a-Service. Unlike vendors of dedicated data catalog products, we treat the catalog as a key feature of a broader platform that provides analytical abilities, rich security features, data provenance and lineage, and much more.

Dremio 3.0 brings a new set of enhancements to its data cataloging abilities. In this tutorial, we will explore these new features in detail and learn how data consumers can use Dremio’s data catalog to enhance collaboration between teams by giving their datasets more meaning than a simple metadata repository would.

Assumptions

To get the most out of this tutorial, we recommend that you first follow the “Getting Oriented to Dremio” and “Working With Your First Dataset” tutorials.

Searching For Data

In this tutorial, I will play the role of a data consumer who has been tasked with analyzing data from the New York City taxi network. Normally, without a proper data catalog, I would have to shout across my office to find out who recently created the data I’m looking for, or who knows where it resides. It also doesn’t help that the only thing I know about this data is that it has a field whose name contains the string “fare”, which is not much to go on.

In a scenario like this, I have two options:

  • Spend a considerable amount of time searching for the dataset that I need.
  • Re-create the dataset, which leads to duplicated data and costly delays.

Both of these options are time-consuming and expensive. Let’s see how I can solve this challenge using Dremio.

One of the first things that you will notice when you log into the Dremio user interface is the search field in the top-left corner of your workspace.

search field

As mentioned earlier, the only thing I know about the data I need is that it contains a field named “fare”, and that might not even be the full field name. Let’s try searching for it.

list of available fields

Typing “fare” in the search box prompts Dremio to list all the datasets that I’ve been granted access to that contain “fare” either in their own name or in the name of any of their fields. To verify this, we can dig a little deeper and explore the metadata for any of these datasets so that we can make an educated decision about which one to choose.
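To make the matching rule concrete, here is a minimal Python sketch of a substring search over dataset names and field names. It is only a conceptual model, not Dremio’s implementation: the `Dataset` class, the `search_catalog` function, the sample dataset paths, and the case-insensitive comparison are all assumptions made for illustration, and access control is left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical stand-in for a catalog entry (not a Dremio type)."""
    path: str                                   # e.g. "nyc.taxi_trips"
    fields: list = field(default_factory=list)  # field (column) names


def search_catalog(datasets, term):
    """Return datasets whose name or any field name contains `term`.

    Case-insensitive substring matching is an assumption of this sketch.
    """
    needle = term.lower()
    return [
        ds for ds in datasets
        if needle in ds.path.lower()
        or any(needle in name.lower() for name in ds.fields)
    ]


# Made-up sample catalog for illustration only.
catalog = [
    Dataset("nyc.taxi_trips", ["pickup_datetime", "fare_amount", "tip_amount"]),
    Dataset("nyc.fares_2014", ["trip_id", "total"]),
    Dataset("sf.incidents", ["category", "district"]),
]

matches = [ds.path for ds in search_catalog(catalog, "fare")]
print(matches)
```

In this sketch, the first dataset matches because of its `fare_amount` field and the second because of its own name, which mirrors the two kinds of hits described above.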

selected dataset

Hovering over the dataset enables the “i” option; clicking it shows details about the dataset.

select fare amount

Navigating through the metadata, we can verify that this dataset contains the field we are interested in, and that “fare” was only part of the field name. From this point, we can continue exploring the data and analyze it using our BI tool of choice.

Enhanced Collaboration

Up to this point, we have seen that once you connect to your data, Dremio captures metadata from both physical and virtual datasets and makes it searchable for the data consumer. But what if you want to “describe” these datasets, “flag” them, or annotate what a dataset means for a particular user?

In Dremio 3.0, we have given users the ability to create searchable tags, descriptions, and annotations on their datasets to enhance collaboration. This way, users do not have to depend on external sources to get more context about the data they are searching for or are about to use.

Continuing with the use case for this tutorial, Dremio provides a “Catalog” option once we explore the dataset. From there, the end user can perform several activities.

data catalog option

1. Adding Wiki Content

Dremio allows different teams to work together on the same data sources. A common request in this kind of collaborative environment is the ability to describe datasets in a way that is useful to different users and teams. Good descriptions make it easier for users to find the data they need, and they address the problem of knowledge living only in people’s heads, where others cannot access it. This is in contrast to traditional systems, where IT owns the names of fields and data sources, and data consumers have to keep their own notes in an external system such as a spreadsheet, a wiki page, or Post-it notes.

add wiki content

Now, within Dremio, we can create a comprehensive description of any dataset by selecting the “add wiki content” button inside the “catalog” perspective. This makes it easier for users to understand the data sources and what is in them. Users can also use this feature to record key information, such as what has been done to the data source and how other data consumers are using it. For example, a wiki page could follow a template with fields such as the following:

  • Data source
  • Owner
  • Contact
  • Last update
  • Examples of reports that use this data source, with screenshots
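One way to keep such a template consistent across datasets is to generate the wiki text with a small script and paste the result into the wiki editor. The sketch below is a hypothetical helper, not part of Dremio: the function name `render_wiki_template` and all sample values are placeholders, and the Markdown layout is just one possible choice.

```python
def render_wiki_template(data_source, owner, contact, last_update, reports):
    """Build Markdown wiki text from the templatized fields listed above."""
    lines = [
        f"# {data_source}",
        "",
        f"- **Data source:** {data_source}",
        f"- **Owner:** {owner}",
        f"- **Contact:** {contact}",
        f"- **Last update:** {last_update}",
        "",
        "## Reports that use this data source",
    ]
    # One bullet per report; screenshots would be added by hand in the editor.
    lines += [f"- {report}" for report in reports]
    return "\n".join(lines)


# Placeholder values for illustration only.
wiki_text = render_wiki_template(
    data_source="Example dataset",
    owner="Data Engineering",
    contact="data-team@example.com",
    last_update="2018-11-01",
    reports=["Weekly summary dashboard"],
)
print(wiki_text)
```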

Here, I’ve added a description to a new dataset that I uploaded from the San Francisco Police Department’s incidents database.

add wiki content using plain text or markdown

The result is the following:

add images to the wiki content

2. Adding Tags

We have also added the ability to create tags for each dataset. The procedure is the same: head over to the “catalog” view and you will see the “tags” field.

add wiki tags

Here we can add as many tags as needed to identify this dataset. I can add tags like “crimes”, “reports” and “wanted” which then will serve as search terms that will help other users find this particular dataset.

add wiki tags

And the best part is that those tags are searchable. Let’s give it a try. If we head back to our workspace, we will notice that the tags are listed next to the dataset we want to work with.

tags are listed in the main space perspective

Clicking any of these tags automatically triggers a search in Dremio’s catalog and returns a list of all datasets that have that particular tag, or that contain the same string in a field name.

tags are made searchable by dremio

In the image above three things took place. 1) I selected the “San Francisco” tag for my dataset. 2) Dremio immediately started a search for all datasets tagged with that string. 3) Dremio successfully provided me with the “Incidents” dataset which contains the tag “San Francisco”.
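The tag lookup described above can be modeled with a simple inverted index from tags to datasets. This is a conceptual sketch only, not how Dremio stores or searches tags: the `TagIndex` class and its methods are hypothetical, normalizing tags to lowercase is an assumption, and matching on field names is left out to keep the example short.

```python
from collections import defaultdict


class TagIndex:
    """Hypothetical inverted index: tag -> set of dataset names."""

    def __init__(self):
        self._by_tag = defaultdict(set)

    def add_tags(self, dataset, tags):
        for tag in tags:
            # Lowercasing is an assumption so that lookups ignore case.
            self._by_tag[tag.strip().lower()].add(dataset)

    def find(self, tag):
        """Return the sorted names of datasets that carry `tag`."""
        return sorted(self._by_tag.get(tag.strip().lower(), set()))


index = TagIndex()
index.add_tags("Incidents", ["San Francisco", "crimes", "reports", "wanted"])
index.add_tags("nyc.taxi_trips", ["New York", "reports"])  # made-up dataset

print(index.find("San Francisco"))
print(index.find("reports"))
```

A lookup for “San Francisco” returns only the “Incidents” dataset, while a tag shared by several datasets, such as “reports” here, returns all of them.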

Conclusion

In this tutorial, we covered Dremio’s original data cataloging features as well as the enhancements introduced in the 3.0 release. Now, not only is data searchable, but teams can also enhance collaboration by creating comprehensive wiki content and searchable tags for each dataset. We hope you enjoyed this tutorial. Stay tuned to learn more about how you can gain insights from your data faster using Dremio.