Enterprise Data Catalog Enhancements
Dremio’s semantic layer is an integrated, searchable catalog that indexes all of your metadata, so business users can easily make sense of your data. Virtual datasets and spaces make up the semantic layer, and are all indexed and searchable.
We understand that searching for data in organizations is usually more complicated than it should be. These are some of the answers we hear from our customers when we ask them how they find the data they need:
- I ask Bob, he knows. Hopefully Bob is available.
- I send an email out to lots of people and hopefully one of them knows.
- I ask on Slack.
- I stand up and shout across the room.
And once they find out where the data is, the best case is that they get details on how to connect, for example by setting up an ODBC connection for their BI tool. From there, it’s up to them to continue along their data journey.
At Dremio, we view finding and accessing cloud data as a core requirement. Unlike dedicated data catalog products, we view this as a key feature of a broader platform that provides analytical abilities, rich security features, data provenance and lineage, and much more.
In this tutorial, we will learn how data consumers can use Dremio to enhance collaboration between teams by giving their datasets more context than a simple metadata repository would provide.
Searching For Data
In this tutorial, I will play the role of a data consumer who has been tasked with analyzing a dataset that contains taxi ridership data from the city of New York. Normally, I would have to shout across my office to find out who recently created the data that I’m looking for, or who knows where it resides. It also doesn’t help that the only fact that I know about this data is that there is a field that contains the string “fare”, which is really not that much information.
In a scenario like this, I have two options:
- Spend a considerable amount of time searching for the dataset that I need.
- Re-create the dataset, which will translate into replication of data and costly delays.
Both of these options are time consuming and expensive. Let’s see how we can solve this challenge using Dremio.
One of the first things that you will notice when you log into the Dremio UI is the search field in the top left corner of your working space.
The only fact that I know about the data that I need is that it contains a field named “fare”, and it is possible that it might not even be the full field-name. Let’s try to search for it.
Typing “fare” in the search box prompts Dremio to list all the datasets that I’ve been granted access to that contain “fare” in their name or in one of their field names. To verify this, we can dig a little deeper and explore the metadata for any of these datasets so we can make an educated decision about which one to choose.
Hovering over the dataset enables the “i” option; clicking on it lets us see details about this dataset.
Navigating through the metadata, we can verify that this dataset contains the field that we want. We also see that “fare” was only part of the field name. From this point we can go ahead and continue exploring the data and perform analysis on it using our BI tool of choice.
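To make the matching behavior described above concrete, here is a minimal in-memory sketch of a substring search over dataset names and field names. It is only an illustration, not Dremio’s implementation: the `Dataset` class, the `search_catalog` function, the “SF_Incidents” entry, and all field names are assumptions invented for this example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Dataset:
    """A toy catalog entry: a dataset name plus its field (column) names."""
    name: str
    fields: List[str] = field(default_factory=list)


def search_catalog(datasets: List[Dataset], term: str) -> List[Tuple[str, List[str]]]:
    """Return (dataset name, matching field names) for every dataset whose
    name or any field name contains `term`, ignoring case."""
    needle = term.lower()
    results = []
    for ds in datasets:
        matching_fields = [f for f in ds.fields if needle in f.lower()]
        if needle in ds.name.lower() or matching_fields:
            results.append((ds.name, matching_fields))
    return results


# Hypothetical catalog contents, invented for this example.
catalog = [
    Dataset("NYC_Trips", ["pickup_datetime", "fare_amount", "tip_amount"]),
    Dataset("SF_Incidents", ["category", "district"]),
]

for name, fields in search_catalog(catalog, "fare"):
    print(name, fields)
```

Because the match is a case-insensitive substring test, a partial term such as “fare” still finds a field with a longer name, which mirrors the situation in this walkthrough.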
Up to this point, we have talked about how, once you connect to your data, Dremio captures a lot of metadata from both physical and virtual datasets and makes it searchable for the data consumer. However, what if you want to “describe” these datasets or “flag” them so users can make sense of them?
Dremio allows users to create searchable tags, descriptions, and annotations on their datasets to enhance collaboration. This way, users do not depend on external sources to get more context about the data they are searching for or about to use.
Continuing with the use case for this tutorial, Dremio provides a “Catalog” option once we explore the dataset. At that level, the end user can perform several activities.
1. Add wiki content
Dremio is a tool that allows different teams to work together on the same data sources. A common request in this kind of collaborative environment is the ability to describe datasets in a way that is useful to different users and teams. Good descriptions make it easier for users to search for and find the data they need, and they solve the problem of knowledge being locked in people’s heads, where others cannot access or share it. This is in contrast to traditional systems where IT owns the names of fields and data sources, and data consumers have to keep their own notes in some external system like a spreadsheet, a wiki page, or sticky notes.
Now, within Dremio we have the ability to create comprehensive descriptions of any dataset by simply selecting the “add wiki content” button inside the “catalog” perspective. This feature makes it easier for users to understand the data sources and what is in them. Users can also use it to keep track of key information, such as what has been done to the datasource and how it is being used by other data consumers. An example of this would be templatized fields such as the following:
- Data source
- Last update
- Examples of reports that use this datasource, with screenshots.
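As a rough sketch of how such a template could be kept consistent across datasets, the snippet below renders the three fields listed above into plain text that can be pasted into the wiki editor. The `render_wiki` function and every value passed to it are hypothetical placeholders, not part of Dremio.

```python
from datetime import date
from typing import List


def render_wiki(data_source: str, last_update: date, reports: List[str]) -> str:
    """Build wiki text with the three templated fields listed above."""
    lines = [
        f"Data source: {data_source}",
        f"Last update: {last_update.isoformat()}",
        "Reports that use this datasource:",
    ]
    # One bullet per report; screenshots would be attached by hand.
    lines.extend(f"- {report}" for report in reports)
    return "\n".join(lines)


# All values below are placeholders invented for the example.
print(render_wiki(
    data_source="example source",
    last_update=date(2020, 1, 1),
    reports=["Example weekly report"],
))
```

Generating the text from one function keeps the field order and labels identical for every dataset, which makes the descriptions easier to scan.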
Here, I’ve added a description to a new data set that I uploaded from the San Francisco incidents database recorded by the police department.
The results are the following
2. Add tags
We have also added the ability to create tags for each dataset. Following the same procedure, simply head over to the “catalog” view and you will see the “tags” field.
Here we can add as many tags as needed to identify this dataset. I can add tags like “Taxi”, “NYC” and “Trips”, which will then serve as search terms that help other users find this particular dataset.
And the best part is that those tags are searchable. Let’s give it a try. If we head back to our working space, we will notice that the tags are listed next to the dataset we want to work with.
Clicking on any one of these tags automatically triggers a search in Dremio’s catalog and returns a list of all datasets that contain that particular string as a tag and/or in a field name.
In the image above three things took place. 1) I selected the “Ridershare” tag for my dataset. 2) Dremio immediately started a search for all datasets tagged with that string. 3) Dremio successfully provided me with the “NYC_Trips” dataset which contains the tag “Ridershare”.
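The tag lookup described above can be pictured as an inverted index from tag to datasets. The sketch below is a simplified illustration under stated assumptions, not Dremio’s internals: the `TagIndex` class and the “SF_Incidents” entry with its “Police” tag are hypothetical, and treating tags as case-insensitive is an assumption made for the example.

```python
from collections import defaultdict
from typing import Iterable, List


class TagIndex:
    """Toy inverted index from tag to dataset names (case-insensitive)."""

    def __init__(self) -> None:
        self._by_tag = defaultdict(set)

    def add_tags(self, dataset: str, tags: Iterable[str]) -> None:
        """Attach each tag to the dataset, normalizing tags to lowercase."""
        for tag in tags:
            self._by_tag[tag.lower()].add(dataset)

    def search(self, tag: str) -> List[str]:
        """Return the sorted names of all datasets carrying `tag`."""
        return sorted(self._by_tag.get(tag.lower(), set()))


index = TagIndex()
index.add_tags("NYC_Trips", ["Taxi", "NYC", "Trips", "Ridershare"])
index.add_tags("SF_Incidents", ["Police"])  # hypothetical dataset and tag

print(index.search("Ridershare"))
```

Looking up a tag is a single dictionary access, so clicking a tag can return every dataset that shares it without scanning the whole catalog.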
In this tutorial we talked about Dremio’s semantic layer features. Not only is data now searchable, but teams can also enhance collaboration by creating comprehensive wiki content and searchable tags for each dataset. We hope you enjoyed this tutorial. Stay tuned to learn more about how you can gain insights from your data faster using Dremio.