Getting Oriented to Dremio
Welcome to Dremio!
This tutorial introduces the basic concepts of Dremio and points you to resources that will help you now and in the future. We have also created a video if you would rather sit back and watch.
You can participate in this tutorial without installing Dremio. However, we think it will be easier to follow along if you have an installation you can access. Deploying Dremio is easy — check out the deployment page for more detail. If you have any questions throughout the tutorial, feel free to post them on the Dremio Community site.
Companies create data in a wide range of technologies, including relational databases, SaaS applications, NoSQL, Amazon S3, Apache Hadoop, and other systems. In order to make sense of all this data, companies use business intelligence (BI) tools like Tableau, Power BI, and Qlik, or data science products like Python and R. To make data from a variety of sources available to all these analytical technologies, companies build data warehouses, data marts, and data lakes, and move data with ETL, custom scripts, or data prep tools. With this approach, companies build enormous complexity and cost around their data, and create an environment where business users are entirely dependent on IT.
We built Dremio to help analysts, data scientists, and data engineers be more effective with data. Dremio provides an integrated, self-service interface for data lakes. Designed for BI users and data scientists, the data lake query engine incorporates capabilities for data acceleration, data curation, and data lineage — all on any source and delivered as a self-service platform. Standout features include:
- SQL on any data source, including optimized pushdowns and parallel connectivity to non-relational systems like S3 and HDFS.
- Accelerated data queries using data reflections, a highly optimized representation of source data that is managed as columnar, compressed Apache Arrow for efficient in-memory analytical processing, and Apache Parquet for persistence.
- Integrated data curation that is easy for business users, yet sufficiently powerful for data engineers.
- Cross-data source joins across multiple disparate systems and technologies, between relational and NoSQL, S3, HDFS, and more.
- Full visibility into data lineage from data sources through transformations, joins with other data sources, and sharing with other users.
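To make the idea of a cross-source join a little more concrete, here is a minimal Python sketch that composes the kind of SQL you might write once two sources are connected. The dataset paths (`crm.public.customers`, `lake.sales.orders`) and the `customer_id` column are hypothetical names chosen for illustration; they are not part of the sample data used in this tutorial.

```python
# Hypothetical dataset paths: "crm" stands in for a relational source and
# "lake" for a file-based source such as S3. Neither name comes from the tutorial.
CUSTOMERS = "crm.public.customers"
ORDERS = "lake.sales.orders"


def build_join_sql(left: str, right: str, key: str) -> str:
    """Return a SQL statement that joins two datasets on a shared column."""
    return (
        f"SELECT l.{key}, COUNT(*) AS order_count\n"
        f"FROM {left} AS l\n"
        f"JOIN {right} AS r ON l.{key} = r.{key}\n"
        f"GROUP BY l.{key}"
    )


sql = build_join_sql(CUSTOMERS, ORDERS, "customer_id")
print(sql)
```

The point of the sketch is that the join reads like ordinary SQL: both datasets are addressed by path, regardless of which underlying system holds them.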
If you’d like to know more about Dremio’s design, check out our Dremio Architecture Guide.
The first time you access Dremio, you’ll be asked to create an administrator account:
Once the admin account is set up, users will be asked to log in to Dremio:
Once you’re authenticated, you will see the home screen. If this is a new installation, it will look fairly sparse:
Let’s explore the home screen, starting with the bottom left, “Data Lakes” and “External Sources”:
Dremio connects to these sources to access datasets. If you already have a source in mind, you can click on the + sign to connect to a new data lake source or to an external source at any time. You’ll see a prompt that will allow you to set up different kinds of sources from your data lake, including table stores and file stores:
This tutorial uses a sample dataset stored on Amazon S3 that’s easy to connect to. Simply click the “Sample Source” button. You should now see a Samples source listed under your sources, and a samples.dremio.com bucket listed as a folder on the right side of the screen:
If you double-click samples.dremio.com, you’ll see the public files used in our next tutorial, which looks at how to work with them:
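Because the folder name samples.dremio.com itself contains dots, each part of a dataset path generally needs to be quoted when you refer to it in SQL. The Python sketch below shows one way to build such a path. The `quote_path` helper is hypothetical, the file name `zips.json` is an assumption about what the bucket contains, and the query assumes the file has already been set up as a dataset.

```python
def quote_path(*parts: str) -> str:
    """Join path components into a dotted SQL identifier, quoting each one.

    Double quotes inside a component are escaped by doubling them, which is
    the standard SQL rule for delimited identifiers.
    """
    return ".".join('"' + part.replace('"', '""') + '"' for part in parts)


# "Samples" and "samples.dremio.com" come from the tutorial;
# "zips.json" is an assumed file name used only for illustration.
path = quote_path("Samples", "samples.dremio.com", "zips.json")
sql = f"SELECT * FROM {path} LIMIT 10"
print(sql)
# prints: SELECT * FROM "Samples"."samples.dremio.com"."zips.json" LIMIT 10
```

Quoting every component is a conservative choice: it is only strictly required for names with special characters, but it avoids having to decide case by case.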
Continuing our Dremio orientation, let’s create a “space.” Spaces allow users to collaborate around virtual datasets. For example, you might create a space for a project, a team, or a department. Let’s create a new space called “Dremio101.” Click on the “New Space” button on the left side of the screen (alternatively, you can click on the + sign on the Spaces panel):
By default, new spaces are public and visible to everyone. You can configure which users have access to a space, either now or after the space is created. For now, let’s leave this space public.
Click “Save” to finish creating the “Dremio101” space. The screen should now show “Dremio101” as a space:
At the moment the space is empty, as shown by the 0 next to the name of the space. Users can save their virtual datasets in spaces they have “Can Edit” access to. We’ll take a closer look at these capabilities in our next tutorial.
Now, just above spaces, you can see your user name next to a home icon.
This is your home space. Every Dremio user has a private home space where they can upload files or store virtual datasets that they are working on, which are not visible to other users. Datasets in your home space are like any other dataset in Dremio — they can be queried, joined to other datasets, analyzed with BI tools or data science tools, and more.
Now let’s take a closer look at the toolbar at the top left of the screen:
- Datasets include sources, spaces, and your home space.
- Jobs are units of work processed by Dremio. For example, when a user issues a query to a dataset, this request is processed as a job. For each job, Dremio tracks details such as who issued the request, metadata about the results, any errors, and other useful information. At this point, we haven’t set up any datasets, so there are no jobs in the system yet.
- Search allows you to quickly find datasets. As you connect data sources, Dremio indexes metadata like the names of tables, columns, and fields. This makes it easy to find a dataset across all of your different sources and spaces.
- New Query opens Dremio’s query editor. You can write full SQL against any of the datasets in Dremio and see results.
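Queries do not have to come from the query editor. As a rough sketch, the Python below builds (but does not send) an HTTP request that would submit SQL to Dremio, where it would then appear as a job. The endpoint path `/api/v3/sql`, the `_dremio` token prefix, and the default port 9047 reflect our understanding of Dremio's REST API and may differ between versions, so treat them as assumptions and check the REST API documentation for your installation. The `build_sql_request` helper is a hypothetical name.

```python
import json
import urllib.request

# Assumed base URL; 9047 is the usual default port for a local install.
BASE_URL = "http://localhost:9047"


def build_sql_request(sql: str, token: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request that submits SQL as a job."""
    body = json.dumps({"sql": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v3/sql",  # assumed endpoint path
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"_dremio{token}",  # assumed token format
        },
        method="POST",
    )


request = build_sql_request("SELECT 1", token="<your-token>")
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Nothing is sent over the network here. Once a request like this is sent to a running installation, the resulting work should show up under Jobs alongside queries issued from the UI.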
Next, let’s take a look at the buttons on the upper right:
- Help gives you access to the Dremio Community site, documentation, and other resources.
- Admin gives you access to Dremio’s administration menus.
- Your name gives you access to your personal profile.
Now that you have a sense of how Dremio works, it’s time to start working with some data. Check out the next tutorial, Working with Your First Dataset.