

PyDremio: The Unofficial Python Client for the Dremio REST API

Mar 24, 2020

Introduction


Time goes by so quickly; it feels like just a few weeks ago that I shared how to use Dremio and Python to process and visualize IoT data. Continuing to experiment with Dremio as part of the Professional Services team has given me the opportunity to come up with ways to enhance the experience of our users. Today I want to share the details of PyDremio, the unofficial Python client for Dremio’s REST API.

This project gives DevOps engineers and administrators a Pythonic abstraction over all of Dremio’s REST endpoints, which makes it easier to script tasks such as production deployments, security management, and auditing. It also gives data scientists a way to use Dremio features like data lineage and collaboration tools directly from the Python console, and it makes it easy to interact with Dremio over Arrow Flight.


Features


Consistent and Pythonic API around Dremio’s REST API / Full Support for Dremio’s REST API

The PyDremio API provides a Pythonic API around the Dremio REST endpoints. The user can manipulate all parts of the Dremio system via the catalog object, including creating and editing sources, adding virtual datasets (VDSs), refreshing metadata or accelerations after an ETL job, and querying data. The two main users of the library are the data scientist and the DevOps engineer. Let’s look at what each group can do:

Data Scientist

PyDremio makes an excellent addition to the data science toolkit. It exposes a dict-like object that lets users access any part of the semantic layer as a Python object. These objects can be used to query, create, update, or delete VDSs, reflections, sources, and ACLs in Dremio, which makes it easy for a data scientist to move around Dremio’s semantic layer. Once the right dataset is found, the user can query it to bring data back into a notebook as a Pandas DataFrame. Any work done on the dataset can be ‘saved’ back by creating a new VDS.

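from dremio_client import init

client = init()  # connect using the settings in the dremio_client config file
df = client.data.Business.Transportation.NYC_Trips.query()
# perform dataframe operations that can be translated back into SQL
sql = 'SELECT * FROM Business.Transportation.NYC_Trips'  # placeholder: use the SQL equivalent of the operations above
client.data.Business.Transportation.insert("vds", 'NYC_Trips_Analysis', sql=sql)
df_analysis = client.data.Business.Transportation.NYC_Trips_Analysis.query()
# df_analysis will be identical to the transformed version of df above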

DevOps Specialist

The DevOps specialist can interact with Dremio through PyDremio in three ways:

  • The same interface as the Data Scientist uses: this allows interaction with Python objects which are directly mapped to the entities in Dremio
  • The raw REST endpoints translated into Python calls: this allows for direct manipulation of Dremio using JSON
  • The command line interface (see below), which enables working with Dremio directly from the command line
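
To give a feel for what the raw REST layer wraps, here is a minimal sketch that only builds the HTTP requests and never sends them. It is not PyDremio code. The paths /apiv2/login and /api/v3/catalog and the _dremio token prefix reflect my understanding of Dremio's REST API and should be checked against your Dremio version; the host, port and function names are assumptions made for illustration.

```python
import json
import urllib.request


def build_login_request(base_url, username, password):
    """Build (but do not send) the POST request that asks Dremio for an auth token."""
    body = json.dumps({"userName": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/apiv2/login",  # assumed login path
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_catalog_request(base_url, token):
    """Build (but do not send) the GET request that lists the top-level catalog."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/v3/catalog",  # assumed catalog path
        headers={"Authorization": "_dremio" + token},  # assumed token prefix
        method="GET",
    )


if __name__ == "__main__":
    # Hypothetical host and token, for illustration only.
    request = build_catalog_request("http://localhost:9047", "example-token")
    print(request.get_method(), request.full_url)
```

Passing either request to urllib.request.urlopen would perform the call; the response body is the JSON that the higher-level Python objects are built from.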

We have also included a set of utilities for common operations, such as sending bulk SQL statements to Dremio asynchronously and iterating through the catalog, along with tools to help with tasks like continuous integration.
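
As a rough illustration of the bulk SQL idea, the sketch below runs a batch of statements concurrently with a thread pool from the standard library. This is a generic pattern, not PyDremio's own utility: run_bulk and fake_run_query are hypothetical names, and in practice you would pass in whatever callable actually submits a statement to Dremio.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_bulk(statements, run_query, max_workers=4):
    """Run each SQL statement through run_query concurrently.

    Returns a dict mapping each statement to its result, or to the
    exception it raised, so one bad statement does not abort the batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_query, sql): sql for sql in statements}
        for future in as_completed(futures):
            sql = futures[future]
            try:
                results[sql] = future.result()
            except Exception as exc:
                results[sql] = exc
    return results


def fake_run_query(sql):
    """Hypothetical stand-in for a real query call; it only echoes the statement."""
    if "bad" in sql:
        raise ValueError("query failed: " + sql)
    return "ok: " + sql


if __name__ == "__main__":
    for sql, outcome in run_bulk(["select 1", "select bad"], fake_run_query).items():
        print(sql, "->", outcome)
```

Collecting exceptions instead of raising them is a design choice that suits deployment scripts, where you usually want a full report of which statements failed.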

The kinds of things that the DevOps specialist is able to do with PyDremio include:

  • Continuous integration/Continuous deployment of Virtual Datasets and Data Reflections (DR)
  • Audit security settings for catalog entities for reporting
  • Query datasets and pipe results into other command line tools
  • Clone clusters and parts of clusters for migration, DR considerations, etc.
  • Run bulk queries asynchronously on Dremio from the command line or Python

More on these use cases coming in the future!


Enhanced query performance by leveraging Arrow Flight


PyDremio uses the PyArrow library to query data from Dremio through Flight, the RPC mechanism recently added to Apache Arrow. If you have installed the Flight plugin for Dremio, PyDremio uses it transparently when running queries, which provides roughly a 20x increase in speed over ODBC. If Flight is not available, PyDremio falls back to ODBC and, failing that, to the REST API.
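
The fallback order described above can be sketched as a simple "try each transport in turn" loop. This is only an illustration of the pattern, not PyDremio's implementation: the three backend functions are hypothetical stand-ins that simulate a missing Flight plugin and a missing ODBC driver.

```python
def query_with_fallback(sql, backends):
    """Try each (name, callable) backend in order.

    Returns (name, result) from the first backend that succeeds and
    raises RuntimeError if every backend fails.
    """
    errors = []
    for name, run in backends:
        try:
            return name, run(sql)
        except Exception as exc:  # a missing driver or plugin surfaces here
            errors.append("%s: %s" % (name, exc))
    raise RuntimeError("all backends failed: " + "; ".join(errors))


def flight_unavailable(sql):
    """Hypothetical backend simulating an environment without Flight."""
    raise ImportError("Flight is not available")


def odbc_unavailable(sql):
    """Hypothetical backend simulating an environment without an ODBC driver."""
    raise ConnectionError("no ODBC driver configured")


def rest_query(sql):
    """Hypothetical backend standing in for a REST query; it echoes the SQL."""
    return [{"sql": sql}]


# Preferred transports first, slowest but most widely available last.
BACKENDS = [("flight", flight_unavailable), ("odbc", odbc_unavailable), ("rest", rest_query)]

if __name__ == "__main__":
    print(query_with_fallback("select 1", BACKENDS))
```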


Facilitate data discovery in Jupyter Notebooks using autocomplete


One of the main goals I hoped to achieve was full interactivity of the API. In the example below, catalog entities are fetched from Dremio as the user presses the TAB key. Once a dataset is selected, the user is presented with a representation of the entity and, if available, its wiki page (virtual dataset, data source, etc.). From there they can query it, refresh its metadata, edit the source definition, and much more.
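
For readers curious how this kind of lazy, TAB-driven discovery can work in Python, here is a toy sketch. It is not PyDremio's implementation: CatalogNode, fake_fetch and the in-memory FAKE_CATALOG are hypothetical, and the sketch relies on the fact that interactive shells generally consult dir() when completing attribute names.

```python
class CatalogNode(object):
    """Toy catalog entry whose children are fetched lazily and exposed as attributes."""

    def __init__(self, path, fetch_children):
        self._path = tuple(path)
        self._fetch_children = fetch_children  # callable: path tuple -> list of child names
        self._children = None  # populated on first access

    def _load(self):
        # Fetch the children once, the first time they are needed.
        if self._children is None:
            names = self._fetch_children(self._path)
            self._children = {
                name: CatalogNode(self._path + (name,), self._fetch_children)
                for name in names
            }
        return self._children

    def __dir__(self):
        # Completion asks for dir(), so this is where the lazy fetch is triggered.
        return sorted(set(object.__dir__(self)) | set(self._load()))

    def __getattr__(self, item):
        # Only reached when normal attribute lookup fails, i.e. for child names.
        if item.startswith("_"):
            raise AttributeError(item)
        try:
            return self._load()[item]
        except KeyError:
            raise AttributeError(item)


# Hypothetical in-memory catalog standing in for calls to the server.
FAKE_CATALOG = {
    (): ["Business"],
    ("Business",): ["Transportation"],
    ("Business", "Transportation"): ["NYC_Trips"],
}


def fake_fetch(path):
    return FAKE_CATALOG.get(path, [])


root = CatalogNode((), fake_fetch)
```

With this in place, typing root.Business. and pressing TAB in an interactive session would list Transportation, and the list of children is only requested at that moment.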

 

Easy to use cross platform CLI


PyDremio uses the Click library to generate its command line interface, which exposes all of the REST API calls. To see the list of available calls, run dremio_client --help in your terminal.


The output of the dremio_client tool is JSON formatted to be consumed by other command line tools. For example, we use jq and awk to parse the status of reflections:

$ dremio_client --config . query --sql 'select * from sys.reflections' 2> /dev/null|jq '.[]|.name + "," + .status + "," + (.num_failures | tostring)' -r|awk -F, '{arr[$2]++}END{for (a in arr) print a, arr[a]}'
DISABLED 3
FAILED 2
CAN_ACCELERATE 18

This can be used, for example, to feed automated alerts on Dremio’s reflection status or to automate administrative functions.
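
If you would rather do the counting in Python than in jq and awk, the same summary can be sketched as below. It assumes the query output is a JSON array of objects with name and status fields, as the jq expression above implies; the function names and the sample rows are made up for illustration.

```python
import json
from collections import Counter


def count_reflection_statuses(json_text):
    """Count reflections per status from a JSON array of reflection rows."""
    rows = json.loads(json_text)
    return Counter(row["status"] for row in rows)


def failed_reflections(json_text):
    """Return the names of reflections whose status is FAILED."""
    rows = json.loads(json_text)
    return [row["name"] for row in rows if row["status"] == "FAILED"]


# Illustrative sample only; real rows come from the sys.reflections query.
SAMPLE = json.dumps([
    {"name": "r1", "status": "CAN_ACCELERATE", "num_failures": 0},
    {"name": "r2", "status": "FAILED", "num_failures": 3},
    {"name": "r3", "status": "CAN_ACCELERATE", "num_failures": 0},
])

if __name__ == "__main__":
    for status, count in count_reflection_statuses(SAMPLE).items():
        print(status, count)
```

An alerting script could pipe the CLI output into these functions and raise an alert whenever failed_reflections returns a non-empty list.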


Support for Python versions 2.7 through 3.7


Because not every environment is the same, we have chosen to support every version of Python all the way back to 2.7. This allows users to run PyDremio in restricted environments where the latest versions of Python have not been deployed yet. We also provide a version with fewer dependencies for restrictive server deployments. The library is installable on all operating systems via pip or conda. Because we at Dremio are passionate about open source, we have released it under the Apache 2.0 license and have made every effort to conform to Python packaging and release conventions. To install:

Option 1: Full client installation via pip (recommended for data science use case)

$ pip install dremio_client[full]  

Option 2: Limited client installation via pip (recommended for devops and interacting with only the REST API)

$ pip install dremio_client

Option 3: Full client installation via conda

$ conda install -c rymurr dremio_client

What’s Next?


For a full technical breakdown, check out the main documentation for this project. There you will find the system requirements, installation steps, usage instructions, and much more.

PyDremio has a lot of potential: it is an ideal tool for modern data science teams to explore and learn about their data via Dremio. I’m very excited to continue adding features, such as integration with ibis, which will allow queries to Dremio via a Pandas-like lazy syntax. We will also add a search capability for the data catalog object and the ability to write data back to Dremio via Flight. Finally, we will add more administrative tools to help monitor and view job logs, ACLs, etc. There is still a lot of room for growth, and since this is an open source project, contributions in the form of GitHub issues or pull requests are welcome.

To learn more about Dremio, visit our tutorials and resources. If you would like to experiment with Dremio in your own virtual lab, check out Dremio University, and if you have any questions, visit our community forums, where we are all eager to help.
