February 9, 2021

Eliminating Data Exports for Data Science with Apache Arrow Flight

Prateek M. · Engineering Intern, Dremio

Announcing the General Availability of Apache Arrow Flight Client and Server Support in Dremio

Today we announced the general availability of Apache Arrow Flight support in Dremio, enabling a more than 15x performance improvement over ODBC. With this open source innovation, data scientists can now train their models on live datasets and avoid the copy operations that data exports require.

The Challenges with Supporting Data Science and BI Workloads at Data Lake Scale

ODBC and JDBC Have Become a Bottleneck for Data Science Use Cases
The ODBC and JDBC standards were developed in the early 1990s, when infrastructure was limited and networking was orders of magnitude slower. They were designed for transporting small result sets from databases, not datasets at data lake scale. As record-level client APIs, ODBC and JDBC are typically single threaded and require records to be serialized and deserialized, constraining throughput to a small fraction of what modern networks (~10 Gbps) can carry.

Data Exports Lead to Security and Governance Problems
Data scientists need fast access to millions of records to accurately train their models. Because ODBC and JDBC are bottlenecks, they extract or copy datasets onto their local machines. This is bad practice because:

  • Local copies become stale, leading to inaccurate or confusing results from data models.
  • It breaks organizational security and governance policies. Permissions and restrictions do not apply to exported datasets, preventing the organization from complying with regulations such as GDPR. For example, when information about a person must be deleted under “the right to be forgotten,” IT has no way of knowing that the person’s information lives on in various exports of the data.

Client APIs May Be Standard, but Protocols and Drivers Are Proprietary
A different connector is typically required for each database, which has led to a proliferation of separate, custom drivers. This complicates the development and maintenance of applications that must connect to different systems, due to compatibility, reliability and licensing issues. In addition, developers of distributed systems spend a lot of time building and maintaining internal remote procedure call (RPC) frameworks so that daemons running on different nodes can exchange data.

Introduction to Apache Arrow Flight

Apache Arrow, which was co-created by Dremio, is an established industry standard for in-memory data representation, with over 15 million downloads every month. Arrow Flight is a subproject of Apache Arrow that provides a standard RPC data transport: Arrow buffers sent over gRPC. Its goal is to enable seamless client-to-server/cluster and cluster-to-cluster communication via an open, standard protocol. For client-to-server/cluster communication, it eliminates the need for ODBC or JDBC drivers when talking to a database that exposes an Arrow Flight endpoint.

Arrow Flight is designed for modern high-performance bulk data transfer for analytical workloads at data lake scale. It avoids serialization/deserialization penalties across tools when both parties are built on Apache Arrow (e.g., Dremio and Python data science libraries are now built on Arrow).

Arrow Flight is particularly beneficial for data science use cases, which often involve accessing arbitrarily large datasets. BI use cases, on the other hand, may not benefit as much because they involve much smaller volumes of data (just what is required for a chart or dashboard).

Arrow Flight is cross-platform and supports languages such as Python, Java and C++, with more to come. Data scientists using popular libraries such as pandas can build their machine learning (ML) models with the Arrow Flight Python client and train them directly against an Arrow Flight server endpoint, without making an explicit copy of the data.
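
As a rough sketch of what that looks like in practice (assuming a Dremio Flight endpoint on the default localhost:32010, hypothetical credentials, and a hypothetical training_data table; the walkthrough below covers each step in detail):

from pyarrow import flight

# Connect and authenticate against the Flight server endpoint.
client = flight.FlightClient("grpc+tcp://localhost:32010")
token = client.authenticate_basic_token("user", "password")  # placeholder credentials
options = flight.FlightCallOptions(headers=[token])

# Execute a query and pull the live result set straight into a pandas
# DataFrame -- no CSV export, no stale local extract.
info = client.get_flight_info(
    flight.FlightDescriptor.for_command("SELECT * FROM training_data"), options)
df = client.do_get(info.endpoints[0].ticket, options).read_pandas()
# df now feeds directly into model training (e.g., scikit-learn).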


Performance Gains with Arrow Flight

Arrow Flight supports parallel access to data, which means data does not have to be read from a single coordinator node. A client can call GetFlightInfo() to obtain the list of all serviceable endpoints (nodes), then issue DoGet() requests to those endpoints in parallel and pull data down at much higher aggregate throughput.
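
Here is a minimal sketch of that pattern in Python, fanning DoGet() calls out across endpoints with a thread pool. It assumes an authenticated client, options and sqlquery as constructed in the walkthrough below, that each endpoint advertises at least one location, and it elides per-endpoint authentication for brevity:

from concurrent.futures import ThreadPoolExecutor

import pyarrow as pa
from pyarrow import flight

def fetch_endpoint(endpoint):
    # Each endpoint may be served by a different node; connect to its first
    # advertised location and pull that portion of the result set.
    ep_client = flight.FlightClient(endpoint.locations[0])
    try:
        return ep_client.do_get(endpoint.ticket, options).read_all()
    finally:
        ep_client.close()

flight_info = client.get_flight_info(
    flight.FlightDescriptor.for_command(sqlquery), options)

# Fetch all endpoints in parallel and stitch the record batches together.
with ThreadPoolExecutor() as pool:
    tables = list(pool.map(fetch_endpoint, flight_info.endpoints))
result = pa.concat_tables(tables)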

We ran a simple SELECT * benchmark from both SQL and Python clients. Here are the results:

[Benchmark charts: query throughput, ODBC vs. Arrow Flight, for SQL and Python clients]

Even without parallelization, we are able to achieve greater than 15x performance improvements over ODBC for reading the same number of Parquet records.

Arrow Flight on Dremio

Starting with the Dremio December release and Apache Arrow 3.0, clients can connect to an Arrow Flight server endpoint on Dremio using the Arrow Flight client libraries available in Arrow 3.0. In this release, we have also decoupled authentication from the handshake process and made it OAuth 2.0 compliant.

Client libraries are available in Java, Python and C++ as part of the Apache Arrow project. If you are using Java, you will need Java SE 8 or above. If you prefer Python, the minimum requirement is Python 3.


To learn more about how you can eliminate data transfer bottlenecks with Arrow Flight, check out this webinar with Wes McKinney, co-creator of Apache Arrow, and Matt Topol, Principal Software Engineer at FactSet.



Getting Started

The following is a quick walk-through of a sample Python client application to help you work with Arrow Flight (3.0) server endpoints.

This lightweight application connects to the Arrow Flight server endpoint on Dremio. It requires a username and password for authentication. By default, the hostname is localhost and the port is 32010, but both can be changed. A TLS option can also be provided to establish an encrypted connection.

PRE-CONDITION: The sample application is developed in Python 3 and requires pyarrow 3.0.0 and pandas.

STEP 1: Specify whether you’re using an encrypted TLS or unencrypted connection. The default is an unencrypted TCP connection.

# Default connection type is an unencrypted TCP connection.
scheme = "grpc+tcp"
connection_args = {}

# TLS is supported for encryption. Certificate validation is optional,
# since some Flight servers use a self-signed certificate and do not
# distribute a certificate for clients to use.
if tls:
    # Connect to the server endpoint with an encrypted TLS connection.
    scheme = "grpc+tls"

# If optional certificate validation is also requested
if certs:
    # TLS root certificates are passed in the connection arguments.
    with open(certs, "rb") as root_certs:
        connection_args["tls_root_certs"] = root_certs.read()

STEP 2: Create a Flight client with client properties.

# To optimize the connection, two workload management (WLM) settings may be
# provided as headers upon initial authentication with the Dremio Flight
# server endpoint:
# - routing-tag
# - routing-queue
initial_options = flight.FlightCallOptions(headers=[
    (b'routing-tag', b'test-routing-tag'),
    (b'routing-queue', b'Low Cost User Queries')
])

client_auth_middleware = DremioClientAuthMiddlewareFactory()
client = flight.FlightClient("{}://{}:{}".format(scheme, hostname, flightport),
                             middleware=[client_auth_middleware],
                             **connection_args)
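
The DremioClientAuthMiddlewareFactory referenced above is defined in the sample application rather than in pyarrow itself; the full version is in the sample on the Dremio GitHub repository. A minimal sketch of it, using pyarrow.flight's client middleware API to capture the bearer token the server returns, looks roughly like this:

class DremioClientAuthMiddlewareFactory(flight.ClientMiddlewareFactory):
    # Factory that creates the middleware below and stores the credentials
    # it captures from the server's response headers.
    def __init__(self):
        self.call_credential = None

    def start_call(self, info):
        return DremioClientAuthMiddleware(self)


class DremioClientAuthMiddleware(flight.ClientMiddleware):
    # Extracts the bearer token from the authorization header returned by
    # the Dremio Flight server so it can be reused on subsequent calls.
    def __init__(self, factory):
        self.factory = factory

    def received_headers(self, headers):
        if 'authorization' in headers:
            self.factory.call_credential = [
                (b'authorization', headers['authorization'][0].encode('utf-8'))
            ]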

STEP 3: Handle authentication and access tokens.

# Authenticate with the server endpoint and keep the bearer token for
# further interactions with the server.
bearer_token = client.authenticate_basic_token(username, password,
                                               initial_options)
print('[INFO] Authentication was successful')

STEP 4: Upon successful authentication, construct a FlightDescriptor for the query result set.

flight_desc = flight.FlightDescriptor.for_command(sqlquery)

STEP 5: Retrieve the schema of the result set.

# Pass the bearer token on every subsequent request.
options = flight.FlightCallOptions(headers=[bearer_token])
schema = client.get_schema(flight_desc, options)

STEP 6: Get the FlightInfo message to retrieve the ticket corresponding to the query result set.

flight_info = client.get_flight_info(
    flight.FlightDescriptor.for_command(sqlquery), options)

STEP 7: Retrieve the result set as a stream of Arrow record batches.

reader = client.do_get(flight_info.endpoints[0].ticket, options)
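
From here, the stream can be materialized without any intermediate export; for example, using read_pandas() from pyarrow's FlightStreamReader API:

# Materialize the stream of Arrow record batches as a pandas DataFrame.
df = reader.read_pandas()
print(df.head())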

You can download the Python and Java sample applications from the Dremio GitHub repository. The following CLI command demonstrates the simplest way to invoke the application, passing in the required command line arguments.

$ example.py [-h] [-host HOSTNAME] [-port FLIGHTPORT] -user USERNAME 
-pass PASSWORD [-query SQLQUERY] [-tls] [-certs TRUSTEDCERTIFICATES]
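
A minimal argument parser wiring up those flags might look like the following (a sketch; the actual sample's parser may differ):

import argparse

parser = argparse.ArgumentParser(description="Dremio Arrow Flight sample client")
parser.add_argument('-host', default='localhost', dest='hostname',
                    help='Dremio coordinator hostname')
parser.add_argument('-port', default=32010, type=int, dest='flightport',
                    help='Dremio Arrow Flight server port')
parser.add_argument('-user', required=True, dest='username',
                    help='Dremio username')
parser.add_argument('-pass', required=True, dest='password',
                    help='Dremio password')
parser.add_argument('-query', dest='sqlquery',
                    help='SQL query to run against the endpoint')
parser.add_argument('-tls', action='store_true',
                    help='Enable an encrypted TLS connection')
parser.add_argument('-certs', dest='certs',
                    help='Path to trusted certificates for TLS validation')
args = parser.parse_args()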

Learn More

Check out these resources that walk you through the basics and also provide deep technical details about Apache Arrow Flight and its usage:

  1. Understanding Apache Arrow Flight
  2. Apache Arrow Flight project
  3. Magpie: Python at Speed and Scale Using Cloud Backends
  4. Dremio documentation on using Arrow Flight
