The Gandiva Initiative for Apache Arrow Offers a Breakthrough in Performance and Efficiency for Analytics, Machine Learning, and Data Science
SANTA CLARA, CA - November 29, 2018 - Dremio, the Data-as-a-Service Platform company, announced today it has donated an LLVM-based execution kernel called the Gandiva Initiative for Apache Arrow to The Apache Software Foundation, where the project will continue to grow and thrive as part of the Apache Arrow community.
According to Wes McKinney, creator of Python Pandas, and member of the Apache Arrow and Apache Parquet Project Management Committees (PMCs), “Data scientists and engineers work with multiple languages and systems, making it critical that data flow seamlessly and efficiently between these environments. Apache Arrow has made enormous progress since its inception, and Gandiva—an analytical expression compiler—is a natural next step in the growth of the ecosystem. This code donation provides huge opportunities for optimization with modern hardware platforms allowing for dramatically better performance for a wide range of workloads on Arrow data. It’s a great benefit for the Apache Arrow community.”
Gandiva leverages the LLVM Project, a popular open source compiler framework, to significantly improve the speed and efficiency of performing in-memory analytics using Apache Arrow. It provides up to 100x greater efficiency on many types of queries and operations, translating into lower operational costs, a better user experience, and the ability to support more workloads on existing hardware.
“Gandiva is aimed at an industry-wide pain point and our goal is for it to gain widespread adoption that will have a major impact on the analytics and data science communities,” said Jacques Nadeau, co-founder and CTO of Dremio, and PMC Chair of Apache Arrow. “Dremio’s goal with this donation of Gandiva is to get an entire class of products and projects to take better advantage of vector processing – beyond what the company is developing. As part of the Apache Arrow project, Gandiva is now available to a wide range of projects including Apache Spark, Pandas, and Node.js.”
Nadeau added, “The Apache Arrow project has the support and participation of a lot of companies in the data and analytics space and is endorsed by NVIDIA through its new RAPIDS open source libraries, which has adopted Arrow as its official columnar data representation format.”
Gandiva provides significant performance improvements for low-level operations on Arrow columnar memory. It introduces a cross-platform data processing engine for Arrow, designed to be used in many contexts. In Dremio 3.0, Gandiva powers the SQL execution engine. Gandiva’s LLVM-based compiler, combined with Arrow’s efficient columnar representation, enables Dremio to take full advantage of CPU vectorization for many types of workloads.
By specializing computations for Arrow columnar memory using LLVM, low-level operations such as sorts, filters, and projections can be highly optimized for specific runtime environments, improving resource utilization and making analytical workloads faster and less costly to run.
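To make the idea of specializing an expression for columnar data more concrete, the following is a minimal sketch in plain Python. It is an analogy only and does not use Gandiva or its API: the names `batch`, `compile_projection`, and `compile_filter` are hypothetical, and Python's built-in `compile` stands in for LLVM code generation. The point it illustrates is that an expression is turned into a dedicated kernel once, and that kernel then runs over whole columns rather than being re-interpreted for every row.

```python
from array import array

# Hypothetical columnar batch: each column is a contiguous typed array.
batch = {
    "price": array("d", [10.0, 25.0, 7.5, 40.0]),
    "qty": array("q", [3, 1, 10, 2]),
}


def _build(template, expr, columns):
    """Generate source for one kernel and compile it a single time."""
    params = ", ".join(f"col_{name}" for name in columns)
    row = ", ".join(columns) + ","  # trailing comma keeps one-column tuples valid
    source = template.format(params=params, row=row, expr=expr)
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["kernel"]


def compile_projection(expr, columns):
    """Return a kernel that computes `expr` for every row of the columns."""
    template = (
        "def kernel({params}):\n"
        "    return [{expr} for ({row}) in zip({params})]\n"
    )
    return _build(template, expr, columns)


def compile_filter(expr, columns):
    """Return a kernel that yields the row indices where `expr` is true."""
    template = (
        "def kernel({params}):\n"
        "    return [i for i, ({row}) in enumerate(zip({params})) if {expr}]\n"
    )
    return _build(template, expr, columns)


# Compile each expression once, then apply it to whole columns.
project_total = compile_projection("price * qty", ["price", "qty"])
keep_expensive = compile_filter("price > 9.0", ["price"])

totals = project_total(batch["price"], batch["qty"])
selected = keep_expensive(batch["price"])

print(totals)    # [30.0, 25.0, 75.0, 80.0]
print(selected)  # [0, 1, 3]
```

In a real engine the generated kernel would be native machine code tuned for the host CPU, operating directly on Arrow buffers. The sketch mirrors only the compile-once, run-per-column structure.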