What Is Apache Parquet?

Apache Parquet is an open source file format that stores data in columnar format (as opposed to row format). As a columnar data storage format, it offers several advantages over row-based formats for analytical workloads. Your choice of data format can have significant implications for query performance and cost, so it’s important to understand the differences between Apache Parquet and other file formats.


Benefits of Apache Parquet

In simple terms, row-based formats such as CSV and JSON are (mostly) readable by humans, whereas column-based formats are optimized for computers. As a columnar file format, Apache Parquet can be read much more efficiently and cost-effectively than other formats, making it an ideal file format for big data, analytics, and data lake storage.

High Performance

In Apache Parquet, each column’s values are stored together on disk. Because analytical queries often need only a subset of columns, this reduces the amount of data that has to be read.
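As a rough illustration, here is a minimal sketch using the pyarrow library (one common way to read Parquet from Python; the file and column names are hypothetical):

```python
import pyarrow.parquet as pq

# Read only the two columns the query needs; the other columns in the
# file are never loaded from disk.
table = pq.read_table("events.parquet", columns=["user_id", "event_time"])
print(table.num_rows, table.column_names)
```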

Parquet files also carry statistics about the data stored within the file, such as the minimum and maximum values for each column within a row group (a horizontal segment of the file) and how many rows that row group contains. This allows engines to skip entire row groups, or entire files, that cannot contain the data a query is looking for.
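These statistics live in the file’s metadata and can be inspected directly. The sketch below uses pyarrow with a hypothetical file name (note that statistics may be absent if the writer did not record them):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("events.parquet")
meta = pf.metadata
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    col = rg.column(0)        # metadata for the first column's chunk in this row group
    stats = col.statistics    # may be None if statistics were not written
    print(i, col.path_in_schema, rg.num_rows, stats.min, stats.max)
```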

Together, these two features significantly improve performance and lower overall costs, because far less data has to be read.

Efficient Compression

Apache Parquet supports highly efficient compression. Many compression codecs are more effective when the data they compress is similar. Because Parquet stores each column’s values together, similar data gets compressed together, resulting in greater efficiency.
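With pyarrow (one common Parquet library for Python; the table and file names below are illustrative), the codec is a single parameter at write time:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"country": ["US", "US", "DE"], "amount": [10, 12, 9]})

# Snappy is a common default (fast, moderate ratio); zstd usually
# compresses harder at some extra CPU cost.
pq.write_table(table, "sales_snappy.parquet", compression="snappy")
pq.write_table(table, "sales_zstd.parquet", compression="zstd")
```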

Compressed data is more cost-effective to store than raw data, so using Parquet can reduce the cost of storing large data sets. Compressed data is also more performant to read than uncompressed data when I/O is the bottleneck, which can often be the case in analytical workloads, so compressed data in Parquet files can improve performance as well.

Industry Standard

Apache Parquet is the industry-standard columnar file format. Parquet files can be read by just about every engine and tool out there. This gives you the interoperability to use multiple tools today, and the peace of mind that you’ll be able to leverage new engines and tools that come out tomorrow, since they will most likely support Parquet.


When to Consider an Alternative to Apache Parquet

Apache Parquet is the industry-standard file format for analytical workloads. However, certain workloads and requirements are better served by a different format.

Human Readability

Parquet is a binary-based (rather than text-based) file format optimized for computers, so Parquet files aren’t directly readable by humans. You can’t open a Parquet file in a text editor the way you might with a CSV file and see what it contains. There are utilities that convert the binary representation to a text-based one, such as parquet-tools, but it’s an extra step. 
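If parquet-tools isn’t handy, a few lines of Python with pyarrow (not part of Parquet itself, just one common library; the file name is a placeholder) can serve the same purpose:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("events.parquet")
print(pf.schema_arrow)                    # column names and types
print(pf.read_row_group(0).slice(0, 5))   # peek at the first few rows
```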

Write Overhead

Parquet files can be slower to write than row-based file formats, primarily because the writer must organize data by column and compute statistics and other metadata about the file’s contents. For analytical workloads, these slower write times are more than made up for by fast read times. However, in situations where data freshness and event latency matter most (e.g., in the tens-of-milliseconds range), it can be worth using a row-based format without statistics, such as Avro or CSV.


Apache Parquet vs. CSV

CSV (comma-separated values) is likely the most widely used file format in the world at large. It’s the format used by familiar applications like Microsoft Excel and Google Sheets, CSV files are easy to open for human review, and many data analysts are comfortable working with even large CSV files. Here is when it’s better to use Parquet than CSV.

Analytics

Parquet is better suited than CSV for OLAP (analytical processing) workloads. CSV is a good interchange format because it’s text-based and a very simple standard that has been around for a long time. Parquet’s columnar structure and statistics, however, let queries zero in on the most relevant data by reading only a subset of columns and row groups. By contrast, a row-based format like CSV requires reading the entire file, and when a table is composed of many CSV files, the entire table or partition.

Where Performance Matters

Because a query typically needs only a subset of columns rather than the entire dataset, Parquet lets engines skip the columns, row groups, and even entire files the query doesn’t ask for. Combined with Parquet’s much better compression, query responses are much faster with Parquet than with CSV.

More Cost-Effective

Parquet can deliver significant cost savings over CSV, in both I/O and storage. For example, after converting from CSV to Parquet, data-as-a-service firm Veraset cut its data pipeline run time in half and reduced the infrastructure required, for an annual savings of half a million dollars.
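As an illustration of how simple the conversion itself can be, here is a minimal sketch using pyarrow; the file names are hypothetical, and a real pipeline would add schema handling and partitioning:

```python
import pyarrow.csv as pcsv
import pyarrow.parquet as pq

# Read the CSV export and rewrite it as compressed Parquet.
table = pcsv.read_csv("events.csv")
pq.write_table(table, "events.parquet", compression="zstd")
```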


Apache Parquet vs. Apache Avro

Apache Avro is also a binary file format, like Parquet. However, Avro is a row-based format, similar to CSV, and was designed to minimize write latency. Avro files typically contain far fewer rows per file than Parquet files, sometimes even just one row per file. The most common use case for Avro is streaming data.

All of the considerations above for Parquet vs. CSV also apply to Parquet vs. Avro.

Advantages of Avro over CSV

Apache Avro files embed the schema of the records in the file itself. This is especially useful in streaming analytics, where the schema of the application producing the data can change over time.
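To illustrate the embedded schema, here is a hedged sketch using the third-party fastavro library (one of several Avro libraries for Python); the record schema and file name are made up for the example:

```python
from fastavro import parse_schema, reader, writer

schema = parse_schema({
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "action", "type": "string"},
    ],
})

# The schema is written into the file alongside the records...
with open("events.avro", "wb") as out:
    writer(out, schema, [{"user_id": 1, "action": "click"}])

# ...so a reader can decode the records without any external schema.
with open("events.avro", "rb") as f:
    for record in reader(f):
        print(record)
```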

Avro is also a binary format, so it generally offers better read performance than text-based CSV.

Advantages of CSV over Avro

The main advantage of CSV over Avro is that it is a text-based file format. That makes it human-readable, which makes it easier to exchange smaller files with people who aren’t familiar with the data, such as other business units or external organizations.
