Datanami: Big Data File Formats Demystified

May 16, 2018

Big data has many file formats to choose from, and knowing which is best for a given use case can be challenging. Datanami covered the topic in its article “Big Data File Formats Demystified.”

Alex covers some of the key formats and mentions Dremio as the company behind Apache Arrow:

“Different Hadoop applications also have different affinities for the three file formats. ORC is commonly used with Apache Hive, and since Hive is essentially managed by engineers working for Hortonworks, ORC data tends to congregate in companies that run Hortonworks Data Platform (HDP). Presto is also affiliated with ORC files.

Similarly, Parquet is commonly used with Impala, and since Impala is a Cloudera project, it’s commonly found in companies that use Cloudera’s Distribution of Hadoop (CDH). Parquet is also used in Apache Drill, which is MapR’s favored SQL-on-Hadoop solution; Arrow, the file-format championed by Dremio; and Apache Spark, everybody’s favorite big data engine that does a little of everything.”
