Migrating to Parquet – The Veraset Story
Veraset is a data-as-a-service (DaaS) company that delivers petabytes of geospatial data to customers across a variety of industries. We build and manage a central data lake that houses years of data, and we operationalize that data to solve our customers’ problems. I recently gave a talk on the specifics of file formats at Spark+AI Summit 2020, and it generated a lot of questions about my company’s migration from CSV to Apache Parquet. As CTO of a DaaS company, I saw firsthand how dramatically this migration affected all of our customers. This session will drill into the operational burden of changing the storage format across a data ecosystem and the impact of that change on the business.
Vinoo Ganesh is Chief Technology Officer at Veraset, a data-as-a-service startup focused on understanding the world from a geospatial perspective. Vinoo previously led the compute team at Palantir Technologies, which was responsible for Spark and its interaction with HDFS, S3, Parquet, YARN, and Kubernetes across the company. Most recently, that team was closely involved in pushing forward a number of open source Spark initiatives, including a DataSource V2 implementation and the external shuffle service.