Building a Data Factory: A Generic ETL Pipeline Utility Case Study

At FactSet, a leading provider of content in financial services, we are focused on continuously improving our data pipeline and data fetch APIs. Most pipeline utilities, such as Flink and Spark, require writing code against their APIs to define a pipeline. To cover the breadth of content we offer, various departments have had to write custom ETL code to add value at different stages of the content enrichment process. To standardize and simplify this common workflow, we decided to create a configuration file-based utility that still gives us the granular control we need but encapsulates data movements and flows in centralized config files, reducing or eliminating the disparate custom ETL scripts. This case study examines why we chose Golang and Apache Arrow to mix new data with our legacy sources and existing stack as we modernize our fetch code paths, and discusses the other technologies we used to do so.

Topics Covered

Apache Arrow Flight

Dremio Subsurface for Apache Arrow

In-Memory Formats

Interfaces

Speakers

William Whispell

I enjoy working on data storage and retrieval. I’ve worked on various open source databases and our internal time series database. A lot of my work has been around platform migrations and breaking up monoliths into SOA. I enjoy taking on big data and low latency problems. Data flows, data pipelines, structured streaming and CDC are some of my areas of interest. I also enjoy finding creative ways to automate repetitive data operations engineers face on a regular basis.

Matt Topol

Hailing from the faraway land of Brentwood, NY, and currently residing in the rolling hills of Connecticut, Matt Topol has always been passionate about software. After graduating from Brooklyn Polytechnic (now NYU-Poly), he joined FactSet Research Systems, Inc. in 2009 to develop financial software. In the time since, Matt has worked in infrastructure and application development, has led development teams, and has architected large-scale distributed systems for processing analytics on financial data. Matt is a committer on the Apache Arrow repository, frequently enhancing the Golang library and helping to grow the Arrow Community. Recently, Matt wrote the first and only book on Apache Arrow, “In-Memory Analytics with Apache Arrow,” and joined Voltron Data in order to work on the Apache Arrow libraries full-time and grow the Arrow Golang community.

In his spare time, Matt likes to bash his head against a keyboard, develop/run delightfully demented games of fantasy for his victims–er–friends, and share his knowledge with anyone interested who’ll listen to his rants.
