Building a Data Factory: A Generic ETL Pipeline Utility Case Study
FactSet, a leading provider of content in financial services, is focused on continuously improving its data pipelines and data-fetch APIs. Most pipeline frameworks, such as Flink and Spark, require writing code against their APIs to define a pipeline, and to cover the breadth of content we offer, various departments have had to write custom ETL code to add value at different stages of the content enrichment process. To standardize and simplify this common workflow, we built a configuration-file-based utility that retains the granular control we need while encapsulating data movements and flows in centralized config files, reducing or eliminating the disparate custom ETL scripts. This case study examines why we chose Golang and Apache Arrow to mix new data with our legacy sources and existing stack as we modernize our fetch code paths, and discusses the other technologies we leveraged to do so.
I enjoy working on data storage and retrieval. I’ve worked on various open source databases and our internal time series database. A lot of my work has been around platform migrations and breaking up monoliths into SOA. I enjoy taking on big data and low latency problems. Data flows, data pipelines, structured streaming and CDC are some of my areas of interest. I also enjoy finding creative ways to automate repetitive data operations engineers face on a regular basis.
Hailing from the faraway land of Brentwood, NY, and currently residing in the rolling hills of Connecticut, Matt Topol has always been passionate about software. After graduating from Brooklyn Polytechnic (now NYU-Poly), he joined FactSet Research Systems, Inc. in 2009 to develop financial software. In the time since, Matt has worked in infrastructure and application development, has led development teams, and has architected large-scale distributed systems for processing analytics on financial data. Matt is a committer on the Apache Arrow repository, frequently enhancing the Golang library and helping to grow the Arrow community. Recently, Matt wrote the first and only book on Apache Arrow, “In-Memory Analytics with Apache Arrow,” and joined Voltron Data to work on the Apache Arrow libraries full-time and grow the Arrow Golang community.
In his spare time, Matt likes to bash his head against a keyboard, develop/run delightfully demented games of fantasy for his victims–er–friends, and share his knowledge with anyone interested who’ll listen to his rants.