Building a Data Factory: A Generic ETL Pipeline Utility Case Study
Thursday, July 22 2021
FactSet, a leading provider of content in financial services, is focused on continuously improving our data pipeline and data fetch APIs. Most pipeline utilities like Flink and Spark require writing code to their API to define the pipeline, and to cover the breadth of content we offer, various departments have had to write custom ETL code for adding value at various parts of the content enrichment process. In order to standardize and simplify this common workflow, we decided to create a configuration file-based utility that still gives us the granular control we need, but allows encapsulation of the data movements and flows in centralized config files, reducing or eliminating the disparate custom ETL scripts. This case study will examine why we chose to leverage Golang and Apache Arrow to mix new data with our legacy sources and existing stack as we modernize our fetch code paths, and discuss other technologies we leveraged in order to do so.