Data Lake Technologies
Apache Iceberg is an open table format that lets multiple applications work on the same data with transactional consistency.
Project Nessie provides Git-like semantics for data lakes and enables a single transaction spanning operations from multiple users and analytics engines.
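To make "Git-like semantics" concrete, here is a minimal sketch of the idea, not Project Nessie's API: the `ToyCatalog` class and all of its method names are hypothetical. It models branches as maps from table names to snapshot identifiers, so that several table updates made on a branch reach `main` in one step.

```python
# Toy model of Git-like catalog semantics. This is NOT Project Nessie's API;
# ToyCatalog and every method name here are hypothetical.


class ToyCatalog:
    def __init__(self):
        # Each branch maps table names to a snapshot identifier.
        self.branches = {"main": {}}

    def create_branch(self, name, source="main"):
        # A new branch starts as a copy of the source branch's table pointers.
        self.branches[name] = dict(self.branches[source])

    def commit(self, branch, changes):
        # Apply several table updates to one branch as a single step.
        updated = dict(self.branches[branch])
        updated.update(changes)
        self.branches[branch] = updated

    def merge(self, source, target="main"):
        # Publish every pointer from the source branch onto the target at once.
        # Simplification: no conflict detection, the source always wins.
        merged = dict(self.branches[target])
        merged.update(self.branches[source])
        self.branches[target] = merged


catalog = ToyCatalog()
catalog.commit("main", {"orders": "snap-1", "customers": "snap-1"})

# Work on an isolated branch; readers of "main" are unaffected meanwhile.
catalog.create_branch("etl")
catalog.commit("etl", {"orders": "snap-2", "customers": "snap-2"})
main_before_merge = dict(catalog.branches["main"])

# Both tables become visible on "main" together.
catalog.merge("etl")
main_after_merge = dict(catalog.branches["main"])
```

Until the merge, `main` still points at the old snapshots of both tables, which is the isolation property the entry describes. A real catalog would also record commit history and detect conflicting changes, which this sketch leaves out.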
Apache Arrow provides a columnar in-memory format for flat and hierarchical data. It supports zero-copy reads for fast data access without serialization overhead.
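The two ideas in this entry, columnar layout and zero-copy reads, can be illustrated with Python's standard library alone. This is only an analogy under stated assumptions: it uses `array` and `memoryview` rather than Arrow itself, and the sample values are made up.

```python
from array import array

# Row-oriented layout: one tuple per record (illustrative data).
rows = [(1, 10.5), (2, 20.0), (3, 7.25)]

# Column-oriented layout: one contiguous typed buffer per field.
ids = array("q", [r[0] for r in rows])      # 64-bit signed integers
amounts = array("d", [r[1] for r in rows])  # 64-bit floats

# A memoryview exposes the buffer without copying it,
# and slicing the view does not copy either.
view = memoryview(amounts)
tail = view[1:]

# A write through the original array is visible through the view,
# which shows that both names refer to the same memory.
amounts[2] = 8.0
total_tail = sum(tail)
```

Because `tail` is a window onto the same buffer, a consumer can read a column, or part of one, without any serialization or copy step. Arrow applies the same principle with a standardized layout that different languages and processes can share.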
Apache Arrow Flight
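Apache Arrow Flight is an open source RPC framework, built on gRPC, for transferring Arrow-formatted data over the network. Because it avoids row-by-row serialization, it is reported to move large datasets up to ten times faster than ODBC or JDBC, including Python wrappers such as pyodbc.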
Amazon S3 (Simple Storage Service) is an AWS object storage service that is commonly used as the storage layer of a data lake for analytics.
Amundsen is a metadata-driven application that indexes data resources (tables, dashboards, streams, etc.) and powers a PageRank-style search based on usage patterns.
Apache Airflow is an open-source workflow management platform, originally developed at Airbnb, for programmatically authoring, scheduling and monitoring workflows.
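Authoring a workflow programmatically amounts to declaring tasks and their dependencies as a directed acyclic graph and letting a scheduler derive the run order. The sketch below shows that idea with Python's standard `graphlib` module (Python 3.9+). It is not Airflow's API, and the task names and the `run_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each key lists the tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def run_order(deps):
    # Return the tasks in an order that respects every dependency.
    # Raises graphlib.CycleError if the graph contains a cycle.
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
```

A workflow platform adds what this sketch omits: time-based scheduling, retries, parallel execution of independent tasks, and monitoring of each run.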
Apache Parquet is a columnar storage format compatible with your choice of data processing framework, data model or programming language.
Apache Spark is an analytics engine that provides an interface for programming clusters with implicit data parallelism and fault tolerance.
AWS Glue is a fully managed extract, transform and load (ETL) service that automates the time-consuming data preparation process for data analysis.
AWS Lake Formation
AWS Lake Formation is a fully managed service that makes it easier to bring data into a data lake from various sources using pre-defined templates.
Azure Data Lake Storage (ADLS)
Microsoft Azure Data Lake Storage (ADLS) is a fully managed, elastic, scalable and secure file system suitable for storing a large variety of data.
Hive metastore (HMS) stores metadata related to Apache Hive and other services in a backend RDBMS, such as MySQL or PostgreSQL.