May 3, 2024
Building an Open Source Data Strategy with Dremio
Open-source technology is fundamentally collaborative and transparent in nature. It fosters innovation, flexibility, and community-driven development for more robust and accessible solutions. Learn how the Dremio Unified Analytics Platform can be a core part of your open source data strategy. We’ll review the role of open-source technologies in shaping modern data strategies and the benefits they offer. We’ll also learn how Dremio harnesses open-source tools, including its Apache Iceberg-native data catalog that uses Project Nessie, and its foundational use of Apache Arrow for in-memory analytics and Apache Arrow Flight for high-performance data transfer.
Transcript
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
JB Onofre:
All right, welcome everyone to this session, Building an Open Source Data Strategy with Dremio. I will do the introduction myself because there’s a small issue, so no worries, I will do that. So I’m JB, Principal Software Engineer at Dremio. I’m also an ASF member, so Apache Software Foundation member and Board of Directors member. I’m a PMC member on roughly 20 Apache projects in different areas, and I’m a contributor on Apache Iceberg, Apache Arrow and others. So this session will give you a kind of overview and some high-level details about the open source projects that you can leverage with Dremio and how you can integrate them into your Dremio ecosystem.
Data Strategy and Data Lakehouse
So first, when we talk about data strategy and data lakehouses, we can split the overview of the ecosystem into two main areas. The first area is about data warehouses, I would say the classic topology we know. So we have a proprietary engine, proprietary format, proprietary storage. We have different data sources and you basically bring your data to the warehouse. This approach is pretty expensive and you have a kind of black box, meaning that you don’t know exactly what is inside; the only thing you know is basically the data you push there. So you have a kind of vendor lock-in, and it’s also pretty hard for your tools to interact with the data warehouse.
What we also see is that we basically bring the data to those proprietary engines, and that’s where the silos come from. So if you have different business use cases in your organization, then you will have to have different warehouses. The newer approach is the data lakehouse. There, you have a very modular architecture where we have the storage on one side, the table format on another side, and the engines, and everything is decoupled. It means that most of the time we now bring the engines close to the data. So we don’t have any vendor lock-in and the cost is reduced because the architecture matches exactly what you need. Obviously, sometimes it can be a little bit complicated to set up this architecture because you have different components, so you have to choose the right components and also know how to set them up.
If we summarize the components of a lakehouse, on the bottom we have a storage layer. The storage layer is basically an object storage, so it could be S3, it could be any object store you can find. On top of the storage, you have the file format. In this situation, it is usually Apache Parquet; it could also be ORC or even Avro files, but Parquet is the de facto standard, I would say. The file format is really about storing the data in files, and on top of that you have the table format, which is basically how you organize the data, how it lives in different tables. So here we’re gonna talk about Apache Iceberg, Delta Lake, Apache Paimon, Apache Hudi, any kind of table format. On top of that, we have a catalog, which is mostly related to Apache Iceberg. The catalog is how you track your tables, views, and namespaces. So you have different catalogs like Nessie, a JDBC catalog, et cetera. And finally, on top of that, leveraging the catalog and table format, you have the query engines. The query engine is really how you read and write data to your Iceberg tables and views at scale. It could be Dremio, it could be Spark, it could be Flink. So those are the classic components you find in a data lakehouse.
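As an illustration, here is a minimal sketch of how those layers can map onto a Spark session configured for Apache Iceberg. The catalog name, catalog endpoint, and bucket are hypothetical placeholders, not something from the talk.

from pyspark.sql import SparkSession

# Illustrative only: Spark as the query engine, an Iceberg REST-style catalog,
# Iceberg as the table format, Parquet data files on object storage.
spark = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "rest")
    .config("spark.sql.catalog.lakehouse.uri", "http://localhost:8181")          # placeholder catalog endpoint
    .config("spark.sql.catalog.lakehouse.warehouse", "s3://my-bucket/warehouse")  # placeholder object storage
    .getOrCreate()
)

# Table format layer: an Iceberg table whose data files default to Parquet.
spark.sql("CREATE TABLE IF NOT EXISTS lakehouse.demo.events (id BIGINT, ts TIMESTAMP) USING iceberg")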
Why Does Open Source Matter?
And why does open source matter in this ecosystem? There are different benefits, in my opinion, to open source. I tried to categorize these benefits into three main personas. The first persona is companies and organizations that contribute to this open source stack. The first good thing about open source is the impact on development and direction: you can directly know where the project is going, and you can also propose new things on the project that fit exactly your use cases. You can also save on expensive resources and go to market faster by leveraging open source components. You get a tool which is faster to set up and to use, with no expensive license costs and so on. You also have increased vision and revenue market share: the increased vision is really seeing where the market is going and adapting to this direction. Another aspect is interoperability, where you can have something that is pluggable with all the components in the ecosystem.
For contributors, so people that contribute directly to these open source projects, you are not alone, you are part of the community. It’s pretty interesting because you can learn from the community and you can interact with the community. So it’s pretty amazing. You also drive innovation: instead of just following the project, you contribute directly to the product, so it’s very interesting to follow the new directions and innovations that pop up from these different projects. For users, so basically users could be individuals or could be companies using open source projects, you don’t have any vendor lock-in. I will explain in a few seconds why open governance matters when it comes to vendor neutrality and vendor lock-in. You also have better, more reliable software, simply because the open source stack is heavily used and tested, which means potentially more tests, more security checks, and so on. And you are also leveraging open standards, meaning that if a provider you are using is based on an open source project, you can switch to another one pretty easily.
I said that no vendor lock-in is a benefit, but I would like to put some perspective on that. Having a project with an open source license is not a guarantee of open governance. It means that the license can be changed at any time. We have recent stories about that, where some projects that were open source up to now decided, for different reasons, to change their license. So the project governance is not necessarily open and community-driven. Even if a project uses an open source license, vendor neutrality is questionable, because if you have a single vendor involved in a project that decides everything, it’s not really open governance. And that’s why I think that open source foundations like the Apache Software Foundation are a guarantee of open governance and vendor neutrality, because the governance is delegated to the foundation level and to a group of people who are the PMC members. So I think that foundations like Apache are a perfect house for open standards, collaborative development, and community-driven governance.
Table Format: Apache Iceberg
That said, let’s take a look at the first component in the lakehouse ecosystem, Apache Iceberg. I guess you have learned a lot about Apache Iceberg already, but I just wanted to clarify what Apache Iceberg is. On the bottom, you have data files, so let’s take the example of Parquet files: the data file is really where the data is stored. On top of that, you have a manifest. A manifest represents a bundle of data files; it’s not only the location of the data files, but also a lot of statistics about the data files. For instance, you can use these statistics to filter and skip a file if it does not contain any data that is relevant to your queries. So the manifest is really the location of the data files plus statistics. And these manifests are aggregated in something named a manifest list. This manifest list is actually a snapshot in the Iceberg wording, and it aggregates the statistics for each manifest inside the same snapshot. This metadata is used to apply partition pruning on the queries, to skip manifests based on partition values, and everything that is used by the query planner to optimize the query.
On top of that, you have the metadata.json. The metadata.json basically contains all the details about the table, such as the partitioning scheme, the schema of the table, and also the history of all the snapshots. So remember, a manifest list corresponds to a snapshot, and the metadata.json keeps the list of snapshots. On top of that, you have the Iceberg catalog. The catalog tracks the list of tables with a reference to their location; basically, the catalog contains the location of the latest metadata.json that is used for the table.
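To make this concrete, here is a small sketch using the PyIceberg library to peek at that metadata hierarchy. The catalog name, catalog URI, and table name are placeholders made up for illustration.

from pyiceberg.catalog import load_catalog

# Hypothetical REST catalog and table, just to show the metadata layers.
catalog = load_catalog("lakehouse", **{"type": "rest", "uri": "http://localhost:8181"})  # placeholder
table = catalog.load_table("demo.events")

# metadata.json level: schema, partition spec, and snapshot history.
print(table.schema())
print(table.spec())
print(table.current_snapshot())        # the snapshot (manifest list) currently pointed to
for snapshot in table.snapshots():     # the snapshot history kept in metadata.json
    print(snapshot.snapshot_id, snapshot.timestamp_ms)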
This is how Apache Iceberg works, and thanks to that we have different features brought by Apache Iceberg. First, we have ACID transactions, so atomicity, consistency, isolation, and durability, which are guaranteed through the usage of snapshots and the catalog. Schema evolution allows you to add, drop, update, or rename columns, so basically change the table schema. And thanks to the snapshots and the manifest lists, you have time travel, where you can have multiple queries working on different states of your data, thanks to the Iceberg metadata that tracks the snapshots and the history of the data. Hidden partitioning is all about partitioning the tables to get better performance, and it also prevents the user mistakes that lead to incorrect results or extremely slow queries. Iceberg tracks partitions not only by the raw value of a column, but also by a transformed value. So for instance, if you have one column which contains a timestamp, you can derive a date from it and use this date for the partitioning.
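As a hedged sketch of how hidden partitioning and time travel can look in practice, here is the kind of Spark SQL you can run against an Iceberg table. It assumes a Spark session configured for Iceberg as in the earlier sketch; table names, snapshot id, and timestamp are placeholders.

from pyspark.sql import SparkSession

# Assumes the Iceberg catalog configuration from the earlier sketch.
spark = SparkSession.builder.getOrCreate()

# Hidden partitioning: partition by a transform of the ts column, so queries
# filter on ts without needing to know the physical partition layout.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lakehouse.demo.events (id BIGINT, ts TIMESTAMP)
    USING iceberg
    PARTITIONED BY (days(ts))
""")

# Time travel: query the table as of an earlier snapshot id or point in time.
spark.sql("SELECT COUNT(*) FROM lakehouse.demo.events VERSION AS OF 1234567890").show()            # placeholder snapshot id
spark.sql("SELECT COUNT(*) FROM lakehouse.demo.events TIMESTAMP AS OF '2024-05-01 00:00:00'").show()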
So thanks to that, you have these kinds of features provided by Apache Iceberg. Obviously you can use Apache Iceberg in Dremio, and Dremio actually uses Apache Iceberg under the hood for you, but you also have additional layers to interact with Dremio. The first one is Apache Arrow.
Caching and IPC: Apache Arrow
So Apache Arrow is interesting in two cases. The first case is about the caching mechanism you have in Dremio. So basically, what is Apache Arrow? The first thing about Apache Arrow is that it’s an in-memory columnar format spec. On the picture I put there, you can see that I have session ID, timestamp, and source IP with different rows. We have four rows, and you can see that normally, in a traditional memory buffer, the data is organized row by row, with first the session ID, then the timestamp, then the source IP, which is not super efficient when you have to scan or interact with a large chunk of data. The Arrow memory buffer uses another approach, which is column-based: each field is gathered into a column across all rows, so the session ID column gathers all the session IDs for all rows. So this is the spec, and you have different implementations and libraries that implement the spec: you have Arrow in Java, in C++, et cetera. Basically, these memory buffers are what we name Arrow vectors. And the big advantage of Arrow vectors in caching is that they reduce the distance to data, the DTD, meaning that instead of recreating the data every time, you can have a kind of cache, and the time and resources needed to request the data are very optimized.
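For illustration, here is a minimal sketch of that columnar layout using the pyarrow library; the column names mirror the example from the slide, and the values are made up.

import pyarrow as pa

# Each field lives in its own contiguous Arrow vector (column),
# instead of being interleaved row by row in memory.
batch = pa.record_batch(
    [
        pa.array([1, 2, 3, 4]),                                      # session_id
        pa.array(["2024-05-03T10:00", "2024-05-03T10:01",
                  "2024-05-03T10:02", "2024-05-03T10:03"]),          # timestamp
        pa.array(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]),  # source_ip
    ],
    names=["session_id", "timestamp", "source_ip"],
)

print(batch.num_rows, batch.num_columns)  # 4 rows, 3 columns
print(batch.column(0))                    # one contiguous column of session IDs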
Arrow is part of the core of the Dremio engine, but for you to interact with Dremio and to create your open source ecosystem, what you can use is Arrow Flight. Imagine you want to communicate with Dremio and get some data from a SQL query on a Dremio engine; then you can use Arrow Flight. Arrow Flight is simply a client-server framework that simplifies the transport of large data sets. It’s basically based on the Arrow columnar format, so Arrow record batches over gRPC. You can see there that first you have a GetFlightInfo, which is sent by the client, then the server returns the FlightInfo, and then the client does a DoGet. And so you have a lot of flight data, but it’s streamed from the server to the client.
So basically, the data-related parts of the gRPC messages are optimized messages represented as Arrow record batches. The good thing is that endpoints can be read by clients in parallel. You can see there that I have data located in different Dremio executors, for instance. The client gets the FlightInfo first from the planner, and then it can do a DoGet on multiple data chunks at the same time. So you can read data in parallel, which is more efficient. Basically, the planner delegates the work and handles data locality and load balancing, and the nodes, so basically the executors, form a distributed cluster where each node can take a different role: a planner answering GetFlightInfo, or a data streamer answering DoGet, and so on. This is the picture on the bottom there. But sometimes raw Flight is not so easy to write from your standpoint, from a client standpoint.
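Here is a hedged sketch of that GetFlightInfo / DoGet flow using the pyarrow.flight client. The endpoint location and the command are placeholders, and a real Dremio deployment would also require authentication, which is omitted here.

import pyarrow.flight as flight

# Placeholder location; a real deployment needs authentication as well.
client = flight.FlightClient("grpc://localhost:32010")

# 1. Ask the planner which endpoints hold the result (GetFlightInfo).
descriptor = flight.FlightDescriptor.for_command(b"SELECT * FROM demo.events")
info = client.get_flight_info(descriptor)

# 2. Stream the Arrow record batches from each endpoint (DoGet).
#    Endpoints are independent, so they could also be read in parallel.
tables = []
for endpoint in info.endpoints:
    reader = client.do_get(endpoint.ticket)
    tables.append(reader.read_all())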
Apache Arrow Flight SQL
So on top of that, you can use Arrow Flight SQL. Arrow Flight SQL is simply this: as soon as you have a database or backend that uses SQL, like Dremio, you can leverage Arrow Flight directly to interact with it using the SQL language. It’s basically a set of standardized commands for a SQL interface. You can do query execution, prepared statements, and database catalog metadata, like getting tables, columns, data types, and you have all the SQL syntax capabilities, et cetera. It’s language and database agnostic, meaning that the same client implementing Arrow Flight SQL can interact with any Arrow Flight SQL server. So the code that you write today to work with Dremio, if later you switch to another engine that also supports Arrow Flight SQL, your client doesn’t change, only the location changes. There is no need to update the client-side driver, for instance.
And Arrow Flight SQL is also wrapped into JDBC and ODBC drivers. It’s pretty convenient because you can directly add the JDBC driver to your Java code, and it’s gonna work out of the box. So this is an example of a JDBC URL that you can use in your client code in Java: you can see jdbc:arrow-flight-sql, and then directly you will be able to use any kind of server that supports Arrow Flight SQL. Dremio obviously supports Arrow Flight SQL, so you can directly interact with Dremio using Arrow Flight SQL, and you also have JDBC and ODBC drivers available.
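As a hedged, illustrative sketch in Python (mirroring what the JDBC driver does in Java), an Arrow Flight SQL endpoint can also be reached through the ADBC Flight SQL driver. The host, port, and credential option keys below are assumptions, not something from the talk, and may differ by driver version.

import adbc_driver_flightsql.dbapi as flightsql

# Placeholder endpoint and credentials; option keys may vary by driver version.
conn = flightsql.connect(
    "grpc://localhost:32010",
    db_kwargs={"username": "dremio_user", "password": "dremio_password"},
)

cursor = conn.cursor()
cursor.execute("SELECT 1")   # results come back as Arrow data under the hood
print(cursor.fetchall())
cursor.close()
conn.close()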
Catalog: Nessie
So we have the table format, which is basically Apache Iceberg in Dremio. You can interact with Dremio using Arrow Flight SQL through JDBC or ODBC, or directly with Flight. And finally, the last piece is the catalog. I’m gonna talk about Nessie because Dremio Arctic is based on Nessie. And as it has been announced yesterday, Nessie is gonna be part of the Enterprise Edition of Dremio.
So Nessie is more than an Apache Iceberg catalog, because you have the access and control of tables and views, but you also have version control. It’s inspired by Git; it’s data as code. You have commits: a commit is a consistent snapshot of all tables at a particular point in time. Remember that in Iceberg we have snapshots per table, because it’s in the table metadata; with a commit in Nessie, you can have a snapshot across multiple tables, because it’s at the catalog level. You have branching: a branch is a human-friendly reference to a commit, and since a commit is a snapshot, you can create a branch based on a commit. You can also tag a commit: a tag is a human-friendly reference pointing to a particular commit. And you have a hash, which is a hexadecimal string representation of one commit.
The benefit of this approach is that you have isolation. Thanks to branching, you can experiment with data without impacting other users: you create a branch with your data, you make some changes to the data, and then you decide to delete your branch or merge your changes back into another branch. You can also ingest, transform, and test the data before exposing it to other users. So imagine you create a branch, you apply some changes, it’s good, you verify, then you can merge. It’s a kind of data staging you can do. You also have data version control thanks to commits and tags, so you have history based on time and tags. It means that you can see everything that happened on your branch and decide to go back to a point in time or back to a tag. So it’s super interesting when you have to recover from a mistake: for instance, if you changed data that you should not have, you can go back to the previous tag or point in time.
What is also super interesting, with the hash, is the governance aspect. All changes to the data and metadata are tracked, so you know exactly who made a change, what data has been changed, and when. So it’s pretty convenient for doing impact analysis on your data.
This is an example of how it works. This is pseudocode, so it’s slightly different in Nessie, but it gives you an idea of how it works. Out of the box, as soon as you start Nessie, you have a main branch. On this main branch, I create a table, I insert two records into this table, then I create a second table and insert two records into that table as well. If I do a Nessie log, I will have a list of all the hashes. You can see that I have hash one corresponding to the first table being created, hash two when the data has been added to the table, hash three when table two has been created, et cetera. And then what you can do is create a tag: you do a Nessie tag, my tag, on top of hash four, so your tag, my tag, will correspond to hash four. And you can see that if I insert a record into table one and then select count from this table, I have three records. And if I do the same but at the time of the tag, I only have two records, corresponding to the first two records I inserted at the beginning. So you can see that, thanks to that, you can manipulate your data as code.
And the good thing is that Nessie works with all engines. So you can start to use Nessie with Spark or Flink, and then obviously you can use Dremio, because Nessie is fully integrated into Dremio. You can see that I create a branch, I use the branch, I insert data, and you can do the equivalent using a DataFrame in Spark or the table API in Apache Flink, as in the sketch below. Okay.
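As a hedged sketch of that branch-and-merge workflow, here is roughly what it can look like with the Nessie Spark SQL extensions. The catalog name "nessie", the branch, and the table names are placeholders, and the exact syntax may vary by Nessie version.

from pyspark.sql import SparkSession

# Assumes a Spark session configured with the Nessie catalog and the
# Nessie Spark SQL extensions; names below are illustrative placeholders.
spark = SparkSession.builder.getOrCreate()

spark.sql("CREATE BRANCH IF NOT EXISTS etl IN nessie")        # branch off main
spark.sql("USE REFERENCE etl IN nessie")                      # work in isolation on the branch
spark.sql("INSERT INTO nessie.demo.table1 VALUES (3, 'c')")   # stage changes on the branch only

# Verify on the branch, then publish the changes back to main atomically.
spark.sql("MERGE BRANCH etl INTO main IN nessie")
spark.sql("DROP BRANCH etl IN nessie")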