42 minute read · January 31, 2018
The Heterogeneous Data Lake
Tomer Shiran · Founder & Chief Product Officer, Dremio
Webinar Transcript
Tomer Shiran:
My name is Tomer Shiran. I'm the co-founder and CEO of Dremio. Dremio is a stealth data analytics startup, so we're not talking about what Dremio's doing today, but I'll give you a little bit of background about myself, as well as my co-founders. We actually started Dremio last year. Before that, I was a VP of Product at MapR, and joined there when the company was five people. I co-founded Dremio with Jacques Nadeau, who is the creator of Apache Arrow and Apache Drill, and Julien Le Dem joined us as an architect from Twitter. He's the creator of Apache Parquet, which is a columnar format that's widely used in the Hadoop ecosystem.

This talk is about the rise of heterogeneous data and how we can address that, and it offers some approaches to being able to query data across different sources. Basically, what we've all experienced in the last, let's say, five to ten years is the rise of these non-relational databases. If you go back ten years, it was all relational databases. Everybody was using just Oracle, SQL Server and maybe a few others, and you just had a lot of those, and that was the de facto standard place to store the data. Now you look at most companies, and they have Hadoop, they have HBase, they have Elasticsearch, and Solr, and Cassandra, and MongoDB, and some people have data in the cloud on systems like S3, Azure Blob Storage, Google Cloud Storage, all these different systems. And what's interesting here is that these are not relational databases.

For the most part, they don't support SQL. They don't support the ability to join data. They have various levels of execution capabilities, so Mongo has more, while something like S3 doesn't have any execution capabilities. But it's a great world to be a developer in. For those who are building applications, you can go and download these things. They're free. They're open source, they're distributed, they scale, and they have great APIs, so you can go download them and use them. That's why there's a lot of adoption. They also provide a lot of developer agility, as I like to call it. But the challenge here is that it's really hard to analyze the data now. In many ways, the analysts have kind of gone back in time, where they can't just connect a BI tool to these systems and query and analyze the data, right?

If you ever tried to do this, to connect, let's say, a tool like Tableau, or Power BI, or Qlik, or something like that to these sources, you either can't connect to them at all or it's way too slow for any kind of interactive analysis. And then you also get these complexities where, in many of these systems, people are storing complex JSON, nested structures, arrays and things like that, and that doesn't map well to the relational model, which is what the BI tools expect. And then you also have situations where, because the systems don't enforce schemas, you end up having data that's not clean, right? People put data in MongoDB where a field is generally a string, but in a few places it's an empty map instead of null, or things like that. These are real-world examples I've seen. So what do you try to do? You try to ETL the data into a single source, right? You try to ETL it into something like a Teradata, or a Vertica, or something like that, or Redshift in the cloud.

And so, in the relational world, that was kind of doable. It was still difficult, right? You'd take an off-the-shelf ETL tool, something like Informatica, and you'd ETL the data. That was kind of possible.
One of the nice characteristics there was that the source systems were all relational and had SQL, and the destination system was typically also a relational system, so you didn't lose any of the processing capabilities. It was kind of a one-to-one mapping, and then all the data was flat, so you could pretty easily do that kind of mapping. Fixed schemas were there as well. But even then ETL was hard, so think about this new world where you have data in all these weird systems, right? How do we even go about getting the data into a single place where we can analyze it? It's really complex.

You end up having to have many data engineers on staff to actually build custom code that does this, and they of course hate this kind of work, where analysts are coming up to them and asking, "Can you also add this one other table from my Hadoop cluster into my Teradata? Can you ETL this, can you change this?" I was talking to a company that had 5,000 unknown tables in their data warehouse that got created over time and nobody really paid any attention to. The other thing is you lose all the capabilities of the source system. So, if you put data, let's say log files, in Elastic, you put them there because you wanted search on those messages. Once you put them into a relational data warehouse, you've lost those search capabilities. And then there's also an issue of scalability. Many of these systems, on the circle here, are scalable, distributed systems. A lot of times you have many, many terabytes or even petabytes of data there, and getting all that into a single place is really hard. A lot of times, impossible.

I think people are starting to realize now that ETL is the big problem. It was always a big problem, but now it's becoming an even bigger problem because of the nature of the systems that companies are using. I talked to one company, a very large Fortune 100 IT company in the Bay Area in California, and they had replaced Oracle with a combination of Cassandra and Elasticsearch, for example, and they lost the ability to do analysis on that data. It was their order purchasing system. So how can we try to address this a little bit better? That's what I'd like to talk about here, which is basically building this heterogeneous data lake. We've come to the conclusion that we're not going to be able to get all the data into a single system, or at least not all the time, and not have it be fresh there. We'll still maybe be able to load some of the data into, say, a data warehouse, but not all of it. So, how do we do this? We basically need a platform that enables you to analyze data across disparate data sources.

It has to be storage agnostic, and what I mean by that is the ability to connect to different data sources, whether they're relational or not relational, to be able to do joins across different systems, and also to be able to push down unique capabilities to each of those systems. If you're doing a query that leverages data in Solr, you want to be able to take advantage of Solr's search capabilities as part of that query. Ideally, it would also be client agnostic, so we wouldn't force you to use a new end-user tool. If the business analyst or the data scientist likes using R, they can use R. If they like Tableau, they can use Tableau. If they like Excel, they can keep using Excel.
So, that would be kind of the ideal world, and of course, this is all about big data, so scalability and performance … It goes without saying that we have to be able to provide those characteristics, and we have to be able to support both simple queries, as well as more complex analysis.

The first technology that is changing the game here is something that we announced about a month ago with a variety of other projects. It's called Apache Arrow, and it's really something that the entire big data ecosystem is now rallying behind as the new way to store data in-memory. There are over 15 different open-source projects now on board with using Arrow as their internal memory representation, and that includes things like Drill, Impala, Kudu, Spark, R, Python, the Pandas library, Storm, and many others. So, this is really an attempt to create a standard here where everybody uses the same columnar in-memory capabilities, which makes each of these systems go anywhere from 10 to 100 times faster, but also enables data to be moved between different systems without the serialization and deserialization overhead. So, what does Arrow enable us to do, and why is it important for creating a heterogeneous data lake? The first thing is, before we had Arrow, anytime two systems wanted to communicate, they would have to serialize and deserialize data.

It was all based on individual API calls between the systems, and that was really inefficient. In many cases, you can see that 70 to 80 percent of the CPU time was actually spent just on the copying and serializing of data. So, by having a single way in which data is represented in memory across all these different systems, we can actually get really high performance when we're talking between different systems. Imagine a Python UDF being used in a query in Drill, or in a query in Impala. Being able to do that with high performance, and not just at the marketing level of "Hey, this supports Python," is a big advantage. Similarly, if we wanted to run a Spark job on data in Kudu, being able to do that without serializing and deserializing data would be beneficial. With Arrow, what happens is that the same memory structure is used across all these different technologies, and it also means that all these different projects are now collaborating on a single library, or a collection of implementations in different languages, and we can get more resources behind a single technology.

Arrow is really designed for CPU efficiency. The reason to store data in-memory in a columnar format is that we can then take advantage of modern CPUs, which have these things called SIMD instructions, or vectorized operations, and that lets the CPU run the same operation on multiple values in an array in parallel, in a single CPU cycle. So, if we can organize the data in a columnar format, where all the session IDs are stored one after the other in memory, and then all the timestamps are stored one after the other, and then all the source IP addresses are stored one after the other, we can be much more efficient in terms of leveraging the CPU on the machines to execute these queries.

The second piece of technology we'll use here, which I'll demo today, is Apache Drill, which is a storage-agnostic query engine. Apache Drill has plugins for different data sources and allows you to query things like HDFS and MongoDB and S3, and other data sources. What it exposes at the top is basically ODBC and JDBC drivers, as well as a CLI that more technical users can use to just type SQL queries.
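As a point of reference (a minimal sketch, not from the webinar itself): if you download Drill and start a local drillbit with the bundled bin/drill-embedded shell, you can type ANSI SQL directly against a sample file that ships with the distribution through the built-in cp (classpath) plugin, and the same statement works unchanged over the ODBC/JDBC drivers from a BI tool.

    -- Runs in the Drill shell (sqlline); cp is Drill's built-in classpath plugin,
    -- and employee.json is a sample file bundled with the distribution.
    SELECT full_name, position_title, salary
    FROM cp.`employee.json`
    LIMIT 5;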
The nice thing here is that now every data source looks like a relational database. You can run SQL on top of anything, and the other thing you can do is join data across different data sources. And because Drill is a distributed system, you don't end up running into the typical performance problems that you would otherwise have when you're trying to join data between different data sources. The system can scale out as much as it needs to meet the performance requirements. There was actually an article on ZDNet a while ago that called it Omni-SQL, which is basically that … What Drill does is basically SQL on everything, so it's not really about SQL on Hadoop, it's about SQL on lots of different data sources.

I'll talk a little bit about the architecture now, about how this works. Everything starts with a drillbit; that's the daemon. So, every node in the cluster has that single daemon, and we'll talk about that later, but it can be either colocated with the data, so it could be running on the data nodes in Hadoop, or on the mongod nodes of a MongoDB cluster, or it could be not colocated in the case of, say, S3, where you obviously can't run on the data source. Drill is a high-performance query executor, and all the processing happens in-memory. In fact, the Apache Arrow technology, at least the Java implementation, was originally extracted from Apache Drill, and is now being embraced by a bunch of other projects. Basically, what happens is the BI tool, through the ODBC and JDBC drivers, will connect to one of the drillbits in the cluster. They're all identical, so it doesn't matter which one, and that's where the query gets submitted, and then it gets planned and executed across the cluster.

It also has a built-in web UI, and there's a CLI as well. What does this look like? A typical organization has all these different data systems, right? Those could be NoSQL databases, like in the examples here, or Hadoop. They can also be relational databases that you have. Those are clustered services. You may have data stored in the cloud as well, where you're not controlling the servers; it's just a service that you can connect to. And then people may have data on their desktops as well. You can run drillbits across all these different systems, either in colocated mode or separately, and what's more important is really the network speed: how long does it take a drillbit to get the data that it needs over the network? If there is latency, then it's better to put it locally, because it'll take advantage of data locality and schedule the queries to run based on where the data is, similar to a MapReduce-type approach, except with much lower latency.

The architecture of Drill internally looks something like this. You have the core execution engine, and then there's a standard API, a Java API, and you have a bunch of plugins. So, there's a plugin that implements each of the data sources, and what's nice here is that there are a lot of extensibility points where the plugin can actually override or add additional planning rules to the SQL optimizer. So, for example, if you see a scan and then a filter, and we know that that filter can be pushed down into the data source, we can actually reorganize that so that, when getting the data from MongoDB, we ask MongoDB to do that filter using its own query language, and only return the records that match.
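As a rough illustration of that kind of pushdown (a sketch, assuming a Mongo storage plugin registered under the name mongo and the Yelp dataset used later in the talk):

    -- The filter on stars can be rewritten by the Mongo plugin's planning rules into a
    -- native query, e.g. db.business.find({"stars": {"$gte": 4.5}}), so only matching
    -- documents ever leave MongoDB.
    SELECT name, stars, city
    FROM mongo.yelp.business
    WHERE stars >= 4.5;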
There’s a plugin for each of these different systems, and there are more being added, and I’ll actually show you one today that’s kind of in alpha/beta, which is the Elasticsearch plugin.There’s also a file system plugin, which allows the system to talk to the systems like anything that supports the Hadoop file system API, so that’s HDFS, MapR-FS, S3. All the different cloud vendors have HDFS compatible storage or blob storage, right? Azure has one, and Google has one. And then the file plugin also has a format plugin API, which allows us to support different file formats. Things like JSON, and Avro, and Parquet, and CSV, and now there’s an XML support I believe as well. With all this extensibility that allows us to have a single distributed query engine that has all the logic or all the capabilities around query execution, running in-memory, doing all of the columnar processing, but then being able to actually interact with all these different systems and take advantage of their unique capabilities. Hopefully, I’m gonna try to do a live demo here, connecting over the Internet, and then I’ll show you what it looks like, for example, to push down a query into Elastic.Now, one of the challenges with a live query is they have this network cable here, which apparently is for that, but they’ve stopped making network ports on computers, like, five years ago. So, we’ll try to use the Wi-fi, and then at that point, everyone will stop using their iPhones and we’ll just do that. So, what does it look like to query the data? There’s one global namespace, which has, effectively, three components. The first component is the data store. In this case, it’s a simple … We’re just doing a SELECT *, but the production … That’s the name that we’ve given to this data store. That could be a single HDFS cluster, it could be a MongoDB cluster, it could an S3 bucket, and so forth. Underneath that, we have something called a workspace. A lot of systems already have a three-tier namespace, so you think about, let’s say MongoDB, there’s databases and then you have collections, right? So, in Mongo, workspace is basically a collection. In Hive, it’s a database, if you have data in the Hive metastore. If you just have files on HDFS, you can actually define a directory to be a workspace, and then you can query files or other directories within that.Then the last component is actually the table. So, depending on the system, the table would be, say a MongoDB collection, an Elasticsearch type, a directory of log files on Hadoop or S3, a table in Oracle, things like that. An example here would be selecting the name from just a single MongoDB collection, so I’m doing SELECT name. Mongo happens to be what I named this specific Mongo cluster, so it could be MongoOne, it could be MyMongo, and there could be many MongoDB clusters connected to this system. The second component is the name of the Mongo database, so in this case it’s Yelp, and then the last component is the collection, so business. I’m selecting the name of the first business basically, in the dataset. We’re not the first anyone of them. SQL doesn’t guarantee anywhere. The second example here is I’m querying a JSON file. So, this is a file that has a JSON record on every row, and basically I’m selecting the name. It’s the same dataset. Actually, the dataset we’ll use in all these examples, as well as the demo, is a dataset from Yelp.Yelp has this thing called the Yelp Academic Dataset, which is a subset of Yelp’s data. 
So, it has a bunch of reviews and businesses and users and things like that, so you can actually see some interesting things in it. Namespaces and tables: this is basically what I was talking about. In HBase, for example, the workspace is a namespace. So, each of these systems already has some kind of namespace hierarchy, and we just take advantage of that. If you think about it, joining across data stores from a user's standpoint is actually very easy, because you have a single global namespace and all the data in all your different data stores is available in that namespace. All you have to do to join two datasets is refer to them. In this case, we're looking at one dataset, which is dfs.root, and then we're looking at a file that's /yelp/review.json. Then the other dataset here is mongo.yelp.business. So, we're just joining those two things as if we were joining two tables in Oracle or MySQL. In this case we're joining on the business ID.

Now, when we say dfs, it's a little bit misleading here. It could be any name, so I could have multiple HDFS clusters and multiple S3 buckets. Let's look at a more interesting example here. We're asking: which business has the most reviews on Yelp? Here we're doing a join between these two tables, and we're then aggregating by business ID and name, and ordering by the reviews. I mean, it's still a simple SQL query. It's an aggregation, and then a sort, but we are doing a join here between data in Mongo and data in HDFS. You can see that the business Mon Ami Gabi has 3,695 reviews. Now, one of the important things here is that a lot of data today is nested, whereas in relational databases, we always had flat records. That's not at all the case when we start talking about systems like Mongo and Elastic and Hadoop, where data has much more flexibility in terms of storing nested fields, and arrays, and things like that. So, the language naturally has capabilities … We've added capabilities here to deal with these types of structures.

The simple one is an array, so accessing the first element of an array is SELECT categories[0], and you have an example record on the right. The second example here is accessing a map, so I use dot notation, as you would expect, to access the map. If I want to flatten an array, I use the FLATTEN function, and basically that turns the array into multiple records. I'll show you a deeper example of that. KVGEN is … What we found early on is that a lot of people will actually encode values, or important data, in the keys of a JSON document, rather than only in the value after the colon, as you'd expect. A lot of times you're putting, say, a URL as the key in JSON, and you want to be able to take advantage of that data in your analysis. If you use the KVGEN function, it basically takes that type of structure and creates key-value pairs. You can combine that with FLATTEN, and then you've basically expanded a map into multiple records, each with a key and a value.

You can also access embedded JSON blobs. So, if you have a system like HBase, or you have a CSV file where one of the columns is actually JSON, and you want the system to know that that's JSON, so you can use all these functions and make it a first-class part of the record, you just have to call this function called CONVERT_FROM. It's kind of like casting, and from that point on the JSON is parsed by the system. It's no longer just a blob, and you can do all these types of things, such as flattening and accessing the inner structures of that JSON.
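Reconstructed from the description above (a sketch, not the exact queries on the slides; column names follow the Yelp Academic Dataset):

    -- Which business has the most reviews? A join between a JSON file on HDFS
    -- and a collection in MongoDB, with an aggregation and a sort on top.
    SELECT b.business_id, b.name, COUNT(*) AS reviews
    FROM dfs.root.`/yelp/review.json` r
    JOIN mongo.yelp.business b ON r.business_id = b.business_id
    GROUP BY b.business_id, b.name
    ORDER BY reviews DESC
    LIMIT 10;

    -- FLATTEN turns the categories array into one record per category,
    -- which can then be grouped and counted.
    SELECT category, COUNT(*) AS businesses
    FROM (SELECT FLATTEN(categories) AS category FROM mongo.yelp.business) c
    GROUP BY category
    ORDER BY businesses DESC;

    -- KVGEN + FLATTEN expand a map whose keys carry data (here check-in time slots)
    -- into key/value records.
    SELECT FLATTEN(KVGEN(checkin_info)) AS checkins
    FROM dfs.root.`/yelp/checkin.json`
    LIMIT 5;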
JSON is not the only format; there are other formats that we support here. This is an example of just accessing the first array element. FLATTEN is more interesting. Here, what we have on the left is … You'll see we're selecting the categories field, so categories is an array, and I'll show this in the example later, but it's an array, and most BI tools … If you think about something like Tableau, they don't really handle arrays. You can't do much with it.

Let's say you wanted to look at what the top categories are. You'd want to group by the single individual category, and so you would want to flatten the categories and maybe alias that as category from the businesses dataset. You use FLATTEN, and you can see that we now have these four different categories, which were actually two arrays. We've actually flattened them. There are just four different records, one for each category. Now, of course, you would have other fields in that same dataset. In this example, we have a name, and we have categories, and if we use FLATTEN, we can also select the other fields. We can also use * here, and those are just repeated. In this example here, you can see that the name Deforest Family Restaurant had two categories. Now, there are two records, one for each category, and the name is just repeated in both of them. If I did a group by now on category, I could count how many businesses there are per category.

One other point here is that the system exposes ODBC and JDBC interfaces, so basically any BI tool, any client application that talks SQL, should be able to work here. There's nothing special about it. You can also write custom applications using Python or whatever language, and be able to take advantage of these things. Let's move to the demo portion here, and hopefully this will work. No more browsing the web, doing emails. Okay, so, the first thing I'm gonna do is connect Tableau … I'm gonna use Tableau for this demo, and I'm gonna create … I have a cluster already running. It's on AWS, so it is … Do we not have Internet? It has to start like that. Let me just make sure I have Internet. This was working ten minutes ago, so let me … I'm gonna try to reboot this. This is the thing that always works in Windows, so let's hope this works. While we're waiting for this, let's … Are there any questions so far?
Audience #1:
I don't really get the interface communication between, like, Arrow and Elasticsearch. I mean, the internal memory representation in Elastic …
Tomer Shiran:
What's the connection between an Elasticsearch record and a record from a SQL standpoint?
Audience #1:
Yes.
Tomer Shiran:
Elasticsearch records are actually … They're actually JSON, right? And so, like I said, the system here assumes that … Or doesn't assume that anything is flat, so you should be able to just … Once you connect to Elasticsearch, each record in Elasticsearch is a JSON document, and you can query it just like you would query data in Mongo or anywhere else.
Audience #1:
You get one side where everything was, like, Arrow-friendly-
Tomer Shiran:
Oh, Arrow. I thought you said row. Okay, yeah, yeah, yeah. So, not all sources … Arrow is a new project, so these sources don't yet support Arrow, right? The performance will get better once all the sources in the world support Arrow, and there's ongoing work, for example now, to support Kudu and HBase and Cassandra and other systems. It's not a requirement, right? And in those cases, we still do as best we can to make it efficient, but there is that extra step of kind of serializing and deserializing. Now, one of the things that we do do for systems such as that is push downs of, say, aggregations, and we'll do them distributed. So, rather than going to, say, Elastic or Mongo and saying, hey, can you … Let's say there's an aggregation, and it's on a field that's in Elastic or Mongo. We don't push down the entire aggregation and say, "Mongo, why don't you do that entire aggregation?" because that basically means that all the data comes back to a single node, right? Let's say the next step is to join that with a big dataset in Hadoop. We don't want all that data going to one place. We'd rather keep it distributed, so instead what we do is push down the aggregation to all the individual shards on that system, and then do the aggregation at the Drill level, which is a distributed aggregation. It'll be a hash aggregation, and it'll be much, much faster.
Audience #1:
Every drillbit running on every node has it …
Tomer Shiran:
Yes, yes. So, in the planning phase, when the planner starts planning what to schedule on every node, it will decide which piece of work to give to every node, and that'll be based on the shards, right? So, there's knowledge of that, whether it's Elasticsearch, or Mongo, or whatever the system is. Yeah. So, this might not work. Alright, I don't have a network here.
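To illustrate the pushdown pattern described above, here is a hedged sketch (table names follow the Yelp examples from earlier; whether a given plugin can split the aggregation this way depends on the plugin):

    -- The inner GROUP BY can be executed as partial aggregations against the individual
    -- Elasticsearch shards, with Drill finishing the work as a distributed hash
    -- aggregation before the join against the large file in HDFS.
    SELECT b.name, r.review_cnt
    FROM (
      SELECT business_id, COUNT(*) AS review_cnt
      FROM elastic.academic.review
      GROUP BY business_id
    ) r
    JOIN dfs.root.`/yelp/business.json` b ON r.business_id = b.business_id;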
Audience #2:
Connect.
Tomer Shiran:
What's that?
Audience #2:
Press the connect button-
Tomer Shiran:
Yeah, I just pressed it. Let's see. Although, before, I was connected and it wouldn't work, so I'm suspecting it just doesn't want to work today. Okay, I guess we're not gonna … I wonder if there is a … What is that? This should work. Okay, I don't know. I'll just take more questions.
Audience #3:
What's the standard interface for, like, queries? What's the standard interface?
Tomer Shiran:
What do you mean, like, in terms of BI tools?
Audience #3:
Yes.
Tomer Shiran:
Yeah, a lot of people use Tableau, for example.
Audience #3:
For developing in SQL, Tableau wouldn't be ideal, like, if you're-
Tomer Shiran:
If you are a SQL developer, you mean? Oh, you can … I mean, there's a command line. A lot of people just use the command line, but I've heard of people using Aqua Data Studio, and Toad. Yeah, I mean, there's a lot of different options there. I actually have a piece of this recorded, so I'll show you this here, but I'm not sure what the DNS issue is. So, this is a cluster on AWS that … You should see this, right? So, this is a cluster on AWS that I'm connecting to, and if I click on Tableau, it'll show me all the different schemas here, so I can see I have an Elasticsearch cluster. I actually have plenty of databases in … There's a few Elasticsearch clusters here, but the Yelp dataset is loaded into elastic.academic, and if you click on the [inaudible 00:29:45] here for that, you'll see the business table. If you drag the users … This is all the users in Yelp that I'm gonna drag here, and basically, we can do an analysis of users on Yelp by when they joined the service.

If you drag the … I wonder if it was the network cable that I hooked up here. I've gotten blue screens before. This is a new laptop. I've gotten blue screens connecting this to, like, a Wi-Fi network. Just random things. So, you drag the Yelp … And this is actually data in Elasticsearch, right? So, I wouldn't otherwise be able to do anything with Tableau and Elastic, but this basically puts a SQL interface on top of that, and all this gets pushed down into the source. So, I can see here that in recent months, fewer and fewer people have actually joined Yelp. And you can use all the Tableau capabilities, so in this case, I can do a percent difference. I can look at a heat map and say, okay, fewer and fewer people are joining this service. It's all red in the last five months of the dataset.

Here, we're looking at the average number of stars that a user gives to a business on Yelp, for people that are registered on Yelp. You can actually see that the average number of stars has gone down over time. If you change the axis here, it becomes more clear if we don't start at zero. So, we just go from 3.7 to 3.9, and then you can see that over time, people have become less friendly, I guess. More critical. My theory is that the people who joined Yelp later did it because they were angry at some business, versus the early guys who were just curious: "What is this new service?"

Then, we can do something more interesting. I'll show you what a push down looks like into Elastic. Here, we're doing … SQL has this thing called a CONTAINS function, which is how, if you looked at Oracle or SQL Server, you do free text search. So here we're doing SELECT * from the review dataset, and we're going to add WHERE CONTAINS and basically provide the search query. So, WHERE CONTAINS, and then we're gonna look for all the reviews that have the text "bugs". What this is doing, underneath the hood, is the free text search is actually happening in the Elasticsearch cluster. So, we have that dataset. We can rename that and call it "review bugs". I can join that with the businesses dataset, because the reviews don't have the actual name of the business; they have a business ID that you need to join with the businesses dataset. But, if I remember correctly … Like I said, I can join that here and just select the join key.
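A hedged sketch of what that query might look like (the Elasticsearch plugin was in alpha at the time, so the exact CONTAINS signature may differ; a two-argument CONTAINS(field, 'query') form is assumed here):

    -- The free-text search is pushed down and executed inside the Elasticsearch cluster;
    -- only the matching review documents come back to Drill.
    SELECT business_id, stars, `text`
    FROM elastic.academic.review
    WHERE CONTAINS(`text`, 'bugs');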
I can also do a join across different data sources. If you remove the business dataset from Elastic: I have the same data, in this case, loaded into a MongoDB cluster. So, we go down here and we select mongo.academic … And so now what we're doing is a join across a search being pushed down into Elastic and a collection in MongoDB. And then what we're trying to find is: what business has the most reviews with the word "bugs" in it? So, we're looking for … It's probably not a good sign.
Audience #4:
That join goes into it …
Tomer Shiran:
Yeah, so the … Some of it is being pushed down at the source, but then the data comes into Drill, into the parallel, distributed Drill cluster, and the join happens there. Yeah. And so you can see here that the Riviera Hotel and Casino has the most reviews with "bugs" in it, and you can look at the underlying dataset and actually see all sorts of people writing about bed bugs and things like that. My plan was to actually show a different word here that has more data, like, I was gonna use "gross", and it's actually a different … Because it's the Flamingo Hotel, but it's not that different, so yeah, and then … That's just a short example of what this does. I'll try one more time, and we'll see if removing that network cable caused the DNS problems. No, I don't know. I give up. Any other questions?
Audience #5:
Would you connect to HBase directly or via Phoenix?
Tomer Shiran:
That's a good question. So, with HBase, do we go directly or with Phoenix? We support HBase directly, and we're actually working on something called Drillix, which is an integration between Drill and Phoenix. It's actually something that we're working with the guys at Salesforce on. They wrote the Phoenix project, and so we've had a number of hackathons with them, and basically … That's actually working now. If you're interested, that's kind of in an alpha stage. There's still some more productionization of that, but there's information online.
Audience #5:
Okay. What would be the benefits of going through Phoenix, from your perspective?
Tomer Shiran:
Yeah. What are the benefits of going through Phoenix? Phoenix actually has secondary indexes on top of HBase, and so, by going through that, we can actually take advantage of those secondary indexes.
Audience #6:
Why wasn't, like, Redshift or Oracle or SQL Server, things like that-
Tomer Shiran:
Yeah, things like Redshift, SQL Server … So, what I found is that every single company I talked to has relational databases too, and yes, so you can actually … There's a JDBC plugin, which allows you to connect to any source that supports JDBC. That includes Redshift and Oracle and those … Yeah.
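A minimal sketch of what that looks like, assuming a JDBC storage plugin configured under the name rdbms that points at, say, a Redshift or Oracle schema (the plugin, schema, and table names here are made up for illustration):

    -- A relational source queried through the JDBC plugin, joined with a file on HDFS.
    SELECT o.order_id, o.total, c.campaign
    FROM rdbms.sales.orders o
    JOIN dfs.root.`/events/clicks.json` c ON o.order_id = c.order_id
    LIMIT 10;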
Audience #6:
Any idea to replace ODBC with some other DB? Get rid of all the constraints of ODBC?
Tomer Shiran:
Which constraints?
Audience #6:
Let me say stability, performance, number of connections, asynchronous/synchronous, let's not dive into that. But I just wanted to know whether you have any ideas?
Tomer Shiran:
I think the … So, I'm not an expert on ODBC and stuff like that. For the ODBC driver here, we've partnered with Simba, which is the company that basically invented the ODBC spec with Microsoft. They built it. It's supposed to be very good. I think a lot of the bad perceptions around ODBC are related to the implementations of ODBC, and not the actual spec of ODBC. So, I guess you should test it out and see if those problems exist. But I know it has a bad rep. There's also a JDBC driver that's open-source that you can use.
Audience #7:
You talk about Drill cluster. What are the requirements for the cluster?
Tomer Shiran:
What are the requirements for the cluster? You can run it on a laptop, but obviously, that's not the main … the production use case, right? The main thing is memory. The more memory you have, the better it's gonna be. There's never enough memory. Drill is implemented in Java, but it does not use any of the Java memory facilities. It doesn't use the heap; it's all an off-heap custom allocator. The more memory, the … So, ideally, you want the working set for a query to be in-memory. Drill will handle it if it doesn't fit; it will spill to disk, but then you get much slower. So, the more memory you can throw at it, the better.
Audience #8:
How do you handle semantic differences between this amazing stuff set out here? So, if you connected to run the same query against, say, Redshift, something else … Totally different semantics. How do you deal with that?
Tomer Shiran:
How do we handle different semantics between systems? What do you mean by … What kind of differences are you talking about?
Audience #8:
You get different results. You write the same query, you get different results with the same dataset. That's kind of expected, so-
Tomer Shiran:
You mean with, like, relational databases?
Audience #8:
With relational databases. It shouldn't be the case, but it is.
Tomer Shiran:
It shouldn't be the case, right? I guess it'll depend on what got pushed down in that case, right? One of the things we are very cognizant of is that these non-relational systems that have some execution capabilities, things like MongoDB, for example, actually have, in many cases, quite different semantics from what a relational database would do. The differences there are not small differences; they're big differences, right? Especially the handling of casting and implicit casts and things like that. So, the thing that we guarantee is that you're gonna get the correct SQL result, right? That may mean that we end up pushing less down in some specific cases, but at the end of the day, correctness is more important, and so we ensure that you're getting the result that you'd expect of a relational system, even if these underlying systems, in some cases, would return different results.
Audience #9:
what processing? There's no point in doing that then
Tomer Shiran:
I heard the first part, which was, do we have diagnostics, but not the second part.
Audience #9:
The second part is then, do you have any kind of capability?
Tomer Shiran:
Yeah, so there's all sorts of diagnostics. There's actually … I wish I could have shown you that, but there's a … It shows you a nice kind of graphical DAG, basically, with all the different operations, so you can kind of [inaudible 00:39:42] the scan and the parallelism at each of these layers, and you can understand really detailed metrics around each … There's also the parallelism, right? So, typically, you have the operator graph, but you also have the parallelism on each one, and you can drill down to each of those and see, okay, is there any one that took more time than the others? And kind of drill down into why that was.
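Alongside the web UI query profile, Drill also has an EXPLAIN statement that prints the plan as text; a minimal sketch, reusing the Yelp example from earlier:

    -- Prints the query plan (the same operator DAG the profile view visualizes).
    EXPLAIN PLAN FOR
    SELECT name FROM mongo.yelp.business WHERE stars >= 4.5;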
Audience #10:
To understand how Arrow…-
Tomer Shiran:
Mm-hmm (affirmative).
Audience #10:
Does Arrow replace that part of Drill, or-
Tomer Shiran:
Yeah, yeah. So, basically, Arrow is … The internal … Some of the internal components of Drill are replaced by Arrow, as are some of the internal components of these other projects, like Spark and Kudu and Impala. You can already see … Cloudera is also doing a lot of work … They're working with us closely on the Arrow side of things. Wes McKinney, who's also the creator of Pandas, is now an engineer there. He basically created the … Or is working on the C implementation and the Python implementation. That's really nice, because now you get the ability to utilize, for example, Parquet files in all these different languages, and embed code from, say, Python into Drill queries as UDFs, or take the results of a Drill query and load that as a Pandas DataFrame with no serialization and deserialization.
Audience #11:
Are you finding any resistance from any of the vendors?
Tomer Shiran:
With Arrow?
Audience #11:
Yeah, with them being competitors in some areas, or having some similar functionalities-
Tomer Shiran:
No, I think the reality is that none of the big data open-source technologies that you're familiar with do columnar in-memory processing, and so Arrow actually brings a tremendous speed-up … Even ignoring all the integration that this enables between systems, just the fact that their queries will run much faster is a benefit. And so, what happened is that Drill had some of these capabilities internally, and then we were approached by folks working on Spark and Impala saying that they wanted to add columnar in-memory processing, and so … The nice thing about all these things is that they're all under the Apache Software Foundation, right? It's much easier then to share code between systems that collaborate. So, basically, we all came together and created a new Apache project, and already you see a lot of work coming from Dremio, as well as Cloudera, and some guys in the R community. This is just in the first month, and they're already contributing a ton of code. It's pretty cool.
Audience #12:
Can you elaborate on how the query optimizer works? How does it decide what to push down or not push down-
Tomer Shiran:
Yeah.
Audience #12:
Yeah, and if you have joins where you need to return the results of the table, the results might get pretty large. How do you handle that?
Tomer Shiran:
Like fact table joins, you mean?
Audience #12:
Yes.
Tomer Shiran:
There's an optimizer here, and it's both … There's a rule-based component to it, and then there's a full cost-based optimizer as well, and so it's making decisions based on those costs, depending on the nature of the query and the level of statistics that we have about the data. We can have more or less, right? In some cases, we only have the number of records in the table. In other cases, say with Parquet, we have more metadata as part of the Parquet file. So, the optimizer makes a decision based on that, and in some cases it'll decide not to push something down to the source system, even though it's possible to push it down, just because we know that we can run it much faster at our layer. Yeah, I mean, there are people that have done enormous fact table joins here and produced trillions of records.
Audience #12:
Just to extend on a question earlier, in which cases would you use Arrow, and in which would you use Drill?
Tomer Shiran:
When would we use Drill and when would we use Arrow? Arrow is an internal technology. It's a specification of a columnar memory format, and then a set of implementations in different languages: Java, C, Python, etc. Drill uses Arrow internally. It's not … Well, maybe that's not entirely accurate, but in general, Arrow is not intended for an end user to use. It's more intended for the developers of all these systems. Now, there are some exceptions. Python and R now support a file format called Feather that's basically a dump of Arrow to disk, so in some cases a user does interact directly with it.

My understanding is they both want to do this by the end of the year, but I don't know specific version numbers. There's support for Python, R, and Drill, and then the other ones that are coming this year that I know of are, I think, Kudu, Impala, and Spark. I don't know if I forgot any. Maybe Cassandra. I think those are the ones.
Audience #14:
Is this code, so can I [inaudible 00:45:19] the code-
Tomer Shiran:
Yeah. These are all Apache projects, so it's all done in the … If you've ever interacted with any of these Apache projects, there's kind of a process to submit code to them, but yeah, contributions are welcome on any of them. And they all have multiple … Both of these projects are not … I mean, some Apache projects are largely driven by a single vendor. These are actually projects that have more than one company driving them kind of full-time. One more?
Audience #15:
In Java, it is difficult to have control over the memory layout [inaudible 00:45:56] do you have to use, like, an API or stuff like that?
Tomer Shiran:
In Java, how do we control the memory allocation? There's actually … There's kind of a hierarchical memory allocator based on malloc. So, yeah, it doesn't use … There are no Java objects being created. It's all off-heap; otherwise, it'd be a disaster in terms of performance. There are a lot of other technical aspects here. Like, when you actually run a query, it's doing runtime code generation and compiling the query on the fly as it's running. So, there's all sorts of capabilities around that. It's not interpreting the … Yeah. One more?
Audience #16:
Do you have any experience regarding the number of parallel users that are using this, for example? Is it more, like, is it more corporate?
Tomer Shiran:
Oh, parallel users?
Audience #16:
Or can I use it, for example, to put [inaudible 00:46:53]-
Tomer Shiran:
Yeah.
Audience #16:
Accessing different databases … Are there any limitations?
Tomer Shiran:
Yeah, what's the kind of concurrency? The concurrency is not designed … It's not designed for, like, transactional workloads where you have hundreds or thousands of queries per second. It's really designed for tens of concurrent queries running, more like an MPP database, as opposed to a transactional, OLTP database. Cool. Thank you.