Kelly Stirman: Thank you. Good morning, everyone. Thanks for joining. Today we're going to talk about the idea of making BI work on a data lake. I work for a company called Dremio. We're going to talk about some of the concepts and patterns that we see for companies embarking on this journey to make the data lake a key part of their analytic strategy. We're going to talk about a number of different technologies, including a new and very interesting open source project called Apache Arrow. We'll also take a look at an example of a technology that's working to make BI on the data lake a reality, and that's an open source project called Dremio. We'll actually spend a fair bit of time looking at that technology at work, because I think that's frankly a lot more interesting than seeing slides.

Finally, I'll just encourage you to ask questions along the way. We'll do our best to get to those later in the session. It makes the conversation a lot more fun for folks if you can ask some questions along the way.

Just briefly about me: my name is Kelly Stirman, and I'm the VP of Strategy here at Dremio. I'm a technologist at heart and started my career as a DBA working in data warehousing and BI in the 90s. Over the years I progressed from the traditional relational world, when I worked at Oracle, to a number of different innovative technology vendors providing alternatives to the traditional relational database, with MarkLogic and MongoDB. MongoDB of course IPOed last year and has been a really great success story, and MarkLogic has been a great success story as well. Now I work at a company that's doing something really interesting in the area of analytics and the concept of self-service data, and you'll get to see that at work. The point is that I've been working with companies and technologies on the cutting edge of analytics for most of the past 20 years.

Just to give us a backdrop to the conversation today, pretty much every company has some initiative, maybe multiple initiatives, in the area of the enterprise data warehouse. For many companies, these are projects that have been underway for more than a decade, in some cases multiple decades. Over the past five-plus years, companies have really started to reconsider some of the newer alternatives available for re-platforming their analytics.

At a really high level, that means taking the workload we've been doing with ETL tools, the enterprise data warehouse, and all the different BI tools we use, from traditional products like Cognos, Business Objects, and MicroStrategy to some of the newer technologies like Tableau, Power BI, and Qlik. The tried and true approach of building a comprehensive data warehouse, building domain- or topic-specific data marts, and moving the data throughout those different environments using ETL is a pretty familiar concept to most folks. The idea of re-platforming is, "Hey, there are some new, interesting products out there that we are calling the data lake that give us some real advantages in terms of how we think about running our analytics."

The fact of the matter is that the complexity of data, the scale of data, and the speed of data movement is so great now that it's very challenging to embrace the opportunities presented by data using the traditional tools that we've been using for 30-plus years. Is there a new way?
Are there new technologies that make that easier for us, that make it more cost-effective, and that give us more flexibility and agility in our business to deal with the opportunities presented by data? That's what technologies like Hadoop are about, along with a number of different data lake cloud services; if you look at cloud vendors like Amazon or Microsoft, there are services that give you these kinds of capabilities, like Azure Data Lake Store.

Everyone, I think, is looking at this, and either it's something you're thinking about or you may be a few years into a data lake re-platforming exercise. This is really what I want to talk about: what are some of the challenges along the way, and what are some of the opportunities to really capture and capitalize on what the data lake provides? There are lots of things companies talk about around these data lake initiatives, but to me the biggest challenge of all is that BI users are getting left behind. That is to say, the data lake makes it easy for us to load data. It's not nearly as much effort as a traditional enterprise data warehouse; I can load data in its raw form without going through exhaustive ETL to get it into the data lake.

Number two, it's much, much easier to scale. I can easily go from one terabyte to one petabyte if I want, and yes, it will be more expensive, but the system is designed to make that very easy to do. It lets me deal with variability in my data in a much easier way, because the fundamental storage medium of the data lake is a file system, not a rigorous relational database. For all those benefits, when it comes to actually accessing and analyzing the data, the data lake is missing the kind of performance and integrity that I have come to expect from my enterprise data warehouse when using one of those BI tools we talked about before. As a result, what companies end up doing is moving data from their data lake back into a data warehouse or a data mart to get the kind of performance that the BI tool needs.
As that starts to happen, it creates a situation where the BI users become very, very dependent on IT, because anytime they want access to a piece of data, they have to ask IT to put it somewhere they can access it. They may not even know how to find it. That nice era of self-service BI in the 2000s, when people could do things themselves instead of waiting on IT to build cubes for them, is gone; we're back in the 90s, where any time a BI user needs something new, they have to go to IT to do it for them. The data lake, because it's fundamentally a much higher latency environment than a traditional enterprise data warehouse, also means that you can't really accomplish the kind of interactive analytics that are so essential to most analytical jobs.

Somebody asks a question, they start to explore an answer, and that leads to the next question, which leads to the next question, which leads to the next question. If each of those steps along the way takes many minutes, it interrupts that chain of thought and impedes the analyst's ability to do their job. We also find that in the data lake you have lots of data structures that don't really work with BI tools. Essentially, every BI tool assumes the data is in a relational format, and when it's not, the tools either don't work at all or just don't work very well. When you look at the different places people have data in their data lakes and in some of the other newer systems like NoSQL databases, you see a lot of formats like JSON, Parquet, Avro, and ORC that are fundamentally incompatible with the world of BI, which again means IT has to move that data into a structure and format that is suitable for the BI tools.

Finally, the people responsible for these environments are busy. They have lots of things they're working on, so when a new request comes up, you're basically standing in a data breadline hoping your number comes up some time soon, but there's a whole bunch of people in line ahead of you and the people who can fulfill your request are very, very busy putting out their own fires. This is the scenario we see over and over again with companies who have embarked on their data lake journey, but unfortunately the BI users are getting left behind. I like this quote from a CIO at a Tier 1 bank that I heard last year, talking about their data lake journey: "Somehow we lost the vision of self-service along the way." Here's how you might start to think about solving that situation in the data lake with different technologies that are out there today.

Just to put some icons and pictures together to make this conversation a little bit easier: I have a data lake, and that might be on AWS based on S3, it might be on prem in a Hadoop cluster, or it might be on Azure and ADLS. Whether you're in the cloud with one of those providers or on prem, the story is the same. I have source systems where my operational applications are generating data that I then move into my data lake, and that might be coming from relational databases or from NoSQL systems. It doesn't really matter; most companies have a mix of technologies they're using to run their business, and that data arrives in the data lake in a raw form.

In its raw form it is somewhat available, with high latency, to more advanced users like data engineers and data scientists who use technologies like Python, R, Spark, and command-line SQL.
To the BI users, this data is simply unavailable, because those tools are incompatible with the underlying structures of the data and the latency is too high to really be useful. What companies start to do is say, "Okay, well, we have data in this raw format, so let's get a data prep tool." These are tools designed to help you get your data ready in a data lake; you can think of them at a high level as ETL, but designed specifically for the data lake. Once you start to marshal the data into these more refined forms, it begs the question: what do we even have in our data lake to begin with?

You get a catalog technology to help you inventory the raw and refined data sets so that you know what you have to work with, and then you start to chip away at the performance aspect of things by looking at a BI acceleration technology. There are a number of different products out there, but they basically take the data in your data lake and copy it into a format that is high performance and will let you deploy your BI workloads on the data lake. That doesn't solve everything. You also look at ad hoc acceleration of your queries, because while the cubing technologies of these BI acceleration products work for OLAP-type queries, they don't really work for ad hoc queries. You need a separate way to accelerate those workloads. Typically, that is a relational database that sits side by side with your data lake. That might be Redshift if you're on AWS.
It might be Teradata if you're on prem. At this point, once you've assembled these different tools and figured out how to integrate them, you begin to open up this data to your BI users, but you've created a pretty complicated scenario where the BI user needs to know, "Okay, I want to go to the catalog to find something, and once I find it, do I connect to one particular technology for OLAP workloads and a different technology for ad hoc workloads?" Then your data engineers and data scientists are doing some things in data prep and other things through other interfaces. I didn't draw all the arrows here, but it's a fairly complicated scenario that puts a lot of burden on the BI user to figure out, for a given task, the right sequence of tools to use.

By the way, all of these different technologies are proprietary, have their own licenses, and are not really integrated together, so it also puts a large burden on IT to make the whole thing work. Those are some of the options out there today for companies embarking on this data lake journey who are trying to provide access to the data in their data lake for their BI users. What Dremio is all about, just to give you an alternative approach and a way to think about this a little bit differently, is to say that there has to be a better way than backing up the delivery truck, dumping five or six proprietary technologies on IT, and hoping they can assemble it all themselves. We started the company a few years ago to pursue this vision of giving people what we call self-service data.

We said there are a few key things this technology would need to do to really transform the opportunity for data in the data lake, and frankly, probably outside the data lake as well. First of all, it needs to work with any data lake, whether you're in a Hadoop cluster or on AWS, Azure, or Google. Wherever you're running your data lake, whatever the underlying technology is, Dremio would need to work with it. It would also need to work with any BI or data science tool, because no company has a single BI technology; different departments have different products, and data scientists use different tools and platforms as well. It needs to work equally well with all of them. It would fundamentally need to solve the data acceleration challenge, because if the data is slow, it might as well not even be accessible in many cases for many workloads.
With Dremio, you’re able to deliver 10x to a 1000x acceleration of the data on the data lake to get the interactivity no matter what the underlying size of the data is. Next, you need a way for the users of the data to define their own semantics and their own meaning around the data. How often do you go look at data that’s provided by IT? The name of a column is something like C003, what the actual meaning of that column is this customer name and it’s up to you, you know what that column means and what it means to you in the business, you should be able to record that and manage that information and share it with your colleagues and define and manage a semantic layer in a self-service model. Next, it would need to make transformations and blends and aggregations and joins of different data sets available to a business user without making copies because copies of data is an enormous governance and security challenge for companies.It adds a lot of cost and overhead to IT so how do we deliver exactly the right version of data and representation of data for a particular need without making copies. It would need to be elastic because companies are growing in their use of data, so it needs to go from a few nodes to over a thousand nodes very easily. Finally, we think something like this has to be an open source technology. At a really high level, what Dremio is is a product that sits in your data lake that allows your data scientist and data engineers as well as your BI users to access all the data in your data lake. It allows for acceleration of the data to get the interactive speed of access that your users need. It’s delivered in a self-service model so that users can do things for themselves instead of being dependent on IT.It provides a rich semantic layer that the business can create and manage themselves. It provides for data curation capabilities and it automatically tracks data lineage through all of these different analytical processes and all the different tools. It’s really a very different approach to assembling five, six, seven different technologies together that are all proprietary instead it’s one open source platform that integrates these capabilities into one solution. That’s just a quick overview. We’ll take a look at how Dremio works in just a moment but one of the things that’s important in this bigger journey and this is not specific to Dremio but it is something that is a key part of our architecture is an open source project called Apache Arrow.
I want to spend just a couple of minutes on what Apache Arrow is, to give you a better sense of a project that has become incredibly important in data science and analytics and has grown dramatically in the past year in its adoption by different projects and different companies. Let's take a look at Apache Arrow. We have an age-old problem in the world of analytics of data being in lots of different formats and lots of different systems, number one: a mix of relational databases, Hadoop, NoSQL, cloud services like S3. It's all over the place. But from companies like Teradata and Vertica in the early 2000s, we learned that columnar data structures provide a massive advantage in terms of the efficiency of analytical processing. A traditional relational database is row oriented, but analytical databases like Vertica and Teradata are columnar in nature.

The advantage can be orders of magnitude improvements in the efficiency of running analytical queries. There are a lot of reasons for that, and we'll look at a few in just a moment, but we know that columnar is the right way to organize data for analytics. We also know that in recent years RAM has become dramatically less expensive, and the performance difference between doing in-memory data processing and frequently going to disk subsystems is multiple orders of magnitude. That's independent of the advantages of columnar. If we can keep the data in memory, we can really speed up analytical processing. In more recent years, there has been an embrace of GPUs for analytical workloads. The reason is that a GPU can have effectively about a thousand times as many cores to process analytical workloads.

When an analytical job can be broken down into lots of smaller jobs that run in parallel, which is very frequently the case with analytics, GPUs can give you enormous advantages in performance. GPUs have become very popular in data science workloads and increasingly in some innovative GPU databases that let you do analytics much more quickly. You can combine in-memory, GPU, and columnar; those three are all independent of one another, but they go together very nicely and build on each other's advantages. Then finally we have distributed capabilities today. If you want to take advantage of resources in the cloud, you don't buy larger and larger servers, and you don't provision larger and larger instances; you provision more instances. It's this concept of being able to provision effectively thousands of instances as you like that gives you unlimited scalability.
If you’re able to use those resources as a pool and use software to make that pool of resources appear as a single resource, you can effectively have a server as large as you like. Combine columnar with in-memory and GPU and distributed processing as a way to solve the challenge of accessing data from all of these different environments without first copying the data into one central repository. That’s what Apache Arrow is all about. If you look at the traditional way of doing analytics, you have different processing environments so things like Apache Spark, Python for data science workloads, SQL engines like Impala and Hadoop or Dremio that need to access data from different storage systems, be those Parquet files in a file system databases like Cassandra, an age-based relational database, as NoSQL database et cetera, et cetera.As a processing environment accesses data from one of these sources, it has to copy the data that is made available from that source in memory into a representation that it understands and then begin to work on the data in its own in-memory environment. For example, if I’m using Python to build a machine learning model and I want to pull data into that model and iterate on my model. I’m going to first connect to the source of that data, number one. Number two, that source will read the data into memory. Number three, my Python application will make a copy of that data and marshal it into its own memory representation and along the way it’s going to serialize and de-serialize the data a few times. It turns out that that process of copying and serializing de-serializing can be 70 to 80% of the CPU overhead in these jobs.
Basically, every one of these points in the picture has to reinvent the wheel itself. The analogy I like to use is that once upon a time, when you went to Europe on vacation, you were going to do five countries in seven days, and when you got to the next country you were going to wait in line at passport control and then convert your currency from French francs to Swiss francs or what have you. You knew you were going to lose a few hours at the border of each country and lose money in the conversions. That is the world without Arrow, whereas the world with Arrow is being in Europe today, where there is no passport control between the countries and there's one currency you can use everywhere, so you just go. It's a huge speed-up.

Arrow is a standard that all of these different environments can agree upon. It eliminates the copying of data, because everyone can share the same in-memory buffers, and it eliminates the serialization and deserialization of the data, which can be a huge performance advantage. A little over a year ago, IBM committed code to the Spark project adding Arrow support for PySpark, which improves the performance of running those Python jobs via Spark by 55x. Just by moving to Arrow, a 55x performance advantage. That's a little bit about the concept of Arrow: it's a standard that everyone can agree upon, and it's become very, very popular. These are some of the projects using it today: a mix of machine learning platforms, GPU databases, the Python pandas library.

Of course, Dremio uses it. It's grown by about 40x in the past year to, last time I looked, a little over 100,000 downloads a month. What's interesting is that this standard is becoming more and more pervasive across different projects with the emergence of something called Arrow kernels, which are libraries available with Arrow that do optimized low-level things like sorting the data and finding distinct values, so that instead of everyone having to invent those libraries themselves, there's one standard everyone can agree on. It also lets hardware manufacturers like Intel and Nvidia provide hardware-optimized versions of those libraries, so if you happen to be running on a GPU or on a specific Intel chipset, there are advantages to be had that are specific to that environment.
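As an illustration of those kernels, here is a small, hedged sketch using the pyarrow.compute module, which exposes standard low-level operations such as sums, distinct values, and sorting over columnar Arrow data. The sample table and its values are invented for the example.

```python
# Hedged sketch of Arrow compute kernels via pyarrow; sample data is made up.
import pyarrow as pa
import pyarrow.compute as pc

trips = pa.table({
    "passenger_count": pa.array([1, 2, 1, 3, 1], type=pa.int32()),
    "tip_amount": pa.array([0.57, 1.25, 2.10, 0.00, 1.69]),
})

print(pc.sum(trips["tip_amount"]))           # aggregate without row-by-row loops
print(pc.unique(trips["passenger_count"]))   # distinct values
print(pc.sort_indices(trips["tip_amount"]))  # sorting as a reusable kernel
```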
Lots of interesting and exciting things are going on around Arrow. Just briefly, Arrow is a project that Dremio started a couple of years ago. We lead the project, and it's core to our engine and core to our platform. If you think about Dremio as a car, Arrow is the engine, and we built the rest of the car around the Arrow project. Okay, that's lots of me talking. I thought it would be fun to take a look at an example of this at work. Let me switch over to my browser here. I'm now logged into Dremio through a standard browser, and I thought we could look at a couple of examples together. Let me orient you to what you're looking at. First of all, this particular environment is a Dremio cluster, a small four-node cluster running in Google Cloud. That just happens to be where it's running.

It could be running in your data center. It could be running in your Hadoop cluster. It could be running on AWS or Azure; it doesn't really matter. Dremio is software that you run and manage yourself; it's not a cloud service. In this environment I have connected several different sources. I've connected my data lake, of course, but I've also connected some other sources that aren't in my data lake, like Elasticsearch and MongoDB, and some relational databases that are outside of the data lake, Oracle and Postgres, and even, just for fun, I've connected to Redshift over on Amazon's cloud. I'm actually able to reach across clouds from Dremio to access data. The point I want to make here is that if you're building a data lake, Dremio fits very naturally in that data lake and lets you analyze and work with data you've already moved into the data lake, but you may also have data in other sources that isn't yet in your data lake, and Dremio can connect to and work with those as well.

What we have above are what in Dremio we call Spaces. This is the self-service semantic layer where I can design a representation of data that makes sense to the business. The business can name columns and data sets whatever they like and describe them however they like, and all that information is automatically captured in Dremio's catalog, which is searchable and makes it easy for users to find and share different data sets across different tools and different jobs. In addition, above the spaces, every user has what we call a home space, which lets a user upload their own data, for example an Excel spreadsheet, and then join that spreadsheet to enterprise data sources without having IT be involved. That's a very, very handy feature for folks.

Let's go through an example. Let's say that I'm a Tableau user and I've been assigned the job of analyzing taxi rides in New York City to explore the impact of services like Lyft and Uber on taxi rides in New York. I'm not an expert in taxis, though I've been in a lot of taxis in New York. Let's say I've been assigned that job and I want to begin to work on it, start to ask questions, and use Tableau to understand, visualize, and make sense of that data set. The first thing I need to do is connect to the data from Tableau, but I don't actually know where the data is. I haven't been told by IT yet, and I want to get started right away. One of the nice things in Dremio is that when we connect to these different data sources, we automatically discover schema and build that into our catalog.
That catalog is indexed and searchable. I could, for example, go into the search box up here at the top and start typing "trips" and get back a set of search results, where the results correspond to different data sets that Dremio knows about. I'm going to pick the first one here, open it, and jump right into a sample of this data set. I didn't have to wait for IT to send me an email telling me where the data was. I didn't have to figure out some way to connect to it to inspect it and see if it's what I was looking for. In a single click I'm now in a sample of that data set and able to look at it to see whether it's what I'm looking for. What you're seeing here looks familiar, something like Microsoft Excel, where I'm able to visually preview and get a sense for this data set.

In this data, each row in the table corresponds to a taxi ride in New York City. If we go over the columns here, you can see there's a pickup and drop-off date time, and this little icon says that it's a date data type. There's the number of passengers in the taxi ride, and that little pound sign tells you it's an integer. There's the distance, in miles in this case, and the pound-dot-pound tells you it's a float data type. There are longitude and latitude for the trips. If I scroll over I can see a breakdown of the fare: the total for that ride, how much was paid in tolls, how much was paid in tip, the tax, any surcharges, and the underlying fare that was calculated based on the distance of the trip. That's the data set I found based on my search, and as a Tableau user, there are two things I could do here.

First, I can say this is exactly what I'm looking for and I'm ready to start analyzing it with Tableau. Second, I could say, you know what, this isn't exactly the data I want; I want to do some work on the data before I begin my analysis, and that's something I want to do myself. We'll look at both of those scenarios. First, let's start by saying, "Hey, I'm ready. This is it. This is the data I want. Let's start to analyze it with Tableau." Dremio supports any BI tool, and there are a few tools where we have more advanced integrations and can basically launch the tool connected to a data set. If I click this Tableau button, it will set up a connection from Tableau to Dremio using standard ODBC and allow me to log in to access this data using my LDAP credentials.

I can come in here and enter my username and password from Tableau, and let's just see what I have in terms of number of records to start. I drag this up to the Tableau shelf, and I can see this is about a billion rows of data. It's not a massive data set, but it's not trivial, and it's certainly larger than I would have on my local laptop or workstation. Let's start to work with the data and see what we have. I can take those drop-off date times and change this visualization to be a little bit easier to see. I can see now that there are about 160 to 170 million rides per year between 2009 and 2015, and there's not a whole lot of data in 2015, so it looks like I just have a partial data set for that year. I can look at the total amount that people are paying in these taxi rides, and that's the sum, which is just going to be proportional to the number of rides.

Let's look instead at the average total paid per taxi ride. I can see that the number of rides year over year is relatively flat, but it looks like people are paying more since 2009.
It looks like the trend has been going up, and even though 2015 is a partial year, it's significantly higher. Let's see what effect tipping has on that. Again, I'll change this from sum to an average, and let me change the color to make the differences a little easier to see. The low end of 57 cents is red and the high end, about 3x that at a dollar sixty-nine, is blue, and you can quickly see that it looks like people are tipping more, and maybe tipping is the main component driving up the average amount people are paying per fare. Each one of these clicks is, in the background, a live SQL query back to Dremio, and Dremio is running that query in the data lake on the CSV files that are sitting in HDFS in this example.
In the background, the data is in its raw form: a few thousand CSV files representing, I think, about half a terabyte of data. Instead of moving those files into an enterprise data warehouse to get the access and speed I need here, instead of building a cube, and instead of building an extract for Tableau or whatever BI tool I'm using, I'm able to query the data directly and get the performance that I want. As you can see, all these clicks are coming back in about a second. I'm getting the performance I need to do my job using my favorite BI tool, but I'm able to use the data in the data lake and leverage the data lake infrastructure to get that performance, which is exactly what I think most companies would like to do: take advantage of the flexibility and scalability of their data lake and that elastic infrastructure, but get the performance they're used to getting out of their enterprise data warehouse or cubing technologies or other things.
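To make the "every click is a SQL query" point concrete, here is a hedged sketch of the kind of aggregate query a BI tool might generate and send over ODBC. The DSN name, credentials, dataset path, and column names are hypothetical placeholders rather than anything shown in the demo; any client that speaks ODBC or JDBC and generates SQL could issue something similar.

```python
# Hedged sketch only: DSN, credentials, dataset path, and column names are
# hypothetical. This mimics the sort of aggregate query a BI tool click
# generates and sends to Dremio over ODBC.
import pyodbc

conn = pyodbc.connect("DSN=Dremio;UID=kelly;PWD=changeme", autocommit=True)
cur = conn.cursor()

cur.execute("""
    SELECT EXTRACT(YEAR FROM dropoff_datetime) AS ride_year,
           COUNT(*)          AS rides,
           AVG(total_amount) AS avg_total,
           AVG(tip_amount)   AS avg_tip
    FROM   "NYC Taxi".trips
    GROUP  BY EXTRACT(YEAR FROM dropoff_datetime)
    ORDER  BY ride_year
""")

for ride_year, rides, avg_total, avg_tip in cur.fetchall():
    print(ride_year, rides, avg_total, avg_tip)
```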
That’s using that first scenario that we talked through which is I came into Dremio, I searched for a data set, I quickly found what I was looking for and then I was able to connect to it with a single click using my favorite BI tool and have a great experience with the data in terms of how fast each of my queries was and getting all the features I’m used to being able to use in whatever tool is my favorite tool for performing analysis. That’s the first thing. The second thing is the second scenario we talk about is well, what if the data isn’t exactly in the shape or organized the way I need for the work that I’m going to do? In this scenario typically a BI user would go back to IT and say, “Hey, I need data that meets the following requirements. Can you put that together for me?”Then maybe weeks or months later IT would come back and say, “Hey, what do you think about this?” What I think users really want is to be more self-service in terms of the data and how they access it, how they blend it, transform it without being so dependent on IT. Let’s take a look at an example of that. Here I’m going to go and let’s create a new space. I’ll just call this new space and I’ll pin this to the top of my list here so now I have a new space and there’s nothing in it. I’m going to go into one of the sources here. I’m going to reach into this Postgres database. It’s not actually in my data lake and go look at this HR data set. This purple icons tell me that I’m connected to a physical source at this particular source that I’m connected to.Open the employees table and if I look I can see first name, last name, email address, phone number, hire date. Let’s do a couple of quick things here. Let’s say I want to focus on just the senior employees in my company, not everybody but just the senior employees. I can look at this hire date column and there’s two menus at the top of each column. The menu on the left lets you convert between different data types. The menu on the right lets you transform the data based on this column. I can click here and say, “You know what, I want to keep only these employees.” When I do that, hire date column here is highlighted in blue below and what’s presented above is a histogram of the start dates of these employees. I could just slide this over to zero in on employees that were hired before a certain date and as I slide that over, the rows here below are updated dynamically to give me immediate feedback, “Is this the kind of change that you want in the data?”I can say yes, that’s what I’m looking for. There’s literally thousands of things that you could do to transform this data, anything you could imagine in terms of converting data types, calculated fields, transformations, conditional case statements, all that stuff. We’ll just do a couple more things here to keep this brief. Let’s say I don’t need this employee ID column. I can just say [drop 00:43:42] this and if I want to rename one of these columns I can simply click here and just update the column to be whatever I want it to be. I’ve done a few simple things here but let’s say I want to blend the data about my employees with data that I’m managing in a different system that represents the departments that they’re working. 
That data could be in my data lake, or it might be in a database in a different environment altogether. I'm going to say join, and here Dremio is going to recommend some joins to me. It knows about these joins to other data sets because it has learned from the patterns of use of different users and different tools; it knows how people are combining different data sets together, and it can say to you, "Hey, how about this one? This one seems popular. Would you like to try it?" The first recommendation is data about departments coming, as it happens, from a Redshift environment. Even though my employees are in a Postgres database running in Google Cloud, I can simply click this button and get data about the departments from a Redshift instance over on Amazon's cloud. Now I've got the department names combined with data about employees, and I can save this as "my senior employees" and put it in my new space.

What I've basically done behind the scenes, if I go into that new space environment that was empty before, is create what we call a virtual data set. I haven't moved any data from Postgres or Redshift; everything is where it is today, but I've created a virtual data set that lets me connect this data to any tool, just like we did with Tableau. I'll do what we did before: click the Tableau button, which launches a new instance of Tableau, and log in with my LDAP credentials. Let's look at the department name and salary to see who gets paid the most in this company. You can see that it looks like sales gets paid the most, but shipping is paid the second most, which doesn't make any sense to me.

Then I realize this is the sum and I need it to be an average instead, so I change that to average, and now I can see that on average executives are paid the most and shipping is paid the least, which makes more sense. What just happened? Tableau issued a SQL query back to Dremio over ODBC, and if I look into the history of queries that have been running in the system, I can see this is exactly the query that was run by Tableau on the virtual data set we created, called "my senior employees," in the new space. I can see the query came back in under a second, and it was issued by the username Kelly using an ODBC client. If I go back into the virtual data set we created, one of the nice things Dremio does in the background is preserve the lineage of these relationships.
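Behind the scenes, a virtual data set like this is just a SQL definition over its sources, not a copy of the data. The snippet below is a hedged, illustrative sketch of what the definition of "my senior employees" might look like; the source paths, column names, and cutoff date are hypothetical, and the real definition is whatever Dremio generated from the steps above.

```python
# Hedged, illustrative sketch only: source paths, columns, and the cutoff date
# are hypothetical. A virtual data set is stored as a definition like this,
# not as copied data; any SQL-speaking tool can query it over ODBC/JDBC.
MY_SENIOR_EMPLOYEES_SQL = """
SELECT e.first_name,
       e.last_name,
       e.email,
       e.hire_date,
       e.salary,
       d.department_name
FROM   postgres_hr.employees        AS e   -- physical table in Postgres
JOIN   redshift.public.departments  AS d   -- physical table in Redshift
  ON   e.department_id = d.department_id
WHERE  e.hire_date < DATE '2005-01-01'     -- the "keep only" filter from the demo
"""
```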
We created "my senior employees," and here is the schema for that virtual data set: you can see all the column names and the different data types. You can also see that it's derived from a physical table in Postgres called employees and a physical table called departments in Redshift. We preserve these relationships, so if you want to understand how different data sets are being used across different tools and departments, that's very easy to analyze. For example, say I want to see what other virtual data sets are descended from this physical table called employees in Postgres; in one click I can see all the different virtual data sets that have been created by different users. Here's one I didn't know about, and in one click I can see every query that's ever been issued against that virtual data set, who issued the query, and what the query was. Dremio even keeps a copy of the results of that query for a configurable amount of time, so you could effectively travel back in time and see what that user saw when they ran their query. It gives you a really powerful way to audit and understand the data being used by different tools in different contexts without making copies of the data. We call this our data graph.

That is just a very quick overview of the kinds of things you can do in Dremio as an open source platform to make your data lake, number one, fast for any kind of BI tool, without making cubes or extracts or moving the data back into a data warehouse or data mart. Number two, it's a self-service experience, so people can easily search and find data sets, launch their favorite tools connected to those data sets, build their own virtual data sets, collaborate with each other as teams, and create and manage their own semantic layer around the raw data in the data lake. Lots of great features and functionality. Then number three, you can combine data in your data lake with data that's outside the data lake. That lets you take advantage of the computing resources you've allocated to your data lake without first moving all of your data into the data lake in the first place. That can be a really powerful capability when you consider that your data is in so many different systems; it's probably impractical that it will ever all be in your data lake.
That’s probably a good point to wrap things up and hopefully the demonstration has been interesting. I’ll just leave this last slide to say, if Dremio seems interesting to you or you want to take a closer look, we have a download available on our website that is something you can try out on your laptop. It’s really designed to run in clusters of dozens, hundreds, even a thousand or more nodes but we’ve also designed additions that make it easy for you to try out on your Windows or Mac laptop. We have extensive documentation at docs.dremio.com. Lots of tutorials so pick your favorite tool whether that’s Tableau or Qlik or Power BI, Python.Pick your favorite data source, relational database, NoSQL, Hadoop, we have tutorials covering lots of those permutations. Then we have a vibrant community site where you can ask questions and get help from the community as you start to explore and understand what Dremio is. Let’s go see if we have any questions. Let’s see, I’m not an expert at using this panel here. Questions. Any questions from folks in the audience? PASS Representative:You can’t see them, Kelly? Kelly Stirman:I do not see any questions. PASS Representative:Okay. I have maybe two questions only. Couple other ones were asking about, can you maybe be able to provide the slides? Kelly Stirman:Yeah, happy to share slides after the conference. PASS Representative:We’ll send the slides everyone, and also I’ll be uploading the webinar to the YouTube page like always. Let’s do the questions real quick. First one, can Dremio be used on top of data hosted by an on prem SQL server? Kelly Stirman:Yes. SQL server, Oracle, Postgres, MySQL, those are all supported databases that Dremio integrates with. PASS Representative:All right. I guess this is a question to see if they’re understanding it correctly. The virtual data set you created in Arrow and Dremio is a tool to use to query those in Arrow? Kelly Stirman:Yeah, that’s a good question. How does Arrow fit in to the picture? First of all, a virtual data set is just that. You can think of it somewhat like a view in relational database. It is not a copy of the data, it is a way to access the data in a way that is potentially different from how it exist physically. In our example, we dropped the column, we filtered the data out, and we joined it between a couple of different data sets. We didn’t move any data to do that. We didn’t copy the data into Arrow. It acts like a view in a relational database so any time a query comes in against that virtual data set then Dremio would take that query, it would send a part of it to in our example part of it to Postgres, part of it Redshift, and read those results back from those sources into Arrow.Then do any kind of in-memory processing in Dremio’s distributed environment on the data and Arrow buffers in memory. Then the results would be streamed in Arrow buffers back to the ODBC client and then the ODBC client would convert the Arrow buffers into the data representation that make sense for your BI tool. A virtual data set is purely virtual until a query hits it and then the data is read into Arrow buffers for the purpose of executing the query. Hope that answers the question. PASS Representative:Does Dremio support Azure Data Lake Store? Kelly Stirman:Yes it does. There is if you go into let me just briefly show you if I come in here and connect to a new source, ADLS is one of the sources we support so as S3, Redshift, Elasticsearch, DB2, MySQL, et cetera, et cetera. 
There’s lots of different things and this is a list that continues to grow. If it’s not, a good chance it’s coming. PASS Representative:Does it have a connector for Yellowfin? Kelly Stirman:We do not currently have a connector for Yellowfin. PASS Representative:What’s the big difference between community versus enterprise edition? Kelly Stirman:Good question. To make that easy for folks, if you go to the download page for Dremio also you need to spell Dremio correctly. This is the download page and there’s a link here for Dremio Enterprise Edition. If you click on this, there’s a table that compares the two editions. The big difference is primarily around security. The enterprise edition integrates with LDAP and Kerberos. It also has some advanced administrative features that are not in the community edition. It connects to some sources like DB2 that are not available in the community edition. The community edition includes connectivity to Oracle and SQL server and open source databases but the enterprise edition is DBs and other commercial data sources. PASS Representative:I guess coming from this page, if someone was interested in the pricing model, would they have to talk to one of the sales reps? Kelly Stirman:Yeah, if you come to this page and you click this contact us for a free evaluation, there’s a small form you fill out and then we’ll be in touch with you to discuss commercial terms of our enterprise edition. It’s effectively something we license per server, it’s not per user or the amount of data that you have. It’s how many servers are running Dremio. PASS Representative:Okay. How do you add Dremio to an existing data lake environment? Kelly Stirman:It depends on the data lake that you’re using but let’s for example if I were running in a Hadoop cluster I would come to the admin screen in the provisioning and I would click Yarn and so if I’m running in Hadoop cluster I would just identify which distribution of Hadoop I’m running and then I would fill in this form. We use Yarn to provision to elastically provision Dremio within the Hadoop environment. If your data lake is running on Amazon or Azure then you would provision instances for Dremio and then connect Dremio to S3, to Redshift, or if you’re on Azure to ADLS and the different relational sources or other sources there. You would provision as much as many instances as necessary for the scale of data and number of concurrent queries you need to deliver to meet the SLA of your business. PASS Representative:Correct me if I’m wrong, but this integrates with Power BI? Kelly Stirman:Correct. PASS Representative:All right. Kelly Stirman:Qlik. Basically, any tool that generates SQL can connect to Dremio over ODBC, JDBC, or REST to take advantage of all the great features that we saw. PASS Representative:All right, two more. What are references of virtual data sets? Where are they hosted? Are they in the data lake? Kelly Stirman:The virtual data sets are not a copy of the data. They are simply, you can think of it as a configuration. We actually represent a virtual data set with SQL. If I went into my new space to that my senior employees behind the scenes this is actually represented by, this virtual data set is represented by the SQL query. That is the only thing that is effectively consuming any space or resources as you create these virtual data sets. That is stored in the data lake. 
Dremio’s acceleration capabilities use a really interesting feature called data reflections which we didn’t really have time to get into but those data reflections are also stored in the data lake. PASS Representative:Is there any additional information about Arrow? Kelly Stirman:It’s an open source project. There’s a lot of information available on the Apache Arrow website. There are resources available on Dremio’s website. If you Google for Apache Arrow, there’s actually a lot of stuff out there. PASS Representative:Okay. You may not answer this but is there any plan for adding support for Azure Cosmos DB? Kelly Stirman:It’s something we’re looking at very carefully with the market, with users of Dremio to gauge interest. There has been some interest thus far but currently it’s not something that we support. If there’s somebody out there that’s very interested in that, we would love to talk to you about adding support for that in Dremio in the near future. PASS Representative:Okay, here’s the last question as we hit the 1 o’clock. How do existing security protocols layer into the tool? Kelly Stirman:First of all, we in the enterprise edition supports LDAP so we would respect the access controls that you are already using in each of the sources. We also support Kerberos, something we’re using with some of your resources. Then in addition, Dremio has row and column based access control that we can manage within Dremio to layer on top of the controls that you already have in place. That lets you do things like mask data or control access in a very granular fashion based on the user’s LDAP group membership. You have a lot of options and flexibility there. PASS Representative:Well, I guess we can finish, there’s only one last question till you knock it out. I guess they want a little more explanation on the architecture of Dremio. Does it do any caching? Kelly Stirman:Yeah, that’s something, the concept of data reflection is something that we use Apache Arrow and Apache Parquet to accelerate the processing of queries and that’s relevant to the topic of caching. You can make sure number one, you have high performance and number two, you can ensure that Dremio offloads the analytical processing from the source systems that you’re connected to. It’s a big topic that we don’t really have time to get into today but in a follow up I’m happy to follow up with somebody who’s interested to hear more about data reflections. PASS Representative:Okay. All right. Thank you, Kelly and everyone. Just a reminder that we are recording it so I will upload it once I’m finished with it and it would be on our YouTube page. You can access it through pass.org and there will be a YouTube link. Please click on that, it will take you to the YouTube page for our chapter. From there you can see Kelly’s presentation. Kelly Stirman:Thanks, everyone. I enjoyed spending time with you and anyone that’s interested, I’d love to chat offline as well.