Software Engineering Daily Podcast, June 2018

Dremio CEO Tomer Shiran was a guest on the Software Engineering Daily podcast to talk about how Dremio is addressing the long-lived problems of data management, data access, and data governance within an enterprise.

Listen to the entire podcast over on Software Engineering Daily.

Transcript

SE Daily:

All right. Tomer Shiran, you are the CEO at Dremio. Welcome back to Software Engineering Daily.

Tomer Shiran:

Thanks for having me.

SE Daily:

I want to go through a brief history of data engineering with you, because I think you know as much about it as many other people and ... well, not many other people, actually. You have probably a more authoritative understanding of data engineering than most people.

20 years ago, we didn't really have many tools for data engineering. Things were more standardized, in terms of the people who were using databases or the other BI tools that might have been available 20 years ago, and we didn't have that term, data engineering. So if we're talking about companies in the late 90s, what were the closest analogs to data engineering back then?

Tomer Shiran:

I would say that we've had databases now for a while, but the need to move data around and the need to tune those databases and make them work, that responsibility fell on the shoulders of database administrators, so DBAs, as well as ETL engineers. So we started to see products like Informatica that would allow people to, in a more off-the-shelf way, move and transform data, and so those are basically ETL engineers. And a lot of these things were done through scripting and things like that as well.

SE Daily:

The closest thing to data engineering was probably data analysis. You had a relational database, you had a DBA that administered it, the database administrator. If I'm the CFO of a large enterprise company back in those days, like 20 years ago, let's say Coca-Cola. I'm the CFO, I decide I want some aggregation of data, and it's all in a database, how did that query get carried out?

Tomer Shiran:

Yeah, so in that timeframe, basically you'd have the data in one database or data warehouse, and as a CFO you're a very non-technical person, typically, and your expectation is that somebody is creating a report for you. So you're actually not interacting directly with the data, so you'd typically go to somebody in IT and maybe submit some requirements, explain to them what it is that you want the report or the dashboard to look like, and they'd go do the engineering work to make that happen. They'd connect a business intelligence tool to that database and then produce that report. And so it'd be a very simple, kind of on the back end, because typically at the time, data wasn't that big, it was all in one place, and with the combination of one BI tool and one database, you pretty much could solve that problem, although you had to be technical because of the nature of the tools at the time.

SE Daily:

As we move along the timeline, 10 years ago, Hadoop was born and in its infancy, and if we fast forward from that point in time when the CFO of Coca-Cola asks for a report and it's a pretty straightforward process towards 10 years later where Hadoop comes out, how had the data engineering world advanced from the late 90s to the Hadoop timeframe? Had time just stood still between that relational database era and the point at which Hadoop started to make data engineering more of a widespread challenge?

Tomer Shiran:

Well, I'd say even before the rise of Hadoop and the data lake, we had data warehouses, right? So we had technologies like Teradata and, of course, Oracle and Microsoft, etc., where companies that had data in lots of different places would try to consolidate that data into a single data warehouse, and those were very complex projects, because a lot of up-front data modeling and preparation of the data had to take place and you had to figure out what exactly people were gonna ask, and people would build OLAP cubes and so forth. So that's one thing that was going on even before Hadoop: the rise of the data warehouse as a centralized relational repository of data, and a lot of effort went into creating those things and keeping them up to date, often many millions of dollars for a typical enterprise. But in parallel to that, we also had the rise of much nicer tools from a BI standpoint. So for example, technologies like Tableau, where you don't need to be as technical to actually create visualizations. We started talking about Agile BI and self-service BI and making that easier.

Then Hadoop came along, based on ideas originally published by Google and developed as an open source project at Yahoo, and then commercialized by various different Hadoop vendors, and the goal there was to really solve some of the problems that existed in the world of data warehousing, where it was just too hard to get data into those platforms. With a data warehouse, you had to figure out schemas and models and spend tons of time before you could even ingest a single row of data into the system, whereas the Hadoop environments came along and made it easier to get data into the platform and centralize it in one place. At least that was the hypothesis.

SE Daily:

Hadoop got stable around 2011. You were at MapR around that time, in the early days ... MapR in 2009 through 2015. In that early point in time of Hadoop's life, 2009, 2010, 2011, when you were working with those enterprises that were starting to adopt Hadoop, what did their infrastructure look like?

Tomer Shiran:

Yeah, I actually remember my first meeting with LinkedIn about Hadoop back in 2009, I think. They had a 20-node Hadoop cluster in the basement. They also had a 20-node ... it was called Aster Data. It doesn't exist anymore, but it was an MPP database as well.

Fast forward a few years later, they were running thousands of nodes of that system, of Hadoop. But companies, they had data in lots of different places, data that was in relational databases, data that was in non-relational databases like MongoDB and Elasticsearch and so forth, and they started to stand up these Hadoop clusters running on potentially hundreds of servers, sometimes even more, and creating that single repository where they could load data into that system.

So yeah, I was at MapR from 2009 to 2015, one of the first employees there, and it was a great time, as we were enabling all of these enterprises to, for the first time, be able to bring together data from different sources into one place and be able to do something with that data.

SE Daily:

How did the Hadoop adoption change those enterprises? If I'm Coca-Cola back in 2009, 2010, 2011, I decide I want the big data. How does that change how I look at my infrastructure?

Tomer Shiran:

It went from a ... at least, the goal was to go from a world of lots of silos of data and lots of work to bring that data into one place, into having a centralized data lake where you can throw in lots of data, running on commodity servers, so you didn't have to run on high end, specialized hardware either, or MPP appliances. You could just run on commodity hardware that you would buy from companies like Dell and HP and Cisco and so forth, and it was really a very low cost way of storing large volumes of data. That was the significant innovation that came from that technology.

SE Daily:

I think to some extent, the silos of data access patterns and data technologies maybe got simplified, but there are still silos, and the silos today, maybe you could describe them through the lens of the different roles of people who are interfacing with the data and the data infrastructure in different ways. You've got the data scientist, you've got the business analyst, you've got the data engineer. Describe the different roles, the different silos that might exist in a data-focused enterprise today?

Tomer Shiran:

Yeah, I think from a personnel standpoint, you have different roles. You have the data engineers, who are responsible for processing the data, cleaning it up, making it available and so forth. You have the data scientists, who are the more technical consumers of data.

When we talk about data consumers, we think of two broad classes of users, one being the business analysts and the other being the data scientists. A business analyst is typically using BI tools and drag and drop tools like Tableau and Looker, Power BI from Microsoft and Qlik and MicroStrategy and so forth, and then you have the data scientists, who are often using more advanced tools, and that could be things in the Python ecosystem, things like R and so forth. And that goes all the way up to more machine learning and AI, but everything on that spectrum from hand-writing complex SQL queries to doing more advanced things.

SE Daily:

Today the Coca-Cola CFO asks a business analyst, maybe a data scientist, a question, something relating to revenue, probably. How long does that question take to answer, and what are the different steps along the way, from the data access and data aggregation to the result that gets received by the Coca-Cola CFO?

Tomer Shiran:

Yeah, I'd say unfortunately, the theory of having all the data in one place remained very much something theoretical that we as an industry were not able to accomplish, and even with the rise of the data lake, whether it's the on-prem data lake with Hadoop or now things like S3 or Azure Data Lake Store, it's still very much a complex data infrastructure landscape. And so for most companies, they might have that raw data sitting in a data lake, but then, in order to get the performance that they want, they end up having to extract some of that into a data warehouse or an MPP database. Then that's not fast enough, so they end up pre-aggregating the data in aggregation tables, or maybe they end up creating cubes so they can support faster analysis, and then just the raw data, as is, is not suitable for a data consumer, for an analyst, so somebody has to go and pre-process that data and so forth.

You end up having so many copies of data and so much work that goes on for everything that you want to do, and so when that business analyst who's serving the CFO of the company wants to do something, unless it's something that they've already done before, chances are they have to go to somebody in IT, like a data engineer, file a ticket, wait for their ticket to get prioritized. The data engineer then has to go bring together the data and maybe run the queries for them, and basically do a bunch of work just for that one simple request that came down from the CFO or maybe just from the business analyst or the product manager.

So it's a very complex process, very long. Often takes weeks and sometimes more, for most companies.

SE Daily:

All those sources of delay that can occur in that significant timeline between the Coca-Cola CFO having a question and that question being answered, you saw these when you were at MapR. You saw these, and they were part of the impetus for starting Dremio. When you left MapR to start Dremio, what were the specific problems in data engineering that you thought you might be able to solve?

Tomer Shiran:

Yeah, we basically looked at the life of a data engineer, and what we saw was that they were not happy, they were overloaded, and they were constantly dealing with very tactical, reactive work. So it was these types of requests from data consumers and others that were asking them, "Hey, can you run this query for me? Can you do this for me? Can you get this for me?" And that meant that they were constantly busy doing these support tickets, rather than doing more strategic work, and then at a higher level, the companies were just unable to find enough data engineering talent as well.

I recently had dinner with the heads of data from some of the largest, unicorn type companies in the Bay Area, and what they were saying was that today, because of the complexity of the data infrastructure and the volume of data, you kind of need a data engineer for every analyst that you have, to be successful, and nobody can even get close to that kind of a ratio in terms of finding the talent.

So that made us realize that the solution had to be in technology. There had to be a better way of doing things than having this very IT- and engineering-driven involvement for every single thing that people want to do with data.

SE Daily:

But unicorn companies, those are not as old as Coca-Cola, so these data problems, these don't just exist at legacy ... well, I don't want to call Coca-Cola a legacy, but it's an older company, just an older enterprise. They also exist at a company like ... I don't know, I'm gonna throw a name out there. I have no idea if this company was at the meeting, but like an Airbnb, where it's a big company but it's still kind of startuppy.

Tomer Shiran:

Right. Right, and I think that's what makes this even more significant, right? So for these companies, maybe they can get to a data engineer for every two consumers of data, but when you get into the more traditional enterprise, those ratios can be 10 to 1 or 100 to 1, and not only that, but their infrastructure, there's so much legacy from just years of operating and acquisitions and things like that, that make the problem even worse. So in some cases, people want to ask a question about data, and it takes months in order for them to be successful at doing that, and it takes a lot of engineering involvement for any kind of new question, and that's a big inhibitor for these companies, who all realize that they have to be data driven, right? They're being disrupted by the likes of Google and Amazon and so forth, but they can't be data driven, because people can't access data and they can't take advantage of data, and in most cases, when something takes you a week or a month, you give up. You just don't do it, right? You move on. And so that's a big problem and that's something that we're trying to solve here.

SE Daily:

And you could solve this problem in different ways at different levels of the stack. Looker, for example. I remember talking to somebody about Looker and they were asserting that Looker approaches this same problem of the different roles and the long lead time from question to answer, but it solves it, I think, more at the BI level. I think of Dremio as, I guess, beneath the BI level, or perhaps also encompassing the BI level, but sort of taking a more full stack approach.

Where would you describe Dremio as sitting in the stack of different tools that are ... because the business analyst is working with a BI tool, the data engineer is working with the data infrastructure, the data scientist is interfacing ... I don't know, more in the middle or along the entire surface of everything. Where in the stack is Dremio encompassing?

Tomer Shiran:

Yeah, that's a good question. So we actually don't get involved in the visualization layer. We leave that to companies like Looker and Tableau and Microsoft Power BI, which do a great job there. And those companies have actually focused on making the visualization layer self-service, so you don't need somebody technical to create the visualization or do the report for you.

The problem that we're solving is that everything underneath that layer, which today encompasses a lot of ETL and manual data engineering and cobbling together lots of different solutions, that layer is very much not self-service, and so we're very focused on making the rest of the data stack self-service, just like Tableau and Looker and Microsoft Power BI make the visualization layer self-service.

SE Daily:

Okay. The interaction between data scientists and data engineers and business analysts. What should that interaction be like?

Tomer Shiran:

That's a great question. It's exactly the kind of thing that we focus on, because we've built an open source platform that facilitates that interaction, so it's not this file-a-ticket, wait-three-weeks-to-get-something process. We believe that data engineering should be responsible for doing that initial collection and processing of data into some kind of a state where it's broadly useful to the general audience of data consumers within the company. But then the challenge is, every data consumer, every business analyst or data scientist, wants things a little bit different. It could be as simple as wanting the columns named differently, or they want some dataset in the organization joined with their own custom spreadsheet. So all those kinds of things, we call that the last mile, similar to the last mile in logistics or in the telco world, where that's the hardest problem. We want to solve the last mile problem, and we want to make it so that the data engineers don't have to get involved in the last mile, because that's so specialized and customized to each individual user that it just bogs them down if they have to serve each of those users. So we think that data engineers should worry about the long haul, the more standardized processing of the data and infrastructure in the company. And we want to provide technology that makes the data consumer much more self-sufficient, so they're not constantly bothering the data engineer with individual tasks.

SE Daily:

And give a little bit more description of the frictions that exist within these specific roles and between these specific roles. Like, the problems that you need to be able to solve.

Tomer Shiran:

Yeah. Let's look at a few examples. So one simple example is, let's say I'm a business analyst or a product manager, I want to do something with data. And maybe that data is in two different places today. Two different systems. And so today, without Dremio, the solution is I file a ticket within the support portal that I have in the company. And at some point that gets prioritized and somebody from data engineering can start a project to integrate those two sources. Maybe load them into a centralized data warehouse or an S3 bucket, or something like that. So that's one example of something where we'd like to make it so that the data engineer doesn't have to get involved.

Another example is maybe as a business analyst I'm doing some analysis in Tableau, or Looker, or something like that. Or maybe using Python. And the queries are just too slow. I'm not getting the performance that I want. And so again, that today without Dremio would become a data engineering task. Somebody would have to pre-aggregate the data and maybe sessionize it, maybe aggregate it by city or something like that, so that I can get a faster response time. Or maybe they have to load it into an in-memory database like HANA.

So there's a lot of work that would have to happen so that I could get the performance that I need in order to be able to interact with that data. So again, that becomes a multi-week project to move the data into a faster source or to process the data in a specific way that would make queries go faster. But then chances are I'm then going to want to do something different with that data. So that processing that had taken place is no longer helping me get the performance, so something needs to be adjusted or a cube has to get rebuilt. There's just so much back and forth every time, I'm not getting the performance that I want or I don't have access to the data that I need.

SE Daily:

When you started working on Dremio, how did you think about addressing those frictions? More specifically at an engineering level, what did you think that you could build to be able to address those frictions?

Tomer Shiran:

Yeah. So we envisioned a platform. We now call it a data as a service platform because we see that companies across every vertical, across every industry, want to deliver data as a service internally within their organizations. And that platform that we envisioned would be something that could connect to any data source that the company has. And so we started kind of focusing on the data lakes that people had already built, as well as some of the relational and non-relational databases. And then, something that would allow them to continue to use the existing BI tools and data science applications that they have. You know, things like Pandas and Tableau, and Power BI, and so forth.

And really at the core of the system is this idea that the only way to solve this problem is to kind of create this abstraction layer, which allows the consumers of data within a company to be able to interact with data, and explore data, and analyze data through kind of an abstraction layer, so that they can do data prep. They can join things. But do all of that without creating copies of data. So do it all in kind of a virtualized way. And then the system would provide the execution and query acceleration and caching capabilities that were needed to make things go fast, irrespective of what was done at the abstraction level.

And so, that's really what we built. If you think about it from the user interface standpoint, it looks a lot like Google Docs, except instead of docs it's datasets. And so users can create new datasets. We call them virtual datasets. They can then share them with their colleagues, who can build on top of that. And the company has the ability to see that data lineage, and what's been built, and what's dependent on what. And all that is basically at zero cost because it's all virtual datasets. They're not creating copies of data.

SE Daily:

Right. The last time we spoke, we talked about those virtual datasets. And virtual datasets are the datasets that I want Dremio to be aware of across my organization, whether it's MySQL or Elasticsearch or HDFS, [inaudible 00:25:17] that joins, or the other expected queries to join up data from MySQL with Elasticsearch, for example. And then, you could turn that into a reflection, as a materialized view, and have that expensive query be run and be stored as that materialized view. And then, you could access it when you wanted to. So you have this distinction between the virtual dataset, which doesn't cost you anything, and the reflection, which is a cache, a materialized view. Is that right?

Tomer Shiran:

That's correct. That's correct. So the virtual datasets, that's that abstraction layer. That's what allows people to go from ... By default, when you connect Dremio in your environment, you can get started. You can do joins between the different physical datasets that you're connected to. You know, the directory of CSV files that you have, or Parquet files in S3 with your Oracle table. But then the virtual datasets are the way in which the users can create new datasets that maybe have some kind of data prep, data curation done on them. Maybe they're a subset of the data, maybe they're filtered, maybe they're aggregated, maybe they're joined between two sources.

But then as the BI user, for example, is querying these virtual datasets, we want to make sure that these queries go fast. And so the way you make something go fast is by maintaining various data structures that make that go fast. Right? Kind of in the same way that you think of when you search the web through Google, you're getting a very fast response time. And that's because Google's not actually going and scanning all the web pages on the internet when you run that search query. It's because they've built indexes and various models that they use to support answering that query in a much faster time. Right? And database indexes, if you look at Oracle indexes or cubes and kind of the OLAP world, the idea is the same. Right? You have data structures that make it much faster and easier to answer queries.

The idea with the way that we do these data reflections, which is kind of one of the key innovations here, is that the data consumer, the person who's running the queries, is not even aware of their existence. And so those are completely behind the scenes. They get managed and maintained by the system. And their queries just magically go faster than they would otherwise.

SE Daily:

So the notion of the virtual dataset, if that's the data that I already have across my organization, like a virtual dataset would be an Elasticsearch index or a MySQL database. Why is it useful to be able to have that as an abstraction in Dremio? If the virtual dataset already exists, why is it useful to have it as something within Dremio?

Tomer Shiran:

Yeah. Let me just clarify. When you connect Dremio to an Elasticsearch cluster and, let's say, your Azure Data Lake Store. Right? Or your S3 bucket. The things in those systems, we call them physical datasets, and we actually never change them. But what happens most of the time is that people don't just want to analyze or expose the raw data in those systems, they actually want to do some additional work on that before they do their analysis. And so that additional work, just to give you some color here, it may be as simple as I have an index in Elastic or maybe I have an S3 bucket that has data about businesses. But maybe I just want to do an analysis of businesses in the US. And so, I'll create a virtual dataset on top of that kind of physical dataset, that has some additional filters in it. And I can do that either visually or if I know SQL I can actually do it through SQL in Dremio's interface. So that's creating a new virtual dataset.

Another example would be if I wanted to join data between two different systems. So I could take two physical datasets, one in Oracle one in Elasticsearch, and I could then join those two things, either visually again in our user interface or through SQL. And then, save that as a new virtual dataset. And so, that virtual dataset actually doesn't contain a copy of the data, but if I connect now Tableau or Power BI or Looker to Dremio, that virtual dataset appears to the BI tool just like any other table. And so the BI user can then start exploring and analyzing that virtual dataset.
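
To make that concrete, here is a minimal sketch of what that kind of cross-source query can look like from a client's point of view, using the ODBC driver mentioned later in the conversation. The DSN, source names, and columns are made up for illustration and are not from the interview.

```python
# Hypothetical sketch: joining an Oracle table with an Elasticsearch index
# through Dremio's SQL layer. The DSN, schema names, and columns are
# invented; Dremio exposes each connected source as a schema.
import pyodbc
import pandas as pd

conn = pyodbc.connect("DSN=Dremio", autocommit=True)

query = """
SELECT c.customer_id,
       c.region,
       SUM(o.order_total) AS revenue
FROM   oracle_crm.customers AS c     -- physical dataset in Oracle
JOIN   elastic_logs.orders  AS o     -- physical dataset in Elasticsearch
ON     c.customer_id = o.customer_id
GROUP  BY c.customer_id, c.region
"""

# The join runs across both sources; saving this SELECT as a virtual dataset
# (in the UI or via SQL) creates no copy of the underlying data.
df = pd.read_sql(query, conn)
print(df.head())
```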

SE Daily:

How does Dremio discover the schema of my different physical dataset sources?

Tomer Shiran:

Yeah. So a lot of work goes into both the way we connect to different data sources and understand their schemas, as well as how we deal with changes in that schema. First of all, a lot of systems have self-describing data. Right? If you look at, say, Parquet files on S3, we can interpret the schema from those Parquet files. If you look at Elasticsearch, it has something called mappings and those determine the schema. Of course, every relational database has a schema, and JSON documents are generally self-describing. That said, there are many cases where it's not a simple table. Maybe you're connecting to an S3 directory that has files with different structures in them. And so we have this kind of schema learning engine where, over time, as we're observing data through the execution of queries, we're kind of adjusting our internal understanding of what that data looks like and what that schema is. And so, we have this entire learning algorithm around schema. We call it the schema learning engine.
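
As a small illustration of what "self-describing" means here (plain pyarrow with a hypothetical file name, shown only to illustrate the idea, not Dremio's internal schema learning engine), a Parquet file carries its own schema that a reader can inspect directly:

```python
# A Parquet file is self-describing: the schema travels with the data,
# so no external catalog is needed to know the column names and types.
import pyarrow.parquet as pq

schema = pq.read_schema("businesses.parquet")  # hypothetical file
for field in schema:
    print(field.name, field.type)
```

JSON documents are self-describing in the same sense, but their shape can drift from record to record, which is why an engine that keeps refining its understanding of the schema as queries run is useful.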

SE Daily:

The reflections that you talked about, this important smart caching layer that speeds up the queries that analysts and data scientists are going to run, it's important that this query system is a little bit intelligent and can do things on its own, rather than having the engineers specify everything. How does Dremio figure out which queries, which reflections, to materialize into a file that will accelerate the actual usage of Dremio?

Tomer Shiran:

Yeah. So there are kind of two aspects to how reflections work. One of them is how to decide which reflections to create. And then the second one is when a query comes in, how do we even figure out that we can leverage one of these many reflections that you might have in the system? Right? And actually, by the way, the reflections get stored in something like an S3 or an Azure Data Lake Store, or HDFS, typically.

SE Daily:

So it's cheap?

Tomer Shiran:

Very cheap. Yeah. So it doesn't have to fit in memory. You don't have to have loads of memory in the system, which is typically where these things get really expensive. So to start with, when a query comes into the system, we have a sophisticated optimizer that looks at that query, compiles it into a query plan, and then basically runs a variety of algorithms to understand whether one or more reflections that are available in this storage layer could potentially be used instead of scanning the raw data. So that's where we'll potentially rewrite the query plan internally so that instead of scanning a trillion records, maybe we only have to scan a billion records. And then kind of roll that up and do some additional processing on that to return the answer to the user, which is the exact same answer that they would have gotten if we had scanned the raw data. So that's really the query substitution layer, the reflection substitution layer, where we're trying to take a query plan and understand whether we can accelerate it by rewriting it to use reflections.
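
As a toy illustration of that substitution idea (this is not Dremio's optimizer, just the shape of the check), an aggregate query can be answered from a pre-aggregated reflection whenever the reflection's dimensions cover the query's grouping columns and its measures cover the query's aggregates:

```python
# Toy sketch of reflection substitution (illustrative only): a GROUP BY
# query can be rewritten to roll up from a pre-aggregated reflection
# if that reflection covers the query's columns.
from dataclasses import dataclass

@dataclass
class AggReflection:
    name: str
    dimensions: set   # columns the reflection is grouped by
    measures: set     # columns it pre-aggregates (SUM, COUNT, ...)

def can_substitute(group_by: set, measures: set, r: AggReflection) -> bool:
    # The query can roll up from the reflection only if the reflection is
    # at least as fine-grained and carries all the required measures.
    return group_by <= r.dimensions and measures <= r.measures

daily_by_city = AggReflection("daily_by_city", {"city", "day"}, {"order_total"})

print(can_substitute({"city"}, {"order_total"}, daily_by_city))         # True
print(can_substitute({"customer_id"}, {"order_total"}, daily_by_city))  # False
```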

The question of, well, in the first place, which reflections get created in the system? We have two things right now that we do. And then, something more significant we're working on. And so what we have right now is, first of all, users can basically vote on specific datasets in the system. So if you're working with a dataset, whether it's a physical dataset or a virtual dataset, and things are too slow, you basically have something that's almost like a Facebook Like button, where you kind of up-vote that specific dataset. And then, the administrator of the system can see the votes and see which datasets people are more excited about. And then they can enable reflections on those.

And then, when it comes to an individual dataset in the system, and wanting to create reflections on those, we'll provide some basic recommendations based on things like the cardinality of different columns. There's also something we're working on now, which is more based on the user behavior. So given an amount of capacity that you're willing to allocate, let's say in your S3 buckets, basically a quota, we automatically determine, based on query history, what the best bang for the buck is in terms of the right reflections to create. So that's something that we're working on. Basically, kind of a very sophisticated machine learning engine.

And in addition to that, it will always be important to give users and the admins of the system some controls around this. To go back to your CFO example, the CFO might be doing something that's pretty unique. Nobody else in the company does it, so it's not very common. But because they are the CFO, they expect things to be fast. Right? They're maybe more important than other people in the company. Right? And so that's something that would be hard for a system to really know without connecting to their HR database. So we'll still always give people the controls to be able to, kind of even all the way to manually defining reflections to create.
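
The "best bang for the buck under a quota" idea mentioned above can be pictured as a simple greedy selection. This is purely illustrative, not Dremio's actual algorithm, and the candidate names and numbers are invented:

```python
# Toy sketch: score each candidate reflection by how many historical queries
# it could accelerate per gigabyte of footprint, then pick greedily until
# the storage quota is spent. (Illustrative only.)
def choose_reflections(candidates, quota_gb):
    """candidates: list of dicts with 'name', 'size_gb', 'queries_accelerated'."""
    ranked = sorted(candidates,
                    key=lambda c: c["queries_accelerated"] / c["size_gb"],
                    reverse=True)
    chosen, used = [], 0.0
    for c in ranked:
        if used + c["size_gb"] <= quota_gb:
            chosen.append(c["name"])
            used += c["size_gb"]
    return chosen

candidates = [
    {"name": "daily_by_city",   "size_gb": 2.0,  "queries_accelerated": 900},
    {"name": "raw_sorted_copy", "size_gb": 40.0, "queries_accelerated": 1200},
    {"name": "by_customer",     "size_gb": 5.0,  "queries_accelerated": 300},
]
print(choose_reflections(candidates, quota_gb=10))  # ['daily_by_city', 'by_customer']
```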

SE Daily:

What are the steps to executing the query against Dremio?

Tomer Shiran:

To external tools, Dremio appears like a relational database. And so if you connect a BI tool like Tableau or Power BI to Dremio, it thinks that it's connected to a relational database. And so you can just kind of drag and drop things in the interface. Create a new chart, create a new report, or a dashboard and things will just work. And so we provide ODBC, JDBC drivers, as well as a REST interface. And some tools just already support Dremio natively and you don't even need to use any of these drivers.

If you're more of a data science type user, a data scientist, a lot of our users use Jupyter notebooks as a way to interact with Dremio. And so we have very nice integration with the Python stack and Pandas specifically. Part of that comes from the fact that we created a project called Apache Arrow about a year and a half ago. And Arrow, since we open-sourced that and worked with the Python community, has since really grown in adoption. Almost 200,000 downloads a month, now. And it's embedded into everything from Pandas, to Spark, to H2O, InfluxDB. And we're working with various different organizations and companies like NVIDIA, for example, on Arrow. And so that ability of Dremio to integrate very well, especially with the data science tools, is something that's very unique here. Right? And that's why we see a lot of our users also using things like Jupyter and Pandas, and the entire ecosystem on top of it.
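
For a sense of what Arrow buys in that Python workflow, here is a minimal illustration using pyarrow and pandas with made-up data; it is not a Dremio-specific API, just the shared columnar format the guest is describing:

```python
# Apache Arrow as a shared in-memory columnar format: the same table moves
# between pandas and Arrow-aware systems column by column, without per-row
# serialization. Data here is invented.
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"city": ["Atlanta", "Austin"], "revenue": [120.5, 98.0]})

table = pa.Table.from_pandas(df)   # pandas -> Arrow columnar table
print(table.schema)

back = table.to_pandas()           # Arrow -> pandas
print(back)
```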

SE Daily:

If I'm an engineer at a company like a Coca-Cola, or an Airbnb, or any organization that's large enough to have a lot of engineers, and multiple data scientists, and multiple business analysts, and disparate data sources, it would be nice to have this data tool that stretches across my entire organization. And I can go into this data tool and connect data from one piece of the organization to another. Unfortunately, it's not a practical reality, not only from the engineering standpoint but from a data governance standpoint because there's the principle of least privilege as it applies to data.

Because if you have a 10,000-person organization, you should not have access to all of the data in that organization. There are privacy rules. And there are just certain teams that should not know what other teams are doing. I think the term Chinese wall is sometimes used. At least in financial institutions, where one part of the organization can't know about data in another part of an organization. So I think that's one thing that leads to silos. But in some ways, it's good that there are silos there. So if you're trying to build a tool where you can join disparate datasets, the tool has to be compliant with those data governance walls. How do you handle that aspect of large enterprises?

Tomer Shiran:

Yeah. I think you're hitting on something very important here, which is companies want to ... We're offering, really, a data as a service platform. And that's because companies want to offer data as a service internally and there is no practical way for them to go about doing that. And so, that's kind of the fundamental problem that we solve. I think a big part of this, though, is also making sure that the users or consumers of data are only allowed to see what they're supposed to see. Right? So when we connect to various data sources, the first thing is we always observe the permissions of the user within that data source. So when we're getting data from HDFS or, let's say, something like a relational database, we're actually leveraging the user's identity to make sure that we're only returning things that they're allowed to see.

That actually works throughout the entire caching layer as well; we always make sure that a user will never get data they are not supposed to see. Then also in the abstraction layer, when it comes to these virtual datasets, we actually make it possible for companies to control who gets to see what data. As a data engineer, you may be responsible for making sure people only get access to what they are supposed to get access to. Maybe you don't want to expose the raw data that you have, let's say, in your data lake. So, using Dremio you can actually control that further. You may say, “You know what, I'm not gonna expose the raw, physical datasets to anybody. I'm going to create some curated datasets that have the Social Security Numbers stripped out of them, and I'm only going to expose that virtual dataset, that's maybe kind of watered down, to the analysts. Maybe for the data scientists, I'm willing to give them a little bit more, and they are allowed to see something else.” And you can actually do this at the column level, based on the users and groups that the company has.
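
The curated-dataset idea can be pictured as nothing more than a saved SELECT that leaves the sensitive columns out. The table and column names below are invented, and whether the dataset is defined visually or in SQL, the shape is the same:

```python
# Hypothetical sketch of the SELECT behind a curated virtual dataset:
# the raw physical dataset stays hidden, and analysts only ever see this
# view, which simply omits the Social Security Number column.
curated_customers_sql = """
SELECT customer_id,
       full_name,
       city,
       signup_date          -- note: no ssn column is exposed here
FROM   data_lake.raw_customers
"""
# Exposing only the curated dataset, and granting access per user or group,
# is what lets different audiences see different slices of the same data.
```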

But I think that the thing to realize here is that users will work around the restrictions and the inability to get things. They will get their work done, in many cases, and they'll work around IT and the governance controls that people have in place. That's why we believe that the only way to get security is to provide users with a way to accomplish what they want, but to enable them to do that in an IT-governed system. So when you are exposing data through something like Dremio, one of these data as a service platforms, then IT gets to see who's doing what with the data. Who's accessing what data. You get to control what they're allowed to see, and you get to see the entire lineage of data. So this virtual dataset, you actually see a graph that shows a virtual dataset, what its ancestors are, what their ancestors are, you kind of browse that graph like you kind of browse a Google map.

So that to me is the key. Self-service in many ways is critical in order to achieve security, because otherwise people are downloading data into spreadsheets, or sending them around in emails, or extracting them into departmental BI servers. It just becomes a lot worse, and you don't actually know what people are doing with the data at that point.

SE Daily:

When I started Software Engineering Daily, I started going to conferences and when I started going to these software engineering conferences ... originally I was most interested in going to the talks. I went to the talks and I learned about the same things that people who listen to this show learn about: how databases work, how programming languages work, software architecture strategies, things like that. Over time, I actually became more interested in the goings on in the expo halls. I'm sure you've been to enough conferences where you've been to a lot of different expo halls, but these expo halls where you have all these different companies, they are presenting their products, because conference-goers are walking between the booths at the expo halls and talking to these different products. The products are making their pitches and giving their vision of the world because this is part of the sales process, of selling to the engineers, selling to the CTOs, selling to the CIOs.

That sales process has fascinated me over time, because if you are selling a product into an enterprise, you have to know where's the entry point. Where are you getting a foothold, where are you explaining the value, because Dremio is not a simple product to explain. A lot of these companies that are selling to developers, it's often solving a very subtle problem that the enterprise that is being sold to, may not even understand that they have. So you often need to talk to the engineers specifically and say “Look, you have this problem, you know you have this problem, and I need you to go back and talk to your CEO, or talk to your CIO and sell them on this idea because it's important.”

So, in building Dremio, what have you learned ... I'm sure you saw plenty of this in MapR, so you already have domain expertise, at least to some degree. But in building Dremio, what have you learned about the entry point, what is the way that you convince people that this is the approach. This is how you solve some of the data engineering, data science, data access problems at your organization.

Tomer Shiran:

Yeah, in most of the companies that we work with, and we're now deployed in everything from large enterprises like Intel, TransUnion, Royal Caribbean, all the way to smaller startups, by and large there's an organization that's responsible for the data infrastructure, delivering data as a service within the company. Often this is a data engineering team; it's the same group that's responsible for the data lakes, the data warehouses, the ETL, and so forth. So I think, for us at least, there's a clear buyer for the technology. Now, we do always make sure that we're also interacting with the consumers of data - the data scientists, and the BI users, because when they see the product, they really want it. So that helps internally, within those companies, helps them understand the value proposition as well.

I think a big thing that we're doing different here, we are all big believers in bottom-up adoption. If you look at our executive team, it's a lot of the executives from MapR, and from MongoDB as well. We very much believe in the open source model. We actually created Dremio as an open source technology. We allow people to download it. We have a Community Edition they can download and run in production. We now have thousands of companies downloading that every month, so that's been very successful, and a lot of the companies that we've gone on to do business with, over the last nine months since we launched the product, have actually started by just downloading the Community Edition from the website. I think that makes things a lot easier and it's also very much how people want to consume software these days. So that's been our approach.

It's been so far working out really well, just in terms of the volume of these downloads and the wide range of customers that we've been able to acquire across every industry you can imagine, from insurance companies to the largest tech companies, and then across every continent, from Australia to Singapore, to different countries in Europe, and of course the US.

SE Daily:

In the last month we've had a few recent shows about different solutions to this data sprawl that we've outlined in our conversation. So we had a show about Uber that's been pretty popular, and what Uber does is, at the highest level they expose Presto basically, which is a SQL interface that translates queries against whatever kind of backing store the data is stored in. That's one approach we've heard. Another approach is Citus Data, which suggests if you get all of your data in Postgres, then you can perhaps have the Postgres extensions system take care of all the variability in queries and you can have optimizations in that world.

And I know these are not totally disjoint approaches. There's probably companies that are gonna have both Presto and Dremio, there's companies that will have Citus Data and Dremio, there's companies that will have just one of these three. There's companies that will have completely other things. When you look at the spectrum of approaches to solving this data sprawl, what are your beliefs about how things are going to change in the future? How do you contrast the different approaches to solving that data sprawl?

Tomer Shiran:

I think one of the reasons we started Dremio is because we saw that with just SQL engines, whether it be Hive, kind of in the Hadoop space, or Presto, etc., that wasn't enough to solve the problem, to make users self-sufficient with data and to give them the performance they want. And that may work for some very large organizations that are willing to run systems on thousands of nodes and throw hundreds of data engineers at the problem, but there are very few companies, like Uber and like Google, that have their own internal solutions and that would want to do that.

What we're saying is that data as a service, doing that internally, is much more than just having a SQL interface. It's the ability to accelerate these queries so that the BI user can get a subsecond response time when they do have terabytes of data. It's the ability to join data across different sources and to have an interface that looks and feels like Google Docs, where people can collaborate and build on top of each other. It's the ability to visually curate data for people that are not engineers as well, because otherwise they constantly bother the engineers.

So I think we took much more of a full stack approach. If you thought of the iPhone and SLR cameras as competitors, you could think of it that way; they both take pictures, and I'm sure the iPhone has taken market share away from some of the traditional camera manufacturers in the market, but I think the value proposition is very different. The reason I go and use the smartphone to take pictures and videos of my kids is because it's then very easy for me to share them on WhatsApp, and on Facebook, and it gets backed up automatically on Google Photos. All this additional value comes from that deeply integrated system. That's how we think about solving this problem. It's not enough to have ten different technologies that I've cobbled together and thrown a lot of manual work at; we think that the experience has to be a lot better.

If you think about what Splunk did, prior to Splunk people wanted to analyze logs, it's not like they invented that problem, but they had to cobble together different solutions and use shell scripts and load logs into MySQL, and all these different things and a lot of work that came with that. Splunk came and said “Hey, here's a much more elegant, dedicated solution for this problem.” And that's how we think of what Dremio is doing for the world of data analytics.

SE Daily:

I know your time is short, but one other future-related question. So, much like Google Docs, or your camera application, or Splunk for logging, these problems that, from a high level, may look like just engineering problems that don't require machine learning, they're just figuring out the building blocks and then optimizing them by hand. In 2018, we're starting to see the benefits of putting machine learning in these kinds of systems. Even for data platforms, there was this paper from Google, that maybe you saw, about learned database indexes outperforming these manually created database indexes. What are the opportunities for machine learning in building a better, more efficient data engineering platform?

Tomer Shiran:

There's a huge opportunity here because you can do so much by just understanding what people are doing and what they want to be doing. That goes for everything from understanding what are the right data structures to create underneath the hood automatically, without asking anybody, just by observing, for example, the query patterns. That's why we really like our position as the tool that sees what everybody's doing, sees all the queries that are running across all the different sources, and being able to leverage that. There's a lot we could be doing there for sure with leveraging that knowledge and utilizing it to make future queries go faster. We already do things like recommending joins. So when you look at a dataset in Dremio and you click the “Join” button, we'll say “Hey, you might want to join this with this other dataset, based on the behavior of other users who have joined that dataset, or maybe something even derived from that dataset, with other things.”

Just building a tool is not enough; with a productivity tool, you really want to be able to leverage that understanding of what people are doing over time and also how that's changing, and also look at the data itself. A lot of what we do is observe the relationships within the data as we are running these various queries and joins in the system. We can then make smarter decisions about how to accelerate things just by understanding that maybe something is a one-to-many relationship, based on our historical observations.

Things of that nature, there is a lot of additional opportunity here when you start thinking about “Okay, I know what people are doing with that data, I know what they are accessing, how they are doing it, and so forth.”
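
The join-recommendation idea described above can be pictured as a simple frequency count over past queries. This toy sketch is illustrative only, not Dremio's implementation, and the dataset names are invented:

```python
# Toy sketch of behavior-based join recommendations: count which datasets
# other queries have joined with the one currently being viewed, and
# suggest the most frequent partners. (Illustrative only.)
from collections import Counter

# Hypothetical query history: the set of datasets each past query joined.
history = [
    {"orders", "customers"},
    {"orders", "customers", "regions"},
    {"orders", "shipments"},
    {"customers", "regions"},
]

def recommend_joins(current, history, top_n=3):
    partners = Counter()
    for joined in history:
        if current in joined:
            partners.update(joined - {current})
    return [name for name, _ in partners.most_common(top_n)]

print(recommend_joins("orders", history))  # e.g. ['customers', 'regions', 'shipments']
```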

SE Daily:

All right, well, to close off, what else are you working on at Dremio, what do you have in store for the near future?

Tomer Shiran:

One of the things that we'll be announcing soon is a new initiative around Apache Arrow. Arrow, I think I mentioned earlier, is an open source project that we created a year and a half ago and has really taken off as a foundational component for dozens of different open source and commercial products out there, from time series databases, to GPU databases, to Spark, to Python, and, of course Dremio. So we are working on a number of different new capabilities and extensions of Arrow, that will make Arrow-based systems anywhere from 5 to 10 times faster and also provide orders of magnitude faster integration.

So, today, systems integrate based on very old protocols and interfaces like ODBC and JDBC. For data science, we think there's a need for something much, much faster. So we're working with Wes McKinney, who is the creator of Pandas, and really designing a next generation interface for data, [inaudible 00:56:08] data in memory to move between systems. And so we have something coming up in the next couple of months.

And then a lot of additional capabilities, also inside of our own open source platform. That includes really advanced workload management capabilities. Many of our customers, like TransUnion, have hundreds of users that run on the platform, and they want to very intelligently prioritize the use of resources among all those different users for very high concurrency levels. So that's something we call workload management, or kind of mixed workload management.

We are working on the ability, for example, as datasets continue to grow in size, to leverage both GPUs, as well as the available disk space that you have in the cluster, so that even if you run out of memory, you can complete all your queries in an efficient way.

Lots of optimizations around performance, concurrency and workload management, in addition to what we're doing with Apache Arrow.

SE Daily:

All right, cool. I think I've said this before, but I think it's impressive, the three-year, four-year time between starting Dremio and getting to this point where you've got some serious customers. I think it says something about the delayed gratification of getting to this place, where you have good customers, or I should say really strong name brands. I think it says something about the vision that you had from the beginning. I'm always impressed when a company is able to take a really, really long vision, and I think three years is not tremendously long, but it's pretty long in the world of software engineering tools. I'm really happy to see you doing well.

Tomer Shiran:

Yeah, thank you. It's been very exciting. People say it takes years to build a database, but for us to be at this point where we're seeing thousands of people download it every month, and I think in the last few months it's grown 30 percent or something like that, month over month, it's really taking off.

SE Daily:

That's compounding. Okay, well, Tomer, thank you for coming on the show. It's great to have you.

Tomer Shiran:

Yeah, thank you so much, I appreciate it.