
Software Engineering Daily Podcast October 2017

Dremio CEO Tomer Shiran was a guest on the Software Engineering Daily podcast to talk about how Dremio works and who it benefits.

Listen to the entire podcast over on Software Engineering Daily.

Transcript

SE Daily:

The MapReduce paper was published by Google in 2004. MapReduce is a programming model that describes how to do large-scale data processing on large clusters of commodity hardware. The MapReduce paper marks the beginning of the big data movement. The Hadoop project was an open source implementation of the MapReduce paper. Doug Cutting and Mike Cafarella wrote software that allowed anybody to use MapReduce as long as they had significant server operations knowledge and a rack of commodity servers.

Hadoop got deployed first at companies with the internal engineering teams that could recognize its importance and implement it. Companies like Yahoo and Microsoft. The word quickly spread about the leverage Hadoop could provide. Around this time every large company was waking up to the fact that it had tons of data and didn't know how to take advantage of it. Billion dollar corporations in areas like banking, insurance, manufacturing, and agriculture, all wanted to take advantage of this amazing new way of looking at their data but these companies did not have the engineering expertise to deploy Hadoop clusters.

Three big companies were formed to help bring Hadoop to large enterprises, Cloudera, Hortonworks, and MapR. Each of these companies worked with hundreds of large enterprise clients to build out their Hadoop clusters and help them access their data. Tomer Shiran spent five years at MapR seeing the data problems of these large enterprises and observing how much value could be created by solving these data problems. In 2015, 11 years had passed since MapReduce was first published and companies were still having data problems. Tomer started working on Dremio, a company that was in stealth for another two years.

I interviewed Tomer two years ago and when he still could not say much about what Dremio was doing, we talked about Apache Drill, which was an open source project related to what Dremio eventually built. I should say, is an open source project. Earlier this year two of Tomer's colleagues, Jacques Nadeau and Julien Le Dem came on to discuss columnar data storage and interoperability and what I took away from that conversation was that today data within an average enterprise is accessible but the different formats are a problem.

Some data is in MySQL, some is in Amazon S3, some is in Elasticsearch, some is in HDFS, stored in Parquet files, and also different teams will set up different BI tools and different charts that read from a specific silo of data. At the lowest level, the different data formats are incompatible. You have to transform MySQL data in order to merge it with S3 data.

On top of that, engineers doing data science work are using Spark and Pandas and other tools that pull lots of data into memory, and if the in-memory data formats are not compatible, then the data teams can't get the most out of their work. They can't share their datasets with each other. On top of that, at the highest level, these data analysts working with different data analysis tools create even more siloing.

Now, I understand why Dremio took two years to bring to market. Dremio's trying to solve data interoperability by making it easy to transform datasets between different formats. They're trying to solve data access speed by creating a sophisticated caching system and they're trying to improve the effectiveness of the data analysts by providing the right abstractions for someone who is not a software engineer, to study the different datasets across an organization.

Dremio is an exciting project because it's rare to see a pure software company put so many years into upfront stealth product development. After talking to Tomer in this conversation, I'm looking forward to seeing Dremio come to market. It was fascinating to hear him talk about how data engineering has evolved to today and some of the best episodes of Software Engineering Daily covered the history of data engineering, including an interview that we did with Mike Cafarella, who was the co-founder of Hadoop.

We also did another episode called the history of Hadoop in which we explored how Hadoop made it from a Google research paper into a multi-billion-dollar, multi-company industry and you can find all these old episodes if you download the Software Engineering Daily app for iOS and for Android. With these apps, we're building a new way to consume content about software engineering and they're also open source at github.com/softwareengineeringdaily. If you're looking to get involved in our community and contribute to the open source projects, we would love to get your help. With that, let's get on to this episode.

You are programming a new service for your users or you are hacking on a side project. Whatever you're building, you need to send email and for sending email, developers use SendGrid. SendGrid is the API for email, trusted by developers. Send transactional emails through the SendGrid API. Build marketing campaigns with a beautiful interface for crafting the perfect email. SendGrid is trusted by Uber, Airbnb and Spotify but anyone can start for free and send 40,000 emails in their first month. After the first month, you can send 100 emails per day for free. Just go to sendgrid.com/sedaily to get started. Your email is important, make sure it gets delivered properly, with SendGrid, a leading email platform. Get started with 40,000 emails your first month at sendgrid.com/sedaily, that's sendgrid.com/sedaily.

Tomer Shiran is the CEO of Dremio. Tomer, welcome back to Software Engineering Daily.

Tomer:

Oh, thanks for having me.

SE Daily:

The last time we spoke we were talking about Apache Drill and in another episode, I talked to your colleague, the CTO of Dremio who is Jacques Nadeau and we talked about columnar data in that episode. And in both of these episodes I knew that the two of you were working on this stealth company, Dremio. I didn't know much about what you were building. Now that the product is out, I want to take a top down approach and we'll discuss what Dremio is and then we'll discuss the technical topics that we discussed in the past two episodes and sort of how they relate to the construction of this product.

To start off, it's 2017 and we've got teams of data scientists, data engineers, data analysts, these data teams that are also working with software engineers. They've got tons of data and they have some problems managing and accessing and visualizing that data. What are some of the specific problems faced by these teams of data engineers and data scientists and software engineers?

Tomer:

Yeah, so if you think about our personal lives and how easy it is when we go home and we have a question and we go online and ask that question on Google and, you know, one or two seconds later we have an answer, right? And you know, we have this amazing experience with data in our personal lives and that extends to smartphones and we want to book travel and within two minutes we book travel and it's very simple but then we come to work and it often takes us months to be able to answer a new question or create a new visualization, especially when you get to the enterprise where data is distributed all over the place and kind of owned by different teams. And a lot of work has to happen in order to make that data available for somebody to be able to ask questions of that data.

So that's kind of the core problem and a lot of times you'll see companies go through lots of kind of ETL work, where they're extracting and transforming data and they have to figure out some kind of data warehouse where they can load that data into and make it available. And it's just a lot of, lot of work and that takes months to do. And so that's a big challenge.

SE Daily:

Some of this data is sitting in Amazon S3, some of it's sitting in Elasticsearch, some of it's sitting in Mongo. Is the data in all of these different places in the right format to be queried by these data teams?

Tomer:

It may or may not be. You know, I think the world has moved to a place where we have lots of different types of data stores and you know, each of these data stores is really optimized for building different types of applications. And so developers that build a web app may choose Mongo because it's easier to build the app there, or Elasticsearch for some other use cases, and you know, maybe put the log files on S3. Really they were kind of optimizing for what's the best place to put the data for the application that I'm trying to build, as opposed to the type of analysis that somebody may later want to do with that data.

And so that's a challenge, right? If you think of the old world, maybe it was possible to have all my data in one relational database, let's say an Oracle database, and I could just as easily query that data and do my analysis directly on it. But that's obviously no longer the world, right? With today's kind of volume and variety and complexity of data, it's just way beyond a place where we can have all our data somehow magically in one relational database and expose that to a bunch of BI tools. That's just not feasible anymore.

SE Daily:

Do we want to uniformly turn these datasets into a single access system with consistent latency, consistent formatting? I mean, one thing we could talk about is columnar data, I think we will talk about that. Is the goal of Dremio to uniformly turn these datasets into columnar data?

Tomer:

It's actually, I would describe the goal really as self-service data. Our goal at Dremio really, is if you think of this new world where the data no longer can realistically be in one place, in one relational database, and at the same time, you have this growing demand for kind of self-service access to the data from, you know, everybody from the data scientist to the product manager and the business analysts and so forth. You know, how do we create a way for these people to be self-sufficient, to be empowered to do whatever they want with the data, no matter where that data is, how big it is, what structure it's in? And so to do that, we have to solve a variety of different problems that, you know, the traditional data infrastructure just doesn't deal with, right?

If you think about it, historically you know, we've had data in different places, we would then have to ETL that data into maybe some kind of staging area like a data lake or a Hadoop cluster or something like S3, and then you know, querying directly on that kind of system is more often than not too slow. So companies will tend to ETL a subset of that data, maybe the last 30 days, or some aggregate-level data, into a data warehouse. And that's not fast enough, so they create cubes and they pre-aggregate into other tables in the data warehouse and maybe they extract into the BI servers. And at the end of all that you have 10 different copies of the data and really a lot of manual work that has to be done by engineers every time somebody has a question or wants to do something new.

And so we think that in order to achieve this world where companies really want to leverage data, they want to be data driven, you have to create a system that empowers the end user, the data consumer. Whether they're a data scientist who's using Pandas or a business analyst using Tableau, how do you empower that user to do everything on their own and get the performance that they need? Which is often sub-second response time even when the datasets are, you know, petabytes in size.

SE Daily:

Alright, I think we understand this from the high level product perspective. What are the features that you need to build in order to make that data access easier? Are we talking about a visualization product? Are we talking about a query language? Are we talking about some sort of dashboard with both of those things built into it? Are we talking about an API? What are the features that you need?

Tomer:

Mm-hmm (affirmative). So what Dremio provides, and by the way, Dremio is available as an open source project as well as kind of an enterprise edition and so you can download it and you know, there are basically two aspects to it. So on one hand, if you think about most companies, there are different users that want to use different tools to explore and analyze data, ranging from BI to Excel to more advanced things like R and Python.

And so we don't want to create a visualization tool or something that people use to analyze the data. They already have plenty of those tools. But we do want to provide these data consumers the ability to access and analyze any data at any time. And so we provide a number of capabilities in that regard and that includes kind of an integrated data catalog where they can find the data that they want and kind of use a search type interface for that.

And we provide them with a visual interface where they can curate the data and create new virtual datasets and collaborate with their colleagues. And at the end of the day, we want to enable their existing tools, whether that's Tableau or Power BI or Qlik Sense or R or Python, to be able to connect to the system and run the query and get a response in less than a second, no matter how big the data is or where it's coming from.

And when it comes to, so for the data consumer, we want them to live in a logical world where they feel that they can do anything with the data at any time. Now at the same time, we have to provide the execution and acceleration capabilities that will actually make that fast. And so that's where, underneath the hood, there's an entire kind of SQL distributed execution engine leveraging Apache Arrow. There's an acceleration layer where we've pioneered something called Data Reflections, which can accelerate queries by orders of magnitude. And then there's kind of this data virtualization layer that knows how to talk to different databases and push down queries or parts of queries into these underlying databases, whether they are NoSQL databases like Elasticsearch and MongoDB or relational databases like Oracle and SQL Server and MySQL.
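To make that "looks like one SQL endpoint over many sources" idea concrete, here is a minimal client-side sketch, assuming a configured Dremio ODBC DSN; the DSN name, source names, table paths, and columns are all hypothetical, not taken from the conversation.

```python
# A hedged sketch: one SQL statement spanning two different source systems,
# issued through an ODBC connection. The DSN, schema paths, and column names
# below are invented for illustration; adjust them to your own setup.
import pyodbc
import pandas as pd

conn = pyodbc.connect("DSN=Dremio", autocommit=True)  # assumed DSN name

query = """
SELECT  c.city,
        COUNT(*) AS orders
FROM    mongo.shop.customers       AS c
JOIN    s3.logs."orders.parquet"   AS o ON o.customer_id = c.id
GROUP BY c.city
ORDER BY orders DESC
"""

cur = conn.cursor()
cur.execute(query)
cols = [d[0] for d in cur.description]
df = pd.DataFrame([tuple(row) for row in cur.fetchall()], columns=cols)
print(df.head())
```

From the client's point of view the join across MongoDB and S3 is just SQL; deciding which pieces get pushed down into each source is the engine's job.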

SE Daily:

You talked about a few of the technical concepts there. The reflection concept and the virtual dataset concept. What's the right order that we should approach these concepts to dive into them?

Tomer:

Right, so if you think about, let's say, the business analyst who's using the system or the data scientist who's a user in there and they want to work with the data. They're never aware of reflections, so I think we should first focus on kind of their experience: they're dealing with datasets. And so you have the physical datasets, those are the things that are in the collections in MongoDB and the indexes in Elasticsearch and the tables in Oracle and the Hive tables in Hadoop. Those are physical datasets, and then we allow these users to create their own virtual datasets, basically views of the data, and they can share them with their colleagues and build on top of each other and so forth. And so users always think about the world in terms of datasets, both physical and virtual datasets.

SE Daily:

DigitalOcean Spaces gives you simple object storage with a beautiful user interface. You need an easy way to host objects like images and videos. Your users need to upload objects like PDFs and music files. DigitalOcean built Spaces because every application uses object storage. Spaces simplifies object storage with automatic scalability, reliability and low cost. But the user interface takes it over the top.

I've built a lot of web applications and I always use some kind of object storage. The other object storage dashboards that I've used are confusing, they're painful and they feel like they were built 10 years ago. DigitalOcean Spaces is modern object storage with a modern UI that you will love to use. It's like the UI for Dropbox but with the pricing of a raw object storage. I almost want to use it like a consumer product. To try DigitalOcean Spaces, go to do.co/sedaily and get two months of Spaces plus a $10 credit to use on any other DigitalOcean products. You get this credit, even if you have been with DigitalOcean for a while. You could spend it on Spaces or you can spend it on anything else in DigitalOcean and it's a nice added bonus just for trying out Spaces.

The pricing is simple. $5 per month, which includes 250 gigabytes of storage and one terabyte of outbound bandwidth. There are no costs per request and additional storage is priced at the lowest rate available, just a cent per gigabyte transferred and two cents per gigabyte stored. There won't be any surprises on your bill.

DigitalOcean simplifies the Cloud. They look for every opportunity to remove friction from a developer's experience. I'm already using DigitalOcean Spaces to host music and video files for a product that I'm building and I love it. I think you will too. Check it out at do.co/sedaily and get that free $10 credit in addition to two months of Spaces for free. That's do.co/sedaily.

So the virtual datasets are these in-memory representations of the datasets, the physical datasets that are probably represented on disk?

Tomer:

Yeah, the physical datasets are represented typically on disk in some source system. The virtual dataset really is not an in-memory representation, it's just a logical definition, right? And that's the beauty of this, is that you can then have a thousand users creating these virtual datasets. There's virtually no cost to these things and they can create as many as they want and at the end of the day that's important because kind of in the old world, what happens is that every user wants to get the data into, you know, their own exact shape and form that they like before they do their analysis. And that indeed involves downloading the data into a CSV file or a spreadsheet or kind of creating a copy of the data. Whereas in Dremio, that's not required. Every user can take the data, kind of massage the data, get it into some other form and save that as a virtual dataset with literally zero overhead in the system. We're not materializing those virtual datasets.

SE Daily:

I see, so they're essentially saving their queries and the query, when they decide to run it, becomes a materialized view, but until then it's just a query, which in a sense is a virtual dataset.

Tomer:

Right. These virtual datasets are essentially defined by a select statement in SQL. And of course you can define virtual datasets that are built on top of other virtual datasets. So as an example, you may have a virtual dataset that is a join between a Hive table and an Elasticsearch index. And then another virtual dataset that maybe selects only the records from that first virtual dataset and filters them on city equals Mountain View.

So you can have that kind of data graph evolve over time, of virtual datasets defined based on other virtual datasets. And what's nice is that when a BI tool such as Tableau, for example, connects to Dremio, all these datasets, whether they're physical or virtual, are exposed to Tableau as tables that the Tableau user can then play with; they can start analyzing and visualizing, creating charts and dashboards and stories and so forth.
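As an aside, the two virtual datasets just described can be written out as the SELECT statements that define them; the dataset paths and column names below are invented for illustration, and in Dremio you would typically build these through the UI or its SQL interface rather than as raw strings.

```python
# Illustrative only: the SELECT statements behind the two virtual datasets
# described above. Source paths and columns are hypothetical.

# Virtual dataset #1: a join between a Hive table and an Elasticsearch index.
customer_events = """
SELECT  c.customer_id, c.city, e.event_type, e."timestamp"
FROM    hive.crm.customers      AS c
JOIN    elastic.events.clicks   AS e ON e.customer_id = c.customer_id
"""

# Virtual dataset #2: built on top of #1, keeping only one city.
# (In Dremio it would reference the first virtual dataset by its path.)
mountain_view_events = """
SELECT  *
FROM    customer_events
WHERE   city = 'Mountain View'
"""
```

Neither definition materializes anything; each one is just a saved query until something actually runs against it.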

SE Daily:

It sounds like these virtual datasets are kind of like stored procedures but when we change the way that we're referring to that abstraction, then you can start to build a different product around that. I mean, you start to build a product with the idea that this is a virtual dataset. It's not a query and, you know, I guess it becomes easier for people to think about merging virtual datasets together rather than running queries on top of one another. Or maybe you could help me understand the difference in terminology because it sounds like a virtual dataset, it's kind of just like a stored procedure.

Tomer:

It's actually similar to a view in a database. So if we're talking about one relational database, then there are views in that system. Now, views are typically defined by, you know, a DBA or somebody who's pretty technical. In the case of Dremio, these virtual datasets can be defined either through a SQL statement, a select statement, or by interacting visually with the data. So there's a user interface that's similar to Excel and allows them to click on a column and say, I want to drop this column, or select the zip code in an address column and click on extract, and we figure out how to extract that into a new column. And all that's doing underneath the hood is kind of modifying the SQL definition of that dataset. So effectively, you're creating views of the data. These are virtual datasets.
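Just to make the zip-code extraction concrete, here is the same transformation written in pandas; this is purely illustrative of the kind of expression the UI generates, not the SQL Dremio actually emits, and the sample addresses are made up.

```python
# The "select the zip code in an address column and click extract" step,
# expressed in pandas purely for illustration. Dremio's UI would instead
# rewrite the virtual dataset's SQL definition with an equivalent expression.
import pandas as pd

df = pd.DataFrame({"address": ["660 Castro St, Mountain View, CA 94041",
                               "1 Market St, San Francisco, CA 94105"]})

# Pull a trailing 5-digit zip code out of the free-text address column.
df["zip"] = df["address"].str.extract(r"(\d{5})\s*$", expand=False)
print(df)
```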

SE Daily:

What's a Dremio Reflection?

Tomer:

So we've talked about kind of the notion of virtual datasets and that's what users see in the product. That's what they interact with, that's what they share with their colleagues, that's what they analyze when they connect a BI tool or something like R or Python to Dremio. They play with these virtual datasets but at the end of the day, you know, these users expect high performance, right? So if something is really easy but it's slow, people don't want to use it.

And when we looked at, okay, achieving the holy grail of analytics, a system basically that allows you to interact with and analyze any data at any time. You know, part of the problem is solving kind of the logical aspect: make it easy for people to find things and collaborate, create new datasets and so forth. But the other aspect is how do you do that fast, right? And there are a lot of challenges in that regard because oftentimes the-

Tomer:

In that regard, because oftentimes the data is in a system that just physically won't let you go fast, right? If the data is in an Elasticsearch index and you're trying to join two indexes, and that requires scans of these indexes, well, if that system can only do 20,000 records per second per core, there's only so much speed that you can get when you're doing that kind of analysis. And so, really what users want is a response time of up to a few seconds, in most use cases, regardless of how big the data is. And, that's where the reflections come into play. That's, basically, our unique IP that we've developed to allow us to provide interactive speed queries regardless of the size of the data and the location where the data lives. And so, that's a lot of the magic of the system from a performance standpoint.

SE Daily:

I want to ask a little bit about the proprietary stuff. I know you probably can't tell me exactly what it's doing. My sense is that it's a complex caching system that does, maybe, some eager loading in certain situations, something like that?

Tomer:

Yeah, I think that's fair. It's a complex caching system. Now, what we're caching is not just the typical thing; a traditional cache is caching copies of the data, right? In our case, the reason these are called reflections is because we are caching different reflections, or different perspectives, of the data. So, we may cache the data in different shapes and forms. For example, we may cache a given data set pre-aggregated by various dimensions. We may cache it sorted in specific ways, or partitioned in specific ways. And so, these are what we call data reflections. One of the hard parts here is when a query comes in, let's say, from a BI tool like Tableau, or Qlik, or Power BI, our cost-based optimizer then has to look at that query, compile it, and say, "Okay, how can I reduce the cost of executing this query by leveraging one or more reflections?" So, internally, we are rewriting that query plan, if it's possible, to leverage the reflections rather than scanning the raw data in the source system again, and again, and again, and again. And, that's how we get the performance.

And so, you could think of it as when you go on Google and run a search query it would be very slow if Google had to then go and scan all the world's web pages, right, that would be slow. So, instead, they've created an index and they've created various models where they have already organized the data internally in various shapes that allow them to retrieve answers to specific queries very, very fast. And, that's what we do at Dremio, of course with different types of data structures that are more suitable for analytical queries. And so, you could think about, in the old world you had things like cubes, and projections, and indexes in relational databases, and all those types of techniques. If you combine those into one system and you put it behind-the-scenes where the end user doesn't even need to know about it, but it's the system deciding which one of these representations, or maybe multiple of them to use at query time, that's how we get that performance.

SE Daily:

Let's talk about some of the intelligence that goes into that reflection building. I don't know the best way to approach this but, maybe, I'm a data analyst, I've got a MySQL database, an Elasticsearch cluster, I've got three or four other data sources. I build some virtual data sets and if I was naively requesting those data sets at different points throughout the day it would take forever to build those virtual data sets, because they're just large sets of data. It takes a while to query them and pull them from disk into memory. But, with Dremio, if I've got this Dremio Reflection, it's going to intelligently have some of that information cached, it's gonna make it accessible to me in my BI tool a little more aggressively. What are you doing that's intelligent that gets that data into something, gets things going faster? Just give me some guidance for what you're doing.

Tomer:

Yeah. So, let's start with where these reflections live. So, the reflections live in a persistent store and we have three options. We can leverage S3 to store these reflections, we can leverage HDFS on any kind of Hadoop cluster for these reflections. And then, we can also just leverage the local disks of a Dremio cluster and stripe them across that. And, that's good for cases where you don't have Hadoop and you're not running in the cloud. And so, you can do that as well.

And so, that's where the reflections live and we are maintaining them on, let's say, an hourly basis, or whatever your SLA is. So, you can define that on a per-data source as well as a per-data set level and say, "Okay, I'm willing for this data to be, at most, one hour stale." And then, our engine makes sure that we're maintaining that reflection on that schedule and updating it either by doing full updates, or incremental updates, to maintain these reflections. And then, when your query comes in from the BI tool the query says, "You know what? I just want to count the number of events based on city," so aggregating by city. And, one of our reflections in the cache may be the data already aggregated by city, state, and neighborhood. Well, we can roll that up and give you the answer that you asked for by going through, maybe, a million records instead of going through a trillion records. And that, of course, gives you that many orders of magnitude faster response time.
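A toy version of that roll-up, sketched below with invented names and numbers: the reflection already holds event counts per (city, state, neighborhood), so a count-by-city query can be answered from it, provided the query's grouping columns are contained in the reflection's. Dremio's optimizer does this matching on relational query plans, not on DataFrames.

```python
# Hedged sketch of the roll-up idea. Data and column names are invented.
import pandas as pd

# The reflection: a pre-aggregated copy of the data, refreshed on a schedule.
reflection = pd.DataFrame({
    "city":         ["Mountain View", "Mountain View",  "San Jose"],
    "state":        ["CA",            "CA",             "CA"],
    "neighborhood": ["Old MV",        "North Bayshore", "Downtown"],
    "events":       [120_000,         340_000,          90_000],
})

query_group_by      = {"city"}                          # what the query asks for
reflection_group_by = {"city", "state", "neighborhood"} # what the reflection has

# Simplified matching rule: the query can be served from the reflection when
# its grouping columns are a subset of the reflection's grouping columns.
if query_group_by <= reflection_group_by:
    answer = reflection.groupby("city", as_index=False)["events"].sum()
    print(answer)   # counts per city, rolled up from the finer-grained cache
```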

SE Daily:

So, you have this scheduled job that pulls data from the virtual data set into the Dremio Reflection, and the Dremio Reflection will give you a faster, it's the materialized view of a virtual data set, right?

Tomer:

That's correct, at a high level. So, one of the important nuances here is that a single reflection could actually accelerate queries on many different virtual data sets. Because, as you may recall, these virtual data sets can be very much related to each other, they could be derived from one another. And so, at the end of the day when a SQL query comes into the system our optimizer doesn't really care about the virtual data set, it expands all those definitions and looks at the foundational relational algebra and says, "Okay, how can I massage this query plan, canonicalize it, figure out whether there is or there isn't a reflection, or multiple reflections, that I can use to satisfy this query more efficiently?" And so, it's not necessarily one-to-one between the virtual data set and a reflection. There can be many reflections associated with a virtual data set, but even those reflections could also accelerate queries on hundreds of other virtual data sets.

SE Daily:

You said there's a scheduled time, or some sort of SLA where the reflections get updated with the most recent pieces of data that have been added to the data sets that the virtual data sets are referring to. Does that mean that if I load my reflection into my BI tool it may not represent the most up-to-date version of the data?

Tomer:

Yeah, that's correct. So, like with any caching system you're basically saying, "I'm okay with looking at data that is one hour old," right? That's the trade-off that you're making here. Now, for most companies today it takes them months to create a new visualization. So, for them, it's a no-brainer, right? We do have some users where the SLA they have defined in our system is one minute. We have, for example, a use case that involves IoT data and it's a predictive maintenance use case where it really matters that it's very much up-to-date. And so, they do these refreshes on a minute-by-minute basis. But, for the most part, people seem to settle on, maybe, a one hour timeframe, or something in that range is what they prefer.

SE Daily:

It seems like maybe you could also time the reflections to be up-to-date when the data scientist sits down to do their daily or weekly analysis. I guess, I don't know much about how data analysts work. So, maybe they don't work that way. Maybe, they do more ad hoc, inspirational work.

Tomer:

Yeah, if somebody comes in every morning at 9:00 AM and that's when they usually work for an hour, you could certainly set it up so that you trigger it, or even use an API call to trigger a refresh at the time that you want. There are also all sorts of sophisticated capabilities here around how these reflections are maintained. And so, sometimes you may have a situation where you have different reflections that can be built off of each other rather than each of them being built from the source data. And so, then we internally maintain this. You can think of it as a dependency graph of reflections and we'll automatically figure out what's the right order in which we should refresh these reflections so that we're minimizing the amount of load that we put on, in the case of an operational database, that operational database.
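The dependency-ordered refresh can be pictured as a topological sort over that graph; the sketch below uses Python's standard-library graphlib (3.9+) with an invented set of reflections, and is only meant to illustrate the ordering idea, not Dremio's internal scheduler.

```python
# Toy illustration of refreshing reflections in dependency order: a reflection
# built from another reflection is refreshed only after its upstream one.
# The graph is invented; Dremio tracks these dependencies internally.
from graphlib import TopologicalSorter  # Python 3.9+

# Each reflection maps to the set of reflections it is built from.
deps = {
    "raw_events_reflection":  set(),                      # built from the source
    "daily_agg_reflection":   {"raw_events_reflection"},
    "city_rollup_reflection": {"daily_agg_reflection"},
}

for name in TopologicalSorter(deps).static_order():
    print("refresh", name)   # upstream reflections come out before downstream ones
```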

SE Daily:

Tell me a little bit about building a reflection, as far as you can go without revealing the secret sauce.

Tomer:

Sure. So, the reflections are actually stored in the Apache Parquet format, a combination of that and some elements from Apache Arrow, and then we added some optimizations on top of that. So, that's basically the on-disk format of each of these reflections. And, we actually allow the user, the administrator of the system, or somebody who has a good knowledge of the data, to go and tune these reflections. So, they can say, "You know what? This set of queries," say I just got a phone call from this business analyst in the marketing department saying that he's running some queries and they're too slow. And so, as an administrator of the system I could say, "You know what? Let me add a new reflection that's optimized for that type of workload." And, once I've defined that, and that may take me a minute, or maybe two minutes in the system, just a few clicks, that marketing user will now have very, very fast response time and they won't have to change any of their client applications, or if they're using something like Tableau, they won't have to change the worksheet or their dashboard. It's entirely transparent to the end user. So, that allows the administrator of the system to fine-tune these reflections.

And, that's important because while we have some kind of automated capabilities we also have a voting system where users can vote for things that they want to be faster. At the end of the day there are things that we don't know. So, one of our customers, for example, the CFO has their own set of reports that they look at every day and it's just one person. So, it's not something that's very frequent in the system, there's no way we could've known that that was important. But, because they are the CFO they are naturally important. So, we want to provide that kind of flexibility to users to be able to control which reflections exist in the system and they can add and remove them as they want. So, it's actually very easy.

SE Daily:

If I recall, the Apache Arrow project is for in-memory dataset interoperability. So, you should be able to share your data that is in Python, in-memory, so you're doing something with Python like pandas data science stuff and it's sitting in memory, and you should be able to shift that data to Spark and do stuff in-memory with Spark. Or, just have that data sitting in memory and you could run Spark operations on it, you could also run Python operations on it, and Arrow is the format that allows for that interoperability, is that right? Am I accurately describing Arrow?

Tomer:

Yeah, that's exactly right. So, Arrow is all about having a columnar in-memory technology for representing data in memory, and processing it, and leveraging the modern CPUs and GPUs. And, it's also a standard, right, a way for different systems to share the same type of memory structure.
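A small, hedged illustration of what that shared in-memory format looks like from Python: the same data moves between a pandas DataFrame and an Arrow table without going through a row-by-row serialization format; the sample data is invented.

```python
# Minimal sketch of Arrow as a shared columnar in-memory representation.
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"city": ["Mountain View", "San Jose"], "events": [3, 5]})

table = pa.Table.from_pandas(df)   # pandas -> Arrow columnar buffers
print(table.schema)                # the schema other Arrow-aware systems see

round_tripped = table.to_pandas()  # Arrow -> pandas again
print(round_tripped)
```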

SE Daily:

At Software Engineering Daily we need to keep our metrics reliable. If a botnet started listening to all of our episodes and we had nothing to stop it our statistics would be corrupted. We would have no way to know whether a listen came from a bot, or a real user, and that's why we use Incapsula to stop attackers and improve performance. When a listener makes a request to play an episode of Software Engineering Daily Incapsula checks that request before it reaches our servers, and it filters the bot traffic preventing it from ever reaching us. Botnets and DDoS attacks are not just a threat to podcasts, they can impact your application too. Incapsula can protect API servers and microservices from responding to unwanted requests. To try Incapsula for yourself go to incapsula.com/2017podcasts and get a free enterprise trial of Incapsula.

Incapsula's API gives you control over the security and performance of your application, and that's true whether you have a complex microservices architecture, or a WordPress site like Software Engineering Daily. Incapsula has a global network of over 30 data centers that optimize routing and cache your content. The same network of data centers are filtering your content for attackers and they're operating as a CDN, and they're speeding up your application. They're doing all of this for you and you can try it today for free by going to incapsula.com/2017podcasts and you can get that free enterprise trial of Incapsula. That's incapsula.com/2017podcasts to check it out. Thanks again, Incapsula.

So, you've got data in a MySQL database, and in an Elasticsearch cluster, and three other data sources, and you want to get that from a physical data set that has been earmarked as a virtual data set that people like, and you want to pull it into a reflection. You, maybe, want to get all that data into the Arrow format so that it's all interoperable, and then you get the Arrow format put into Parquet, so it's in an on-disk consistent format. And then, you put the on-disk format into a reflection, is that right?

Tomer:

Yeah. And so, everything in Dremio, as soon as it leaves the disk, or the source system, becomes Arrow format. So, all of our internal execution is based on Arrow. And, that means that as soon as we read a batch of records from Elasticsearch, or from HDFS, immediately that becomes a batch of Arrow in-memory. And, as it's executing in our entire execution engine, it goes from one Arrow buffer to another Arrow buffer as it's going through the different operators. And then, you can also take the results of an execution, and the goal really is to be able to use something like Python, and pandas supports Arrow as its high-performance memory representation. So, now imagine being able to take the result of a join, whether it's just the joining of two tables in Hadoop, or S3, or a join across different systems, and be able to do that analysis in pandas without having to de-serialize and serialize data. So, that's really the goal with Arrow and it's something we, at Dremio, put out open-sourced about, I want to say, a year ago, and it has since become the standard memory representation for pandas and NVIDIA's GPU data frame, and some of the GPU databases now use it as their in-memory representation as well. So, it's quickly becoming what we had hoped, which was to create this industry standard way to represent data in memory for analytical use cases.

SE Daily:

And so, when those reflections get pulled out into the BI tool, so are the reflections getting pulled out into Arrow and then being read by BI tools?

Tomer:

One of the things we've done is we've developed a very high-performance translator from Parquet to Arrow. And so, the reflections are stored in Parquet because there are efficiencies in Parquet that we can leverage, especially with how we use Parquet to store the data in a very, very highly compressed way. But, once we want to read that data, immediately we read it into the Arrow format. And, as more and more of these client applications, such as Python and R, but in the future also various commercial BI tools, embrace Arrow as their way of ingesting and leveraging data, and some of them are actually working on that right now, you'll have an extremely high-performance way to move data into those systems as well without having to go through the traditional ODBC protocol.
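The Parquet-to-Arrow step can be pictured with the open-source pyarrow library: a Parquet file on disk is read directly into Arrow's columnar in-memory representation. The file path below is made up, and this is of course pyarrow's generic reader for illustration, not Dremio's internal translator.

```python
# Reading a Parquet file straight into Arrow, then handing it to pandas.
# The path is hypothetical.
import pyarrow.parquet as pq

table = pq.read_table("/data/reflections/city_rollup.parquet")  # disk -> Arrow
df = table.to_pandas()                                          # Arrow -> pandas
print(df.head())
```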

SE Daily:

I see, I see. So, today, maybe things are not as fast as they will eventually be because, just to refresh people, you've got, okay. So, the end-to-end explanation. You've got a MySQL database, an Elasticsearch cluster, three other data sources, everything's got different latency, all the data has different structures. You go into Dremio, you label some of those physical data sets as virtual data sets, like, "Oh, I'm gonna want this join between my MySQL database and the data in Amazon S3, or RDS, or something," and you get that join labeled as a virtual data set. On some scheduled basis that virtual data set gets translated into a Dremio Reflection, which is a materialized view that sits on-disk, basically a cache on-disk, so that you can access it faster, so you don't have to do the entire join on the fly, you've just basically got it cached on-disk. And then, it's cached on-disk in Parquet format, which is a columnar format. And, people who want to learn more about columnar data can go back and listen to the episode I did with Jacques. We went really into detail on the Parquet and Arrow stuff.

So, it's sitting on-disk in Parquet. When you want to access it from a BI tool, today, you might need to figure out how to translate, well, or maybe there's some plugin. I think you said ODBC, I don't know much about what that is, open database something? Anyway, so in order to pull that from the Parquet file into your BI tool there is some degree of latency, or something, because there's not an easy Arrow translation layer. Maybe, you could disambiguate that process, what you were referring to?

Tomer:

Yeah. So, today, if you're using a client application on your desktop, whether that's, let's say, Excel, or if you're using something like Tableau, right, those tools typically use a local API called ODBC. There's another one called JDBC for Java applications, and that's how they want to talk to databases. And so, then you have an ODBC driver for each database on that client machine. Which means that, in the case of Dremio, we can maintain everything in Arrow all the way to the client over the network, but then it does have to get translated from Arrow into that ODBC API so that these traditional BI tools can use it. That doesn't have to happen for things like data scientists who are using, let's say, pandas. Because pandas now supports Arrow as a native memory representation, it can actually do operations on that. And, Wes McKinney, who is the creator of pandas, is actually one of the primary contributors on the Arrow project.

And so, for some tools like Python, for example, you don't need to go through that translation. While, for other tools that are, let's say, Windows applications like Tableau, you do need to go through that translation. Over time, we expect that more and more of these client applications will simply embrace Arrow natively and they'll not have to go through that translation layer that they all have to go through today. But, that's an ongoing thing and, depending on the use case, it may or may not matter so much. Oftentimes for BI tools the amounts of data being shown in the tool usually aren't very big, right? At the end of the day, it's painting dots that, in the case of something like Tableau, the human eye needs to be able to see on that report. So, it doesn't really help that there are a billion dots. You can't really visualize that.

So, most of the time these data sets are much, much smaller, because what's happening on the backend is that Dremio is already getting the instructions from Tableau to aggregate the data by city, and state, and customer. And so, we're already doing that aggregation and just sending back a much smaller amount of data.

SE Daily:

What's interesting about this project to me is, I think, you and I first talked a couple years ago when we were talking about Drill. We did this show about Drill and Dremio was just a splash page with very little information, and now you've got a full-fledged project. And, it makes total sense to me, it's unlike anything I know of. Maybe, there's something internally at Google where they use something like this. But, it sounds like a pretty cool and differentiated project where you've got a moat in the sense that you've got this caching system that you've developed, and I'm sure that'll get more sophisticated over time.

SE Daily:

... get more sophisticated over time. I'd love to know, did you know that this was the product two years ago or was there some finagling with the strategy?

Tomer:

I mean, we kind of knew at a high level what we wanted to achieve. I was one of the early employees at MapR, I was VP of Product there, spent five and a half years, and one of the things I observed over those years, kind of the emergence of big data, was that it was way too hard. At the end of the day, people wanted to leverage data but they then had to go hire twenty data engineers and train everybody and it was very hard to get value out of the data, especially to enable non-technical users as well to get value out of the data. So when we looked at that, we said, "Why is it so hard for companies to leverage their data?" That was kind of the initial question we were trying to answer and as we started interviewing more and more of these companies, and especially the larger Global 2000, many of the brands you're familiar with, it became clear that there were just so many different point solutions and legacy technologies that they had to deal with, and if somebody really could develop a system that would abstract all of this away and provide performance without making the users have to think about it, that could really be a game changer in analytics and that could finally enable companies to start capitalizing on data they actually already have in various places; that was the goal.

The technology is different; it definitely evolved over time. We had to build an entire engine from the ground up for this purpose, we couldn't use any existing SQL engine or Drill or any of those things; none of those met the needs for the performance and the types of things we're doing with Reflections. So we built that and we ended up building Arrow as part of that and actually open-sourcing that along the way. Now, when we launched the company, we also said, "You know what," ... All of our executive team actually comes from companies like MapR, MongoDB and Hortonworks and has a ton of experience with open-source and we think that's the right strategy in 2017. Dremio is itself an open-source project now that many companies are downloading every day and putting to use.

SE Daily:

The Dremio open-source project itself, meaning the thing that pulls from Parquet files, basically, or what is the Dremio open-source project?

Tomer:

Yeah, actually, everything we talked about today. That's all open-source, you can go to dremio.com or our GitHub page and download the software and run it on a cluster; either your Hadoop cluster or you can run it in the cloud and connect to your different data sources and-

SE Daily:

Oh, I see. So, the thing that is not open-source is essentially the Reflection builder and so if you're not using the Dremio Enterprise product, the business model, you're pulling from your virtual data sets and you have to schedule yourself: when are these virtual data sets going to be materialized and when am I going to pull them into memory?

Tomer:

No, actually, the only thing that's not open-source ... the Reflections and the acceleration capabilities are all open-source as well, they're all a part of the Community edition. What's not in the Community edition is, first of all, the enterprise security, the ability to connect to LDAP, for example, for user authentication, the ability to control access to different data sets. Then the second thing that's not available in the Community edition is some of the connectors to things like Teradata and IBM Db2, some of these technologies that are much more enterprise-oriented.

SE Daily:

Oh, okay. Those connectors are between the Reflection and the client-side application or is it between the on-disk physical data set and the reflection?

Tomer:

Well, it's actually the ... Dremio connects to the source systems. The first thing you'll do is, you'll go to our UI and you'll say, "Add Source" and you'll connect to your different data sources and that includes your Elasticsearch clusters, your Hadoop cluster, your MySQL database and so forth. So you're connecting to your different data sources and if you're using the Community edition you can connect to something like ten different data sources, and there are just a few that are only in the Enterprise edition, so that's kind of the difference there.

Regardless of which edition you're using, users can log in, they can create new virtual data sets; as an administrator, you can manage the Reflections that are being maintained behind the scenes and help with acceleration. So we think that any ... certainly, a start-up could get started with the Community edition and use that in production at any scale, up to a thousand servers, without a problem and not have to pay anything.

SE Daily:

I guess I'm still not quite ... I fully believe this business model makes sense, I'm just having a little trouble understanding it. So, where do the Teradata and Db2 connectors come in, or whatever ...

Basically, you described a world in which there is a long tail of ways to access your data and, unfortunately, you want to be able to do joins between your Teradata, or your Db2, or whatever databases were used in the COBOL era, maybe that's Db2, and you want to be able to work with these data sets just like your RDS cluster and your MySQL. You want to use them like they're modern data sets and in order to do that you should have some sort of connector that plugs them into Dremio and I'm just having a little bit of trouble understanding what that connector is because that's part of the premium offering.

Tomer:

Yeah, so, let's take a step back. Dremio has connectors. As part of our software, we ship connectors to different data sources. Data sources are the Teradata, the Oracle, the SQL Server, the Hadoop, the S3, and so forth. So we have a collection of different connectors and we're constantly adding more and more of these connectors and this is what allows our software to connect to different data sources, to push down queries into them, to be able to read data from them and so forth.

When a BI tool, for example, like Tableau or a tool like Pandas in the Python world, connects to Dremio, we look like a single relational database to that tool. So we're abstracting away all the different data sources and making it look like one relational database where you can join things together and so forth.

If you're using our Community edition you're able to connect to Elasticsearch and S3 and Hadoop and Mongo and SQL Server and so forth. But if you have a Teradata cluster you're not going to be able to connect to that because that's reserved for the Enterprise edition.

SE Daily:

Okay, all right, I think I got it.

So, the motivation for building Arrow or starting the open-source Arrow project, that's pretty interesting because that was basically a way of crowd-sourcing the connectors between the on-disk representations of the physical data sets that you could label as virtual data sets. That was a way of crowdsourcing: how do we get those virtual data sets into our execution engine so we can model it however we want and put it into our Dremio Reflections?

Tomer:

Arrow, when we talk about crowd-sourcing, it's a long game. To create a world where everything in the industry uses one memory format, first of all, that's never going to happen entirely, but it's something that's going to take time. Over the last year we saw a variety of different technologies now embrace Arrow and that's of course helpful for us as well and just helpful, in general, for the industry.

SE Daily:

I didn't mean to phrase it like it's a Machiavellian thing, it was just like, if we do these three things will that get the community going in a direction that really lifts our boat in the way that we ... 'cause it's going to raise all boats but we've built a really big boat.

Tomer:

Exactly. That's exactly right. Our goal was: let's provide value. I mean, the only way you're going to get anything adopted by anybody else is if you provide value to them. The fact that we provide Arrow, which is a great library for dealing with data in memory and solves a lot of problems that a bunch of people otherwise would have to solve themselves, and they don't want to if there's something open-source they can just adopt. We put that out there. Why does that benefit Dremio? I think it indirectly benefits Dremio because Arrow is the memory format that we chose as well, of course. So by having other systems talk and use that same format that we have, we can then have, over time, interoperability with different systems.

SE Daily:

We're talking about open-source now, I think it's worth taking a step back and looking at the history of this 'cause it's a pretty crazy lineage. I think I understand the lineage.

The Dremel project originally came out of Google and that's like Dremel, that's phonetically similar to Dremio, so I'm assuming Dremel is an ancestor of Dremio. Dremel was kind of this columnar ... it's like a columnar ... it's a system for doing faster analytics. BigQuery is based off of Dremel, that's Google's BigQuery service, it's very popular, and I think Drill had something to do with ... Was Apache Drill, was that kind of the open-source version of Dremel?

Tomer:

It's hard to say an open-source version of something because Google builds projects internally, they don't really open-source them.

SE Daily:

They white paper.

Tomer:

Right, they write a white paper.

SE Daily:

Until recently.

Tomer:

Drill is a SQL engine primarily focused on Hadoop and there are a number of others like Hive and Impala and Presto that kind of serve a similar purpose of being a SQL engine. Dremio is a very different type of project, obviously, just from the fact of the whole acceleration of queries, being able to accelerate queries by orders of magnitude, and having a way for users to collaborate and curate data sets in the system, and then also being able to push down queries into all these different types of systems. So it's a very different system. You're right, the name Dremio did ... well, we were looking for a short company name that had an available domain name and that was kind of hard to find and we ended up with that. The, I'd say, vision and kind of what we're doing is a lot broader than what these SQL engines were doing.

SE Daily:

It's catchy, I like the narwhal. It's funny, you compare the Dremel strategy or basically the Bigtable strategy, where Google released these white papers and then the community would sort of shamble slowly towards the Google infrastructure and everybody's constantly ten years behind what Google has because Google just releases these white papers. You compare that to today, the Kubernetes and TensorFlow strategies, where they open-source it and everybody immediately adopts what Google is doing and then Google gets to, kind of similar to the Dremio/Apache Arrow thing, Google has built the biggest ship so they get, in the rising tide lifts all boats, they get the most value out of it. What do you think about that contrast between the white paper strategy and the open-source strategy? Do you think this is a shift in Google strategy or do you think they're going to be selective with the white paper versus open-sourcing strategy?

Tomer:

I think it's definitely a shift and I think it started when they started investing more in their cloud infrastructure. They realized that if they want to appeal to companies to leverage Google Cloud, they need something to draw them there versus Amazon being, of course, the first player in the market and the largest player, and Microsoft owning the enterprise, if you will. So what is their strategy? And I think they realized that if they can open-source some of these technologies, get developers to start using them, and if they're the best place to host that Kubernetes environment or TensorFlow workloads, then that gives them an advantage. So, yeah, I think it's a smart move on their behalf. I think they also observed what happened with some of these white papers where they ... going back to, let's say, the MapReduce days, they wrote a white paper on MapReduce and then it was implemented years later in the open-source community. Of course, when Google came and said, "We want to provide a cloud service," they couldn't then offer their superior MapReduce service because it was a different API and everybody had already built apps on the open-source version.

SE Daily:

I didn't realize that.

Tomer:

Probably lessons learned in terms of let's get people developing on our APIs early on.

SE Daily:

Hilarious. Sure is a great world for developers these days.

Tomer:

Yeah, it is. There's a lot of free stuff out there and you can download things, and it's also why, even if you look at the world we're in at Dremio, where we're closer to that business analyst and data scientist, where a lot of tools are proprietary, we still thought the right strategy was to release something open-source because it encourages that kind of bottom-up adoption where people can download it. They love that ability to get started; they don't have to talk to a salesperson. Developers hate that, I hate that.

SE Daily:

Did you build a visualization tool? Your own visualization tool?

Tomer:

No, from a visual standpoint, our UI looks kind of like Google Docs for datasets. People can create new virtual data sets, they can collaborate, share them with their colleagues, kind of build on top of that. Then there's kind of a data set editor where they can curate the data, they can massage the data, but we don't provide that last mile of visualization the way a BI tool would, so we very much prefer to partner with companies like Looker and Tableau and Qlik and so forth.

SE Daily:

But that Google Docs style stuff, all that's open-source?

Tomer:

Yep, that's all open-source.

SE Daily:

That seems pretty unique. That seems pretty differentiated. Nobody does that, right? Like Looker and Periscope Data and stuff, they don't do that stuff, right?

Tomer:

No, that's correct. Looker very much focuses on the analysis as opposed to getting the data ready, so we actually partner very closely with Looker and actually share a board member with them.

SE Daily:

Oh, of course. Okay. Interesting.

In that case, we're drawing to an end, so let's kind of close off with the contextual stuff. What is the most modern data scientist doing with Dremio? You talked a little bit about the Periscope Data/Looker set of tools, what are the tools that the most modern data scientist is using today and what are they doing with them?

Tomer:

Yeah, so another way to think about it is: there are companies whose goal it is to provide self-service visualization, or self-service data science. These are companies like Looker or Tableau or Microsoft Power BI, that's what they do. Dremio's goal is to provide self-service for everything underneath that, so if in the past you needed to have a data warehouse and a bunch of ETL tools and cube-building technologies and pre-aggregating data and extracting it, and all kinds of stuff, we wanted to create self-service at the data layer for that entire data platform. Then you have an entire end-to-end stack that's self-service, both the visualization, which comes from Looker, Tableau, Power BI, etc., and everything underneath that, which comes from Dremio.

So the data scientist today will use ... that term data scientist is a little bit ... it's very broad, it means different things to different people. For some people a data scientist is somebody who writes SQL queries, for others it's somebody who uses a visual interface like Tableau or Looker, and then for other people it's more of a machine learning person who builds models and deploys models and they may be using something like Python or R for that kind of use case.

SE Daily:

The Dremio business model, are you modeling yourself after ... 'cause I'm trying to think of an analog, it's not really like the MapR world that you come from. The whole MapR and Cloudera and whatever ... there's also a third one that I'm forgetting-

Tomer:

Hortonworks?

SE Daily:

Hortonworks, right. It was interesting because that model of company became sort of a consultancy type of model, but I think part of that is because this was so hard to do and everybody wanted Hadoop, everybody wanted the big data, but nobody had any idea how to set it up, and so you really needed this group of consultants to come in and help you set up and give you enterprise software that made things easier to use. You were kind of in a different world, and Dremio does not need to be anything like that, or am I getting it wrong?

Tomer:

Oh, you're 100% correct. It was very refreshing to go from the Hadoop world, where you sell somebody Hadoop and it takes them six months before they get any value out of it, to a world where on the first day that somebody starts using the system they're already solving really hard business problems they were having for a long time. The spark that you see when we demo the product to people, when they see the actual demo, they love it. That's really refreshing, so we don't sell professional services, there's no need for consultants; we can help them on the first day to install it and integrate with their systems but that's really all they need. But when it comes to the open-source and the business model, I would say, of the three Hadoop vendors, we're most similar to Cloudera, in that we kind of take that Community edition, which is an open-source Apache-licensed version, and then we have an Enterprise edition which has additional functionality that people pay for.

SE Daily:

Well, Tomer, this has been a fantastic conversation, really technical, interesting product but also very subtle business model. Love talking to you Dremio guys, so continue to do more shows, I'm looking forward to it.

Tomer:

Yeah, thanks so much for hosting me, I really enjoyed the conversation.