
Dremio Overview With Tomer Shiran

Transcript

Interviewer:

The MapReduce paper was published by Google in 2004. MapReduce is an algorithm that describes how to do large-scale data processing on large clusters of commodity hardware. The MapReduce paper marks the beginning of the big data movement. The Hadoop project was an open source implementation of the MapReduce paper. Doug Cutting and Mike [Cafarella 00:00:23] wrote software that allowed anybody to use MapReduce, as long as they had significant server operations knowledge and a rack of commodity servers.

Hadoop got deployed first at companies with the internal engineering teams that could recognize its importance and implement it, companies like Yahoo and Microsoft. Word quickly spread about the leverage Hadoop could provide. Around this time, every large company was waking up to the fact that it had tons of data and didn't know how to take advantage of it. Billion-dollar corporations in areas like banking, insurance, manufacturing and agriculture all wanted to take advantage of this amazing new way of looking at their data. But these companies did not have the engineering expertise to deploy Hadoop clusters. Three big companies were formed to help bring Hadoop to large enterprises: [Cloudera 00:01:11], Hortonworks and [MapR 00:01:15].

Each of these companies worked with hundreds of large enterprise clients to build out their Hadoop clusters and help access their data. Tomer Shiran spent five years at MapR seeing the data problems of these large enterprises and observing how much value could be created by solving them. By 2015, 11 years had passed since MapReduce was first published, and companies were still having data problems. Tomer started working on Dremio, a company that was in stealth for another two years. I interviewed Tomer two years ago, when he still could not say much about what Dremio was doing, and we talked about Apache Drill, which is an open source project related to what Dremio eventually built.

Earlier this year two of Tomer's colleagues, Jacques Nadeau and Julien Le Dem, came on the show to discuss columnar data storage and interoperability. What I took away from that conversation was that today, data within an average enterprise is accessible, but the different formats are a problem. Some data is in MySQL. Some is in Amazon S3. Some is in Elasticsearch. Some is in HDFS, stored as Parquet files. Also, different teams will set up different BI tools and different charts that read from a specific silo of data. At the lowest level, the different data formats are incompatible. You have to transform MySQL data in order to merge it with S3 data.

On top of that, engineers doing data science work are using Spark and Pandas and other tools that pull lots of data into memory, and if the in-memory data formats are not compatible, then the data teams can't get the most out of their work. They can't share their data sets with each other. At the highest level, the data analysts working with different data analysis tools create even more siloing. Now I understand why Dremio took two years to bring to market. Dremio is trying to solve data interoperability by making it easy to transform data sets between different formats.

They're trying to solve data access speed by creating a sophisticated caching system, and they're trying to improve the effectiveness of the data analysts by providing the right abstractions for someone who is not a software engineer to study the different data sets across an organization. Dremio is an exciting project, because it's rare to see a pure software company put so many years into upfront stealth product development. After talking to Tomer in this conversation, I'm looking forward to seeing Dremio come to market.

It was fascinating to hear him talk about how data engineering has evolved to today. Some of the best episodes of Software Engineering Daily covered the history of data engineering, including an interview that we did with Mike [Cafarella 00:04:05], who was the co-founder of Hadoop. We also did another episode called The History of Hadoop, in which we explored how Hadoop made it from a Google research paper into a multi-billion-dollar, multi-company industry. You can find all these old episodes if you download the Software Engineering Daily app for iOS and for Android. With these apps, we're building a new way to consume content about software engineering.

They're also open sourced at github.com/softwareengineeringdaily. If you're looking to get involved in our community and contribute to the open source projects, we would love to get your help. With that, let's get on to this episode.

You are programming a new service for your users, or you are hacking on a side project. Whatever you're building, you need to send email. For sending email, developers use SendGrid. SendGrid is the API for email, trusted by developers. Send transactional emails through the SendGrid API. Build marketing campaigns with a beautiful interface for crafting the perfect email. SendGrid is trusted by Uber, Airbnb and Spotify, but anyone can start for free and send 40,000 emails in their first month. After the first month, you can send 100 emails per day for free.

Just go to sendgrid.com/sedaily to get started. Your email is important. Make sure it gets delivered properly with SendGrid, a leading email platform. Get started with 40,000 emails your first month at sendgrid.com/sedaily. That's sendgrid.com/sedaily.

Tomer Shiran is the CEO of Dremio. Tomer, welcome back to Software Engineering Daily.

Tomer:

Thanks for having me.

Interviewer:

The last time we spoke, we were talking about Apache Drill, and in another episode I talked to your colleague Jacques Nadeau, the CTO of Dremio, about columnar data. In both of these episodes I knew that the two of you were working on this stealth company Dremio. I didn't know much about what you were building. Now that the product is out, I want to take a top-down approach. We'll discuss what Dremio is, and then we'll discuss the technical topics that we covered in the past two episodes and how they relate to the construction of this product.

To start off, it's 2017 and we've got teams of data scientists, data engineers and data analysts. These data teams are also working with software engineers. They've got tons of data and they have some problems managing and accessing and visualizing that data. What are some of the specific problems faced by these teams of data engineers and data scientists and software engineers?

Tomer:

If you think about our personal lives and how easy it is when we go home and we have a question, we go online and ask that question on Google. One or two seconds later we have an answer. We have this amazing experience with data in our personal lives, and that extends to smartphones, and we want to book travel. Within two minutes we book travel and it's very simple. But then we come to work and it often takes months to be able to answer a new question, or create a new visualization, especially when you get to the enterprise, where data is distributed all over the place and kind of owned by different teams. A lot of work has to happen in order to make that data available for somebody to be able to ask questions on that data.

That's kind of the core problem, and a lot of times you'll see companies go through lots of kind of ETL work, where they're extracting and transforming data, and they have to figure out some kind of data warehouse that they can load that data into and make it available. It's just a lot of work and it takes months to do. That's a big challenge.

Interviewer:

Some of this data is sitting in Amazon S3. Some of it is sitting in Elasticsearch. Some of it is sitting in Mongo. Is the data in all of these different places in the right format to be queried by these data teams?

Tomer:

It may or may not be. I think the world has moved to a place where we have lots of different types of data stores, and each of these data stores is really optimized for building different types of applications. Developers that build an app on ... A web app may choose Mongo, because it's easier to build the app there, or for some other use cases Elasticsearch, and maybe put the log files on S3. Really they're kind of optimizing for what's the best place for me to put the data for the application that I'm trying to build, as opposed to for the type of analysis that somebody may later want to do with that data. That's a challenge. If you think of the old world, maybe it was possible to have all my data in one relational database and I could just as easily query that data and do my analysis directly on that, let's say Oracle database.

But that's obviously no longer the world. With today's kind of volume and variety and complexity of data, it's just way beyond a place where we can have all the data somehow magically in one relational database and expose that to a bunch of BI tools. That's just not feasible anymore.

Interviewer:

Do we want to uniformly turn these data sets into a single access system with consistent latency and consistent formatting? One thing we could talk about is columnar data. I think we will talk about that. Is the goal of Dremio to uniformly turn these data sets into columnar data?

Tomer:

It's actually ... I would describe the goal really as self-service data. Our goal at Dremio really is, if you think of this new world where the data can no longer realistically be in one place, in one relational database. At the same time you have this growing demand for self-service access to the data from everybody, from the data scientist to the project manager and the business analyst and so forth. How do we create a way for these people to be self-sufficient and be empowered to do whatever they want with the data, no matter where that data is, how big it is, or what structure it's in? To do that we have to solve a variety of different problems that the traditional data infrastructure just doesn't deal with.

If you think about it historically, we've had data in different places. We would then have to ETL that data into maybe some kind of staging area like a data lake, or Hadoop cluster, or something like S3. Then querying directly on that kind of system is more often than not too slow. Companies will tend to ETL a subset of that data. Maybe the last 30 days, or some aggregate level of data, into a data warehouse. That's not fast enough, so they create cubes and pre-aggregate into other tables in the data warehouse. Maybe they extract into the BI servers.

At the end of all that you have 10 different copies of the data, and really a lot of manual work that has to be done by engineers every time somebody has a question, or wants to do something new. We think that in order to achieve this world, where companies really want to leverage data, they want to be data-driven, you have to create a system that empowers the end users, the data consumers. Whether they're a data scientist who's using Pandas, or a business analyst using Tableau, how do you empower that user to do everything on their own and get the performance that they need, which often means sub-second response times, even when the data sets are petabytes in size?

Interviewer:

I think we understand this from the high-level product perspective. What are the features that you need to build in order to make that data access easier? Are we talking about a visualization product? Are we talking about a query language? Are we talking about some sort of dashboard with both of those things built into it? Are we talking about an API? What are the features that you need?

Tomer:

What Dremio provides, and by the way Dremio is available as an open source project, as well as kind of an enterprise solution. You can download it. There are basically two aspects to it. On one hand, if you think about most companies, there are different users that want to use different tools to explore and analyze the data, ranging from BI to Excel to more advanced things like R and Python. We don't want to create a visualization tool, or something people use to analyze the data. They already have plenty of those tools. We do want to provide these data consumers the ability to access and analyze any data at any time. We provide a number of capabilities in that regard, and that includes an integrated data catalog, where they can find the data that they want and use a search type interface for that.

We provide them with a visual interface where they can curate the data and create new virtual data sets and collaborate with their colleagues. At the end of the day we want to enable their existing tools, whether that's Tableau, or Power BI, or Qlik Sense, or Python, to be able to connect to the system and run a query and get a response in less than a second, no matter how big the data is, or where it's coming from. When it comes to ... For the data consumer, we want them to live in a logical world, where they feel that they can do anything with the data at any time.

Now at the same time we have to provide the execution and acceleration capabilities that will actually make that fast. That's where, underneath the hood, there's an entire distributed SQL execution engine leveraging Apache Arrow. There is an acceleration layer, where we've pioneered something called Data Reflections, which can accelerate queries by orders of magnitude. Then there is this data virtualization layer that knows how to talk to different databases and push down queries, or parts of queries, into these underlying databases, whether they're NoSQL databases like Elasticsearch and MongoDB, or relational databases like Oracle and SQL Server and MySQL.

Interviewer:

You talked about a few of the technical concepts. The reflection concept and the virtual data set concept. What's the right order in which we should approach these concepts to dive into them?

Tomer:

If you think about what the ... Let's say the business analyst, or the data scientist, is the user of the system and they want to work with the data. They're never aware of reflections. I think we should first focus on their experience: they're dealing with data sets. You have the physical data sets, those are the things that are in the collections in MongoDB and the indexes in Elasticsearch and the tables in Oracle and the Hive tables in Hadoop. Those are physical data sets, and then we allow these users to create their virtual data sets. Basically views of the data, and they can share them with their colleagues and build on top of each other and so forth.

Users always think about the world in terms of data sets, both physical and virtual data sets.

Interviewer:

Digital Ocean Spaces gives you simple object storage with a beautiful user interface. You need an easy way to host objects like images and videos. Your users need to upload objects like PDFs and music files. Digital Ocean built Spaces, because every application uses object storage. Spaces simplifies object storage with automatic scalability, reliability and low cost. But the user interface takes it over the top. I've built a lot of web applications and I always use some kind of object storage. The other object storage dashboards that I've used are confusing. They're painful and they feel like they were built 10 years ago.

Digital Ocean Spaces is modern object storage with a modern UI that you will love to use. It's like the UI for Dropbox, but with the pricing of raw object storage. I also want to use it like a consumer product. To try Digital Ocean Spaces, go to do.co/sedaily and get two months of Spaces, plus a $10 credit to use on any other Digital Ocean products. You get this credit even if you have been with Digital Ocean for a while. You can spend it on Spaces, or you can spend it on anything else in Digital Ocean. It's a nice added bonus, just for trying out Spaces. The pricing is simple. Five dollars per month, which includes 250 gigabytes of storage and one terabyte of outbound bandwidth.

There are no costs per request, and additional storage is priced at the lowest rate available. Just a cent per gigabyte transferred and two cents per gigabyte stored. There won't be any surprises on your bill. Digital Ocean simplifies the cloud. They look for every opportunity to remove friction from a developer's experience. I'm already using Digital Ocean Spaces to host music and video files for a product that I'm building and I love it. I think you will too. Check it out at do.co/sedaily and get that free $10 credit in addition to two months of Spaces for free. That's do.co/sedaily.

The virtual data sets are these in-memory representations of the data sets, the physical data sets that are probably represented on disk?

Tomer:

Yeah, the physical data sets are represented typically on disk in some source system. The virtual data set really is ... It's not an in-memory representation, it's just a logical definition. That's the beauty of this, is that you can then have 1,000 users creating these virtual data sets. There's virtually no cost to these things. Then they can create as many as they want. At the end of the day that's important, because in kind of the old world what happens is that every user wants to get the data into their own exact shape and form that they like, before they do their analysis. That involved downloading the data into a CSV file, or a spreadsheet, or kind of creating a copy of the data.

Whereas in Dremio that's not required. Every user can take the data, kind of massage the data, get it into some other form, and save that as a virtual data set with zero overhead in the system. We're not materializing those virtual data sets.

Interviewer:

I see. They're essentially saving their queries, and when they decide to run the query it becomes a materialized view. But until then it's just a query, which in a sense is a virtual data set.

Tomer:

Right, and these virtual data sets are essentially defined by a SELECT statement in SQL. Of course you can define virtual data sets that are built on top of other virtual data sets. As an example, you may have a virtual data set that is a join between a Hive table and an Elasticsearch index. Then another virtual data set that maybe selects only the records from that first virtual data set and filters them on city equals Mountain View. You can have that kind of basically data graph evolve over time, of these virtual data sets defined based on other virtual data sets.

What's nice is that these virtual data sets are exposed when a BI tool such as Tableau, for example, connects to Dremio. All of these data sets, whether they're physical or virtual, are exposed to Tableau as tables that the Tableau user can then play with. They can start analyzing and they can start visualizing and creating charts and dashboards and stories and so forth.
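
To make the layering concrete, here is a minimal sketch of virtual data sets expressed as SQL view definitions, written as Python strings for illustration. The source paths, table names and columns are assumptions, not Dremio's actual catalog.

```python
# Hypothetical sketch of layered virtual data sets; all names are illustrative.

# First virtual data set: a join between a Hive table and an Elasticsearch index.
orders_with_profiles = """
SELECT o.order_id, o.amount, p.city, p.state
FROM hive.sales.orders AS o
JOIN elastic.users.profiles AS p ON o.user_id = p.user_id
"""

# Second virtual data set: built on top of the first, filtered to one city.
mountain_view_orders = """
SELECT order_id, amount
FROM orders_with_profiles
WHERE city = 'Mountain View'
"""

# Neither definition copies any data; each is just a logical query that the
# engine expands when a client selects from the virtual "table".
```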

Interviewer:

It sounds like these virtual data sets are kind of like stored procedures, but when we change the way that we're referring to that abstraction, then you can start to build a different product around that. I mean you start to build a product with the idea that this is a virtual data set, it's not a query, and I guess it becomes easier for people to think about merging virtual data sets, rather than running queries on top of one another. Maybe you could help me understand the difference in terminology, because it sounds like a virtual data set is kind of just like a stored procedure.

Tomer:

It's actually similar to a view in a database. If we were talking about one relational database, then there are views in that system. Now, views are typically defined by a DBA, or somebody who's pretty technical. In the case of Dremio, these virtual data sets can be defined either through a SQL statement, a SELECT statement, or by interacting visually with the data. There's a user interface that's similar to Excel that allows them to click on a column and say, I want to drop this column, or select a zip code in an address column and click on extract. We figure out how to extract that into a column.

All that's doing underneath the hood is kind of modifying the SQL definition of that data set. Effectively you're creating views of the data. These are the virtual data sets.
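
As a rough illustration of what modifying that SQL definition might look like when a user drops a column and extracts a zip code in the visual editor; the table, columns and regular expression are assumptions, not Dremio's generated SQL.

```python
# Illustrative before/after of a virtual data set definition as it is edited
# visually. The SQL shown here is a sketch, not Dremio's actual output.

before = """
SELECT customer_id, address, internal_notes
FROM mysql.crm.customers
"""

# After dropping "internal_notes" and extracting a zip code from "address":
after = """
SELECT customer_id,
       address,
       REGEXP_EXTRACT(address, '[0-9]{5}') AS zip_code
FROM mysql.crm.customers
"""
```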

Interviewer:

What's a Dremio reflection?

Tomer:

We've talked about the notion of virtual data sets. That's what users see in the product. That's what they interact with. That's what they share with their colleagues. That's what they analyze when they connect a BI tool, or something like R or Python, to Dremio. They pull these virtual data sets. At the end of the day, these users expect high performance. If something is really easy but it's slow, people don't want to use it. When we looked at achieving the Holy Grail of analytics, a system basically that allows you to interact with and analyze any data at any time, part of the problem was solving the logical aspect. Make it easy for people to find things and collaborate, create new data sets and so forth.

But the other aspect is, how do you do that fast? There are a lot of challenges in that regard, because oftentimes the data is in a system that just physically won't let you query it fast. If the data is in an Elasticsearch index and you're trying to join two indexes, that requires scans of these indexes. Well, if that system can only do 20,000 records per second per core, there's only so much speed that you can have when you're doing that kind of analysis. Really what users want is a response time of up to a few seconds in most use cases, regardless of how big the data is.

That's where the Reflections come into play. That's basically our kind of unique IP that we've developed to allow us to provide interactive-speed queries regardless of the size of the data and the location where the data lives. That's a lot of the magic of the system from a performance standpoint.

Interviewer:

I want to ask a little bit about the proprietary stuff.

Tomer:

Sure.

Interviewer:

I know you probably can't tell me exactly what it's doing. My sense is that it's a complex caching system that does maybe some eager loading in certain situations, something like that?

Tomer:

I think that's fair. It's a complex caching system. Now, what we're caching is not just the ... A traditional cache is caching kind of copies of the data. In our case, the reason these are called Reflections is because we are caching different reflections, or different perspectives, of the data. We may cache the data in different shapes and forms. For example, we may cache a given data set pre-aggregated by various dimensions. We may cache it sorted in specific ways, or partitioned in specific ways. These are what we call data reflections. One of the hard parts here is when a query comes in, let's say from a BI tool like Tableau, or Qlik, or Power BI, our cost-based optimizer has to look at that query, compile it and say, "Okay, how can I reduce the cost of executing this query by leveraging one or more reflections?"

Internally we are rewriting that query plan, if it's possible to leverage the reflections, rather than scanning the raw data in the source system again and again and again. That's kind of how we get the performance. You can think of it as, when you go on Google and you run a search query, it would be very slow if Google had to then go and scan all the world's web pages. That would be slow. Instead, they've created an index and they've created various models, where they have already organized the data internally in various shapes that allow them to retrieve answers to specific queries very fast.

That's kind of what we do at Dremio. Of course, with different types of structures that are more suitable for analytical queries. You can think about it, in the old world you had things like cubes and projections and indexes in relational database systems. All those types of techniques. If you combine those into one system and you put it behind the scenes, where the end user doesn't even need to know about it, but it's the system deciding which one of these representations, or maybe multiple of them, to use at query time. That's kind of how we get that performance.

Interviewer:

Let's talk about some of the intelligence that goes into that reflection building. I don't know the best way to approach this, but maybe I'm a data analyst. I've got a MySQL database, an Elasticsearch cluster. I've got three or four other data sources. I build some virtual data sets, and if I was naively requesting those data sets at different points throughout the day, it would take forever to build those virtual data sets, because they're just large sets of data. It takes a while to query them and pull them from disk into memory. But with Dremio, if I've got this Dremio Reflection, it's going to intelligently have some of that information cached. It's going to make it accessible to my BI tool a little more aggressively. What are you doing that's intelligent that gets that data into something ... Gets things going faster? Can you just give me some guidance for what you're doing?

Tomer:

Yeah. Let's start with where these Reflections live. The Reflections live in a persistent store, and we have three options. We can leverage S3 to store these Reflections. We can leverage HDFS, or a Hadoop cluster, any kind of Hadoop cluster, for these Reflections. Then we can also just leverage the local disks of a Dremio cluster and stripe them across that. That's good for cases where you don't have Hadoop and you're not running in the cloud. You can do that as well. That's kind of where the Reflections live, and we are maintaining them on, let's say, an hourly basis, or whatever your SLA is. You can define that at a per data source level, as well as at a per data set level, and say, "Okay, I'm willing for this data to be at most one hour stale."

Then our engine makes sure that we're maintaining that Reflection on that schedule and updating it, either by doing full updates or incremental updates, to maintain these Reflections. Then when your query comes in from the BI tool, the query says, "I want to see all the ..." Just count the number of events by city. Aggregating by city, and one of our Reflections in the cache may be the data already aggregated by city, state and neighborhood, so we can roll that up and give you the answer that you asked for by going through maybe a million records, instead of going through a trillion records. That of course gives you that many orders of magnitude faster response time.
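
A hedged sketch of the rewrite being described: a query that counts events by city answered from a reflection that is already aggregated by city, state and neighborhood. The table and reflection names are hypothetical.

```python
# Illustrative only: the user's query versus the plan an optimizer could
# substitute when a matching aggregation reflection exists.

user_query = """
SELECT city, COUNT(*) AS events
FROM events_dataset
GROUP BY city
"""

# A reflection pre-aggregated by (city, state, neighborhood) keeps partial
# counts, so the engine can roll those up instead of scanning raw events.
rewritten_against_reflection = """
SELECT city, SUM(event_count) AS events
FROM reflection_city_state_neighborhood
GROUP BY city
"""

# Rolling up ~1 million pre-aggregated rows instead of ~1 trillion raw records
# is where the orders-of-magnitude improvement in response time comes from.
```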

Interviewer:

You have this scheduled job that pulls data from the virtual data set into the Dremio Reflection. The Dremio Reflection will give you a faster ... The Dremio Reflection is like the materialized view of a virtual data set, right?

Tomer:

That's correct, at a high level. One of the important nuances here is that the Reflection doesn't have to ... I mean a single Reflection could actually accelerate queries on many different virtual data sets, because as you may recall, these virtual data sets could very much be related to each other. They could be derived from one another. At the end of the day, when a SQL query comes into the system, our optimizer doesn't really care about the virtual data set. It kind of expands all those definitions and looks at the foundational relational algebra and says, "Okay, how can I kind of massage this query plan, [inaudible 00:30:00], and figure out whether there is, or there isn't, a reflection, or multiple reflections, that I can use to satisfy this query more efficiently?"

It's not necessarily one-to-one between the virtual data set and a reflection. There can be many reflections associated with a virtual data set, but even those reflections could also accelerate queries on hundreds of other virtual data sets.

Interviewer:

You said there's a scheduled time, or some sort of SLA, where the reflections get updated with the most recent pieces of data that have been added to the data sets that the virtual data sets are referring to. Does that mean if I load my reflection into my BI tool, it may not represent the most up-to-date version of the data?

Tomer:

Yeah, that's correct. Like with any caching system, you're basically saying, I'm okay with looking at data that is one hour old. That's kind of the trade-off that you're making here. For most companies today, it takes them months to create any new visualization, so for them it's a no-brainer. We do have some users where the SLA that they've defined in our system is one minute. We have, for example, a use case that involves IoT data. It's a predictive maintenance case, where it really matters that the data is very much up to date. They do these refreshes on a minute-by-minute basis. For the most part, people seem to settle on maybe a one hour time frame, or something around that range is what they prefer.

Interviewer:

It seems like maybe you could also time the reflections to be up to date when the data scientist sits down to their daily, or weekly, analysis. I guess I don't know much about how data analysts work. Maybe they don't work that way. Maybe they do more ad hoc, inspirational work.

Tomer:

Yeah, you could certainly ... If somebody comes in every morning at 9:00 am, and that's when they do their work for an hour, you could certainly set it up so that you can trigger it, or you can use an API call to trigger a refresh at the time that you want. There are also all sorts of kind of sophisticated capabilities here around how these reflections are maintained. Sometimes you may have a situation where you have different reflections that can be built off of each other, rather than each of them being built from the source data. Then we internally maintain this kind of ... You can think of it as a graph, a dependency graph of reflections, and we'll automatically figure out the right order in which we should refresh these reflections, so that we're minimizing the amount of load that we put on, in the case of an operational database, that operational database.
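
To illustrate the dependency-graph idea in isolation (this is a generic sketch, not Dremio's implementation), a few lines of Python that order refreshes so each reflection is rebuilt only after the reflections it derives from:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency graph: each reflection maps to the reflections it is
# built from; an empty set means it is built directly from a source system.
deps = {
    "raw_orders_reflection": set(),
    "daily_rollup_reflection": {"raw_orders_reflection"},
    "city_rollup_reflection": {"daily_rollup_reflection"},
}

# Refresh upstream reflections first, so downstream ones read fresh data and
# the operational database is scanned as few times as possible.
refresh_order = list(TopologicalSorter(deps).static_order())
print(refresh_order)
# ['raw_orders_reflection', 'daily_rollup_reflection', 'city_rollup_reflection']
```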

Interviewer:

Tell me a little bit about building a reflection, as far as you can go without revealing the secret sauce.

Tomer:

Sure. The reflections are actually stored in the Apache Parquet format, a combination of that and some elements from Apache Arrow, and then we added some optimizations on top of that. That's basically the on-disk format of each of these reflections. We actually allow the user, the administrator of the system, or somebody who has a good knowledge of the data, to go and tune these reflections. They can say, "You know what? This set of queries ..." Say I just got a phone call from a business analyst in the marketing department, saying that he's running some queries and they're too slow. If I was the administrator of the system I could say, "You know what? Let me add a new reflection that's optimized for that type of workload."

Once I've defined that, and that may take me a minute, or maybe two minutes, in the system, just a few clicks, that marketing user will now have very fast response times, and they won't have to change any of their client applications. If they're using something like Tableau, they won't have to change the worksheet, or their dashboard. It's entirely transparent to the end user. That allows the administrator of the system to fine-tune these reflections. That's important, because while we have some kind of automated capabilities, we also have a voting system, where users can vote for things that they want to be faster.

At the end of the day there are things that we don't know. At one of our customers, for example, the CFO has their own set of reports that they look at every day, and it's just one person. It's not something that's very frequent in the system. There's no way we could've known that that was important, but because they are the CFO, they are naturally important. We want to provide that kind of flexibility to users, to be able to control which reflections exist in the system, and they can add and remove them as they want. It's actually very easy.

Interviewer:

If I recall, the Apache Arrow project is for in-memory data set interoperability. You should be able to share your data that is in Python in memory ... You're doing something with Python, like Pandas data science stuff. It's sitting in memory. You should be able to shift that data to Spark and do stuff in memory with Spark, or just have that data sitting in memory and you can run Spark operations on it. You can also run Python operations on it. Arrow is the format that allows for that interoperability, is that right? Am I accurately describing Arrow?

Tomer:

Yeah, that's exactly right. Arrow is all about having a columnar in-memory kind of technology for representing data in memory and processing it, and leveraging the modern CPU and GPU. It's also a standard, a way for different systems to share the same type of memory structure.
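
For readers who want to see that interoperability in code, here is a minimal PyArrow sketch of moving a Pandas DataFrame into Arrow's columnar form and back; nothing here is Dremio-specific, and the data is made up.

```python
import pandas as pd
import pyarrow as pa

# A Pandas DataFrame converted into an Arrow table. The resulting columnar
# buffers can be handed to any Arrow-aware engine without row-by-row
# serialization.
df = pd.DataFrame({"city": ["Mountain View", "Oakland"], "events": [120, 80]})
table = pa.Table.from_pandas(df)

print(table.schema)                # Arrow schema inferred from the DataFrame
round_tripped = table.to_pandas()  # and back into Pandas
```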

Interviewer:

At Software Engineering Daily, we need to keep our metrics reliable. If a botnet started listening to all of our episodes and we had nothing to stop it, our statistics would be corrupted. We would have no way to know whether a listen came from a bot or a real user. That's why we use Incapsula, to stop attackers and improve performance. When a listener makes a request to play an episode of Software Engineering Daily, Incapsula checks that request before it reaches our servers and filters the bot traffic, preventing it from ever reaching us. Botnet and DDoS attacks are not just a threat to podcasts. They can impact your application too.

Incapsula can protect API servers and microservices from responding to unwanted requests. To try Incapsula for yourself, go to incapsula.com/2017podcasts and get a free Enterprise trial of Incapsula. Incapsula's API gives you control over the security and performance of your application, and that's true whether you have a complex microservices architecture or a WordPress site, like Software Engineering Daily. Incapsula has a global network of over 30 data centers that optimize routing and cache your content. The same network of data centers is filtering your traffic for attackers, operating as a CDN, and speeding up your application.

They're doing all of this for you and you can try it today for free by going to Incapsula.com/2017podcasts and you can get that free Enterprise trial of Incapsula. That's Incapsula.com/2017podcasts. Check it out. Thanks again Incapsula.

If you've got data in a MySQL database and an Elasticsearch cluster and three other data sources, and you want to take a physical data set that has been earmarked as a virtual data set that people like, and you want to pull it into a reflection. You maybe want to get all that data into Arrow, in the Arrow format, so that it's all interoperable, and then you get the Arrow format put into Parquet, so it's in a consistent on-disk format, and then you put the on-disk format into a reflection. Is that right?

Tomer:

Yeah. Everything in Dremio, as soon as it leaves the disk, or the source system, becomes the Arrow format. All of our internal execution is based on Arrow. That means that as soon as we read records, a batch of records from Elasticsearch or from HDFS, immediately that becomes a batch of Arrow in memory. As it's executing in our entire execution engine, it goes from one Arrow buffer to another buffer as it's going through the different operators. Then you can also take the results of an execution and you could ... The goal really is to be able to use something like Python. Pandas supports Arrow as its high-performance memory representation.

Now imagine being able to take the result of a join ... Whether it's just a join of two tables in Hadoop, or S3, or a join across different systems, and be able to do that analysis in Pandas, without having to deserialize and serialize data. That's really the goal with Arrow, and it's something we at Dremio put out as open source about, I want to say, a year ago. It has since become the standard memory representation for Pandas and NVIDIA's GPU data frame, and some of the GPU databases now use it as their in-memory representation as well. It's quickly becoming what we had hoped, which was to create this industry standard way to represent data in memory for analytical use cases.

Interviewer:

When those reflections get pulled out into the BI tool, are the reflections getting pulled out into Arrow?

Tomer:

Yeah.

Interviewer:

Then being read by BI tools?

Tomer:

One of the things we've done is we've developed a very high performance kind of translator from Parquet to Arrow. The reflections are stored in Parquet, because there are efficiencies in Parquet that we can leverage, especially with how we use Parquet to store the data in a very highly compressed way. But once we want to read that data, immediately we read it into kind of the Arrow format. As more and more of these client applications, such as Python and R, but in the future also various kind of commercial BI tools, embrace Arrow as their way of ingesting and leveraging data, and some of them are actually working on that right now, you'll have an extremely high performance way to move data into those systems as well, without having to go through kind of the traditional ODBC protocol.
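
A small sketch of the Parquet-to-Arrow read path using PyArrow's public reader; the file path is made up, and this stands in for, rather than reproduces, Dremio's internal reader.

```python
import pyarrow.parquet as pq

# Read a Parquet file (compact on disk) straight into an Arrow table in memory.
table = pq.read_table("reflections/city_rollup.parquet")

# Column projection keeps the scan narrow, which is where columnar formats shine.
cities_only = pq.read_table("reflections/city_rollup.parquet", columns=["city"])

df = table.to_pandas()  # hand the Arrow table to Pandas with minimal copying
```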

Interviewer:

I see. Today maybe things are not as fast as they will eventually be, because, just to refresh people, you've got ... Okay, the extended explanation. You've got a MySQL database, an Elasticsearch cluster, three other data sources. Everything has different latency. All the data has different structures. You go into Dremio, you label some of those physical data sets as virtual data sets, like, "Oh, I'm going to want this join between my MySQL database and the data in Amazon S3, or RDS, or something." You get that join, label it as a virtual data set. On some scheduled basis, that virtual data set gets translated into a Dremio reflection, which is a materialized view that sits on disk, so that you can access ...

Basically a cache on disk, so that you can access it faster, so you don't have to do the entire join on the fly. You've just basically got a cache on disk, and it's cached on disk in Parquet format, which is a columnar format, and people who want to go back and listen to the episode I did with Jacques to learn about columnar data can do that. We went really into detail on the Parquet and Arrow stuff. Then it's sitting on disk in Parquet. When you want to access it from a BI tool, today you need to trans ... You might need to figure out how to translate ... Well, or maybe there's some plug-in. I think you said ODBC, I don't know much about what that is.

Open database, or something. Anyway, in order to pull that from the Parquet file into your BI tool, there is some degree of latency or something, because there's not an easy Arrow translation layer. Maybe you could disambiguate that process, what you were referring to.

Tomer:

Yeah, so today if you're using a client application on your desktop, whether that's Excel, or something like Tableau, those tools typically use a local API called ODBC. There's another one called JDBC for Java applications. That's how they want to talk to databases. Then you have an ODBC driver for each database on that client's machine, which means that in the case of Dremio, we can maintain everything in Arrow all the way to the client, over the network, but then it does have to get translated from Arrow into that ODBC API, so that these traditional BI tools can use it.

That doesn't have to happen for things like data science tools, let's say Pandas, because Pandas now supports Arrow as a native memory representation. It can actually do operations on that. [inaudible 00:44:06], the creator of Pandas, is actually one of the primary contributors on the Arrow project. For some tools, like Python for example, you don't need to go through that translation, while for other tools that are Windows applications, like Tableau, you do need to go through that translation.
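
As a rough sketch of the traditional ODBC path being contrasted here, under the assumption that an ODBC data source named "Dremio" is configured on the client machine; the DSN, credentials, table and query are all hypothetical.

```python
import pandas as pd
import pyodbc

# Hypothetical DSN pointing at a Dremio coordinator; the exact connection
# string depends on the ODBC driver installed on the client machine.
conn = pyodbc.connect("DSN=Dremio;UID=analyst;PWD=secret", autocommit=True)

# The client speaks plain SQL; results are marshalled through the ODBC layer
# row by row, which is the translation step Arrow-native clients avoid.
df = pd.read_sql("SELECT city, COUNT(*) AS events FROM events GROUP BY city", conn)
print(df.head())
```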

Over time we expect that more and more of these client applications will simply embrace Arrow natively and they won't have to go through that translation layer that they all have to go through today. That's an ongoing thing, I think. Depending on the use case it may or may not matter so much. Oftentimes for BI tools, the amounts of data that are being shown in the tool usually aren't very big. At the end of the day it's painting dots that a human eye, in the case of something like Tableau, needs to be able to see on that report.

It doesn't really help if there are a billion dots. You can't really visualize that. Most of the time these data sets are much smaller, because what's happening on the back end is that Dremio is already getting instructions from Tableau to aggregate the data by city and state and customer. We're already doing that aggregation and just sending back a much smaller amount of data.

Interviewer:

What's interesting about this project to me is, I think you and I first talked a couple of years ago, when we were talking about Drill. We did the show about Drill, and Dremio was just a splash page with very little information. Now you've got a full-fledged project and it makes total sense to me. It's unlike anything I know of. Maybe there's something internally at Google where they use something like this. It sounds like a pretty cool and differentiated project, where you've got a moat in the sense that you've got this caching system that you've developed. I'm sure that'll get more sophisticated over time.

I'd love to know, did you know that this was the product two years ago? Was there some finagling with the strategy?

Tomer:

I mean, we kind of knew at a high level what we wanted to achieve. I had come from kind of a ... I was one of the early employees at MapR. I was VP of product there, spent five and a half years, and one of the things I observed over those years, kind of the emergence of big data, was that it was just way too hard. At the end of the day people wanted to leverage data, but they then had to go hire 20 data engineers and train everybody, and it was very hard to get value out of the data. Especially to enable non-technical users as well to get value out of the data. We looked at that and we said, "Why is it so hard for companies to leverage their data?"

That was kind of the initial question we were trying to answer. As we started interviewing more and more of these companies, and especially kind of the larger, Global 2000 companies, I mean the brands you're familiar with, it became clear that there were just so many different kinds of point solutions and legacy technologies that they had to deal with. If somebody could really develop a system that would abstract all of this away and provide performance without making the user have to think about it, that could really be a game changer in analytics, and that could finally enable companies to start capitalizing on the data that they already have in various places.

That was the goal. The technology has definitely evolved over time. We had to build an entire engine from the ground up for this purpose. We couldn't use any existing SQL engine, or Drill, or any of those things. None of those met the need for the performance and the types of things we're doing with Reflections. We built that, and we ended up building Arrow as part of that and actually open sourcing it along the way. Now, when we launched the company, we also said, "You know what? All of our executive team actually comes from companies like MapR and MongoDB and Hortonworks and has a ton of experience with open source, and we think that's the right strategy in 2017."

Dremio is itself an open source project now, one that many companies are downloading every day and putting to use.

Interviewer:

Did Dremio open source the project itself, meaning the thing that pulls from Parquet files, basically? What is the Dremio open source project?

Tomer:

Actually, everything we talked about today. That's all open source. You can go to dremio.com, or our GitHub page, and download the software and run it on a cluster, either your Hadoop cluster, or you could run it in the cloud, and connect to your different data sources.

Interviewer:

Oh, I see. Okay, the thing that is not open source is essentially the reflection builder. If you're not using the Dremio Enterprise product, the business model, you're pulling from your virtual data sets and you have to schedule yourself when these virtual data sets are going to get materialized and when you're going to pull them into memory?

Tomer:

Well, no, actually. The reflections and the acceleration capabilities are all open source as well. They're all part of the community edition. What's not in the community edition is, first of all, kind of the enterprise security. The ability to connect to LDAP, for example, for user authentication. The ability to control access to different data sets. Then the second thing that's not available in the community edition is some of the connectors to things like Teradata and IBM DB2. Some of these technologies that are much more enterprise oriented.

Interviewer:

Oh okay.

Tomer:

Yeah.

Interviewer:

Those connectors are ... Those are like between the reflection and the client-side application? Or is it between the on-disk physical data set and the reflection?

Tomer:

Well, it's actually ... Dremio will connect as a system. The first thing you'll do is you go to our UI and you'll see add source. You'll connect to your favorite data sources. That includes your Elasticsearch clusters, your Hadoop cluster, your MySQL database and so forth. You're connecting to your different data sources. If you're using the community edition, you can connect to something like 10 different data sources, and there are just a few that are only in the Enterprise Edition. That's the difference there. Regardless of which edition you're using, users can log in. They can create new virtual data sets. As an administrator, you can manage the reflections that are being maintained behind the scenes and help with acceleration.

We think that certainly a startup could get started with the community edition and use that in production at any scale, up to 1,000 servers, without a problem and not have to pay anything.

Interviewer:

I guess I'm still not quite ... I mean, I fully believe this business model makes sense. I'm having a little trouble understanding it. Where do the Teradata and DB2 connectors, or whatever ...

Tomer:

Sure.

Interviewer:

Basically, I mean you just described a world in which there is a long tail of ways to access your data. Unfortunately, you want to be able to do joins between your Teradata, or your DB2, or whatever databases we're using from the COBOL era. Maybe that's DB2. You want to be able to work with these data sets just like your RDS cluster and your MySQL. You want to use them like they're modern data sets. In order to do that, you should have some sort of connector that plugs them into Dremio, and I'm just having a little bit of trouble understanding what that connector is, because that's part of the premium offering.

Tomer:

Let's take a step back. Dremio has connectors. As part of our software we ship connectors to different data sources. Data sources are Teradata, Oracle, SQL Server, Hadoop, S3 and so forth. We have a collection of different connectors, and we're constantly adding more and more of these connectors. This is what allows our software to connect to different data sources, to push down queries into them, to be able to read them and so forth. Then when a BI tool, for example Tableau, or a tool like Pandas in the Python world, connects to Dremio, we look like a single relational database to that tool.

We're abstracting away all the different data sources and making it all look like one relational database, where you can join things together and so forth. If you're using our community edition, you're able to connect to Elasticsearch and S3 and Hadoop and Mongo and SQL Server and so forth. But if you happen to have a Teradata cluster, you're not going to be able to connect to that, because that's reserved for the Enterprise Edition.

Interviewer:

Okay, all right. I think I got it. The motivation for building Arrow, or starting the open source Arrow project, that's pretty interesting, because that was basically a way of crowdsourcing the connectors between the on-disk representations of the physical data sets that you could label as virtual data sets. That was a way of crowdsourcing, how do we get those virtual data sets into our execution engine, so that we can model it however we want and put it into Dremio reflections.

Tomer:

Yeah, Arrow ... When we talk about crowdsourcing, it's a long game. To create a world where everything in the industry uses one memory format, I mean, first of all that's never going to happen entirely, but it's something that's going to take time. Over the last year we've seen a variety of different technologies now embrace Arrow, and that's of course helpful for us as well, and just helpful in general for the industry.

Interviewer:

No, I didn't mean to phrase it like it's a Machiavellian thing. It was just like, what's the ... If we do these three things, will that get the community going in a direction that really lifts our boat in the way that we need ... Because it's going to raise all boats, but we built a really big boat.

Tomer:

Exactly, and that's exactly right. Our goal was, let's provide value. The only way you're going to get anything adopted by anybody else is if you provide value to them. The fact that we provide Arrow, which is a great library for dealing with data in memory, solves a lot of problems that a bunch of people otherwise would have to solve themselves, and they don't want to. If there's something open source, they can just adopt it. We put that out there. Why does that benefit Dremio? I think it indirectly benefits Dremio, because Arrow is the memory format that we chose as well, of course. By having other systems talk and use that same format, we can then have, over time, better and better interoperability with different systems.

Interviewer:

We're talking about open source now. I think it's worth taking a step back and looking at the history of this, because it's a pretty crazy lineage. I think I understand the lineage. The Dremel project originally came out of Google, and Dremel is phonetically similar to Dremio. I'm assuming Dremel is an ancestor of Dremio. Dremel was kind of this columnar ... It's a system for doing faster analytics. BigQuery is based off of Dremel. That's Google's BigQuery service, very popular. And I think Drill had something to do with ... Apache Drill, was that the open source version of Dremel?

Tomer:

Well, it's hard to say the open source version of something, right, because Google builds projects internally, they don't really open source them.

Interviewer:

Right, the white paper.

Tomer:

Yeah, they wrote a white paper. Drill [crosstalk 00:56:37] was a SQL engine, primarily focused on Hadoop, and there are a number of others like Hive and Impala and Presto that kind of serve a similar purpose of being a SQL engine. Dremio is a very different type of project, obviously, from the fact that there's the whole acceleration of queries, being able to accelerate queries by orders of magnitude, and having a way for users to collaborate and curate data sets in the system. Then also being able to push down queries into all these different types of systems. It's a very different system.

You're right that the name Dremio did ... Well, we were looking for a short company name that had an available domain name, and that was kind of hard to find. We ended up with that, but I'd say the vision and what we're doing is a lot broader than what these kind of SQL engines were doing.

Interviewer:

It's catchy. I like the [narwhal 00:57:36]. I think it's funny, you compare the Dremel strategy, or basically the Bigtable strategy, where Google released these white papers and then the community would sort of, like, shamble slowly towards the Google infrastructure. Everybody is constantly 10 years behind what Google has, because Google just releases these white papers. You compare that to today, the [Kubernetes 00:58:05] and [TensorFlow 00:58:05] strategies, where they open source it and everybody immediately adopts what Google is doing, and then Google gets to ... Kind of similar to the sort of Dremio / Apache Arrow thing. Google had built the biggest ship, so they get ...

When the rising tide lifts all boats, they get the most value out of it. What do you think about that contrast between the white paper strategy and the open source strategy at Google? Do you think this is a shift in Google's strategy, or do you think they're going to be selective with the white paper versus open sourcing strategy?

Tomer:

I think it's definitely a shift, and I think it started when they started investing more in their cloud infrastructure. They realized that if they want to appeal to companies to leverage Google Cloud, they need something to draw them there, versus Amazon being, of course, the first player in the market and the largest player, and Microsoft having kind of a ... Owning the enterprise, if you will. What is their strategy? I think they realized that if they can open source some of these technologies and get developers to start using them, and if they're the best place to host that [Kubernetes 00:59:17] environment, or TensorFlow workloads, then that gives them an advantage.

I think it's a smart move on their behalf. I think they also observed what happened with some of these white papers, where ... Going back to the MapReduce days, they wrote a paper on MapReduce and then it was implemented years later in the open source community. Of course, when Google came and said we want to provide a cloud service, they couldn't then offer their superior MapReduce service, because it was a different API and everybody had already built apps on the open source version.

Interviewer:

I didn't realize that.

Tomer:

Probably lessons learned in terms of let's get people developing on our APIs early on.

Interviewer:

Hilarious. It sure is a great world for developers these days.

Tomer:

Yeah, it is. There is a lot of free stuff out there. You can download things, and it's also why ... Even if you look at the world we're in at Dremio, where we're actually closer to that business analyst and data scientist, where actually a lot of tools are proprietary, we still thought that the right strategy was to provide something open source, because it encourages that kind of bottom-up adoption, where people can go download it. They love that, the ability to get started. They don't have to talk to a salesperson. Developers hate that. I hate that.

Interviewer:

Did you build a visualization tool? Your own visualization tool?

Tomer:

We built kind of a ... From a visualization standpoint, our UI looks like Google Docs for data sets. People can create new virtual data sets. They can collaborate, share them with their colleagues and kind of build on top of that. Then there's kind of a data set editor, where they can curate the data. They can massage the data, but we don't provide that kind of last-mile visualization the way a BI tool would. We very much prefer to partner with companies like Looker and Tableau and Qlik and so on.

Interviewer:

But that Google Docs style stuff, all of that is open source?

Tomer:

Yep that's all open source.

Interviewer:

That seems pretty unique. That seems pretty differentiated. Nobody does that, right? Like Looker and Periscope Data and stuff, they don't do that stuff, right?

Tomer:

No, that's correct. Looker very much focuses on the analysis, as opposed to getting the data ready. We actually partner very closely with Looker. We actually share a board member with them.

Interviewer:

Oh, of course. Okay, interesting. All right. In that case, let's kind of ... I guess we're drawing to an end. Let's kind of close off with contextual stuff. I mean, what is the modern ... What is the most modern data scientist doing with Dremio? We talked a little bit about the Periscope Data / Looker set of tools. What are the tools that the most modern data scientist is using today? What are they doing with them?

Tomer:

Another way to think about it is, there are companies whose goal it is to provide self-service visualization, or self-service kind of data science. These are companies like Looker, or Tableau, or Microsoft Power BI. That's what they do. Dremio's goal is to provide self-service for everything underneath that. In the past you needed to have a data warehouse and a bunch of ETL tools and cube-building technologies and pre-aggregating data and extracting it and all that kind of stuff. We wanted to create self-service at the data layer, for that entire data platform.

Then you have an entire end-to-end stack that's self-service. The visualization comes from Looker, Tableau, Power BI, et cetera. Everything underneath that comes from Dremio. The data scientist today will use ... That term data scientist is a little bit ... It is very broad, right? It means [crosstalk 01:03:09] different things to different people. For some people a data scientist is somebody who writes SQL queries. For others it's somebody who uses a visual interface like Tableau, or Looker. Then for other people it's more about a machine learning person who builds models and deploys models, and they may be using something like Python or R for that kind of use case.

Interviewer:

The Dremio business model, are you looking at it ... Are you modeling yourself after ... Because I'm trying to think of an analog. It's not really like the MapR world that you come from, because the whole MapR ... The MapR and [Cloudera 01:03:46] and whatever ... There's a third also that I'm forgetting.

Tomer:

Hortonworks.

Interviewer:

Hortonworks, right. It was interesting, because that model of company became sort of like a consultancy type of model, but I think part of that was because this was so hard to do. Everybody wanted Hadoop, everybody wanted the big data, but nobody had any idea how to set it up. You really needed this group of consultants to come in and help you set up and give you enterprise software that made things easier to use. We're kind of in a different world and Dremio does not need to be anything like that. Am I getting it wrong?

Tomer:

You're 100% correct. It was very refreshing to go from the Hadoop world, where you sell somebody Hadoop and now it takes them six months before they get any value out of it, to a world where on the first day that somebody starts using the system, they're already solving really hard business problems that they've had for a long time. The spark that you see when we demo the product to people and they see the actual demo, they love it. That's really refreshing. We don't sell professional services. There's no need for consultants. We can help them on the first day to install it and integrate with their systems, but that's really all they need.

But when it comes to kind of the open source and the business model, I would say of the three Hadoop vendors we're most similar to Cloudera, probably, in that we take the community edition, which is kind of an open source, Apache-licensed version, and then we have an Enterprise Edition, which is additional functionality that people pay for.

Interviewer:

Okay. All right, Tomer, this has been a fantastic conversation. Really technical, interesting product, but also a very subtle business model. I loved talking to you about Dremio. If you continue to do more shows, I'm looking forward to it.

Tomer:

Yeah, thanks so much for hosting me. I really enjoyed the conversation.