
Dremio 1.4 - Technical Deep Dive

 

Transcript

Kelly Stirman:

Okay, so we wanted to talk about Dremio 1.4. We shipped Dremio 1.0 back in July, and basically about once a month or so we've done a new release of the product that is primarily a stability and maintenance release, but we have continued to add features in every single release. Even though it says 1.4, this is a functional release with some new capabilities, which, as I said, has been true for each of the releases we've shipped since 1.0. Here's a quick summary of what we've added in 1.4. Actually, a couple of these things are quite big and quite meaningful for some of our customers, and we'll go through each of these in a little bit of detail and talk a little bit about what's coming in the future as well. So, Can, let's talk about the first one here. This is related to improved high availability. Why don't you talk us through the state of high availability prior to 1.4 and what's new in this release?

Dremio 1.4 Key Enhancements

Can Efeoglu:

Sounds good, Kelly. Prior to this release, Dremio provided high availability through an internal process, and there were several drawbacks to this. First of all, Dremio would rely on an internal ZooKeeper, which in most of our customer deployments was not the ideal method for providing this, and secondly, in releases prior to 1.4 there would need to be an external process which would detect the failover and manage ... sorry, detect the failure and manage the failover. In this release, however, this is actually done internally by Dremio, so you no longer need the external monitoring process to manage this. And on top of this, we rely on ZooKeeper to handle this, which provides, first, more robust detection capabilities around actual failures, as well as a faster overall failover time.


Kelly Stirman:

Okay. And you know in distributed systems, there are stateless and stateful pieces of the architecture. What is stateful here in terms of metadata, and how is that being managed as well?

Can Efeoglu:

So, Dremio under the covers stores the metadata in its metadata store, which, in a high availability setup, would be stored on a shared network drive. What's happening here is that when a failover happens, Dremio ensures the consistency and integrity of the metadata internally and has the new master that comes online connect with it directly, and the election process for the new master is now done through the ZooKeeper logic.

Kelly Stirman:

Okay. Cool. Sounds good. Great. So let's talk about the next thing, which is Azure Data Lake Store, ADLS, as the cool kids like to call it. So tell us about ADLS.


Can Efeoglu:

Yeah, this is one of the most exciting features for our expansion into Azure deployments, so this has been a popular one among the customer base. So with this release, first of all, you can now query data in Azure Data Lake Store from Dremio, just like any other source we support. So that brings the capability to have a much closer integration with the Azure community.

And second of all, which is also very exciting, is that now you can store your reflections on Azure Data Lake Store. For deployments in Azure, this is a key critical component for providing the flexibility of having compute and storage for reflections separated, so that you can more flexibly scale up and down.

Kelly Stirman:

And one of the nice things about the Dremio architecture is that reflections don't require high-cost, high-performance storage technologies. On AWS you can use S3, and I guess now on Azure you can use ADLS. You can take advantage of the cost benefits of these object stores, but without making compromises on performance.

Can Efeoglu:

Exactly.

Kelly Stirman:

So really what this does is it sorta gives users parity with respect to AWS. The same kinds of things we could do in AWS in terms of data sources and where to store reflections, you now have comparable capabilities on Azure, which is great to hear.

Okay, great. Let's talk now about some enhancements to a few of the sources that Dremio's been supporting since the initial release. Tell us about what's new in Elasticsearch.

Can Efeoglu:

So yeah, we've been actually working hard on growing our Elasticsearch integration capabilities and expanding support for what we call pushdowns. And pushdowns are basically Dremio's capability of delegating processing of parts of, or the full scope of, a workload down into the sources. So in this specific case, we now support pushing down the NOT LIKE operator as well as the LIKE operator.

And we also now have improved logic where even if only part of a query can be pushed down, we will push that part down and process the rest of the workload in Dremio. Previously our logic was less intelligent about this; now the logic smartly divides up the work for the largest variety of workloads and ensures that whatever we can delegate, we will delegate, so we can ensure optimal performance working with Elasticsearch.
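To make the pushdown concrete, here is the shape of predicate that Dremio 1.4 can now delegate to Elasticsearch: both LIKE and NOT LIKE in the same filter. This sketch runs the SQL against SQLite purely to show the semantics; in Dremio the filter would be evaluated inside Elasticsearch rather than in Dremio's engine, and the table and column names here are invented for illustration.

```python
import sqlite3

# Hypothetical log table standing in for an Elasticsearch index.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE logs (host TEXT, msg TEXT);
    INSERT INTO logs VALUES
        ('web-1', 'GET /index ok'),
        ('web-2', 'GET /admin denied'),
        ('db-1',  'checkpoint done');
""")

# A filter mixing LIKE and NOT LIKE; Dremio 1.4 can push both
# operators down to the source instead of filtering locally.
rows = conn.execute("""
    SELECT host FROM logs
    WHERE host LIKE 'web-%' AND msg NOT LIKE '%denied%'
""").fetchall()
print(rows)  # [('web-1',)]
```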


Kelly Stirman:

Great. What is the state of things with Elasticsearch in terms of versions that we support with Dremio?

Can Efeoglu:

So the current product supports 2.x and 5.x, so any of those versions that you might have. We are actively working on improving our support for version 6, and that should be available in upcoming versions as well.

Kelly Stirman:

Okay, great. So I guess with ES6, there were some non-trivial changes that were breaking for some technologies like Dremio, and we just haven't quite caught up.

Can Efeoglu:

Exactly, there were some breaking changes in some of the APIs that we rely on. With them, we wanna make sure we take the product through a diligent QA process to ensure optimal performance while we do these things. And as you know, with Elasticsearch, when we were supporting Elasticsearch 5, we added Painless support, which was a critical piece in ensuring higher performance than what was available with our integrations with 2.x.

So we wanna make sure that that performance carries forward with version 6.

Kelly Stirman:

And if people try Dremio with 6.0, it's a little tricky 'cause some things will work just fine, and it's not obvious what will and won't work. So you may see things work on 6, but it's not comprehensively supported, so hold tight and we'll have that updated before too long.

Okay let's talk about Mongo and Hadoop. What is new with MongoDB?

Can Efeoglu:

So for Mongo, we actually greatly improved the experience of working with transformations and seeing immediate results. Previously we would rely on Dremio's own mechanism of sampling data from Mongo. Now we're actually using the $sample stage that comes in Mongo itself for versions 3.2 or later, which has greatly improved, first, the performance of getting previews from a collection, and second, the overall response times that you would expect to see while working with MongoDB sources.


And as for the HDFS and MapR-FS side, based on feedback from the community and our customers, we now have a capability for restricting access to a certain path, so a certain root of the file system. And this actually helps serve scenarios where you wanna limit access. For example, you might have three different paths for a given data source that you wanna make available to different groups or different sets of business units in an enterprise. And this gives you more flexible and independent management of such different paths and more flexibility in how you can define access controls.

Kelly Stirman:

Okay. I really like this enhancement to Mongo personally. I remember the $sample stage in the aggregation pipeline. When I was at Mongo, this was one of the features I advocated for strongly: a smart way to do sampling of data in Mongo. So now we're taking advantage of that feature. It does random sampling, randomly walking the B-tree to find different records, and it's also vastly more efficient in terms of the workload it applies to Mongo, so that's a great enhancement.
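For readers who haven't seen it, the $sample stage discussed here is just an ordinary stage in a MongoDB (3.2+) aggregation pipeline. This minimal sketch shows the pipeline structure a client such as pymongo would send; the collection name and sample size are hypothetical.

```python
# The $sample aggregation stage: ask the server for a random sample
# of documents instead of sampling client-side.
sample_size = 1000
pipeline = [{"$sample": {"size": sample_size}}]

# Against a live MongoDB 3.2+ connection this would be something like:
#   db.my_collection.aggregate(pipeline)
print(pipeline)
```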

Great stuff on HDFS and MapR-FS as well. Nice.

So let's talk a little bit about performance. What's changed in performance for users of Dremio?

Can Efeoglu:

So we get back performance in two main ways here. One of them is the execution performance, and the second one is how we generally deal with metadata caching and the performance around that. So let's start with the execution side first.


So as you know, we're a major part of the Apache Arrow project, and we're going to continue improving both the project itself and how Dremio integrates with Arrow internally in its execution engine. So one of the more exciting things we've done this release is we've actually updated the Java implementation of Arrow and incorporated a lot of these enhancements back into Dremio. In real-life workloads this will mean things like less heap usage for a variety of execution profiles, as well as improved performance overall in terms of speed and time.

And on the metadata side, Dremio already has an advanced metadata caching engine covering a variety of different data sources. So this means things like Dremio can intelligently expire certain cached versions of metadata based on user input, and Dremio can intelligently refresh cached metadata at certain intervals to ensure that it's up to date based on your SLAs.

So with this improvement, we actually improved the way we cache and index that metadata. So whether you're having requests from, say, BI tools or any external client that's requesting metadata, or from Dremio's own internal execution, Dremio will leverage more indexing and filtering based on those indices, as opposed to having to go to the individual systems or having to scan through all of Dremio's internal metadata.

Kelly Stirman:

This is really useful and important when connecting to big sources with hundreds of thousands of partitions, or thousands of tables in a database. It really matters.

Can Efeoglu:

Exactly. And this is actually based on what we were observing in the variety of customer deployments we had, where we were seeing Hive tables, for example, with hundreds of thousands of splits or files, and you have to keep delivering interactive experiences even in such situations.

Kelly Stirman:

So as I recall, in the notes to the community about this release, I provided a link to a blog post we did outlining the performance enhancements to Arrow on TPC-DS or TPC-H, I can't remember which. But these were not trivial changes in efficiency. Anywhere from 20 to 80% improved latency on these queries, right across the board.

So I also think knowing the team that was working on this, there's still a lot of runway for us to keep improving the performance of Arrow, which is already exceptional. So it's an exciting area I think we'll have more to talk about in future releases.

Can Efeoglu:

Exactly. And one thing worth mentioning is that since Arrow is such an integral and valuable piece of the Dremio execution engine, any percentage improvement you can achieve there actually reflects as a larger improvement on the performance of the overall pipeline, so we'll keep seeing more of these things [crosstalk 00:13:42].

Kelly Stirman:

The inner loop, as they say.

Can Efeoglu:

Exactly.

Kelly Stirman:

Okay so let's talk about, see what's next here, LDAP group membership via SQL. So who cares about this feature? Why is this important?

Can Efeoglu:

So this has come up in every single one of our larger customer deployments, and it has been one of the top requested features that we've been hearing about on the security side of things. Prior to this functionality, you could already do row-based filtering or column masking based on which user was running the query. Now we've extended that functionality to group membership, so you can do things like, hey, this group should see the east region records for this table, and this group should see the west region records for this table.

So this basically gives you much more flexibility, beyond just who the user is, to dynamically base things on group membership, delivering much higher levels of governance.


Kelly Stirman:

Yeah, I think one of the things we hear is that even if a user has embraced LDAP as a central mechanism to control authentication and group membership, it's very nice to have this additional layer of flexibility that can get you down to kinda cell-level controls in a programmatic way.

This is just a simple example of a case statement that maps or controls exactly what sensitive data is returned from a table. This of course would be embedded in a virtual dataset, and then a user would be able to see that virtual dataset much like a view in a relational database, even if the underlying data is in JSON, or Avro, or Parquet, or what have you.
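The pattern Kelly describes can be sketched as follows. Dremio exposes the current user's group membership to SQL (the exact built-in function name is Dremio-specific); this toy version emulates it in SQLite by registering a hypothetical is_member() helper for the session, and all table, group, and column names are invented for illustration.

```python
import sqlite3

# Would come from LDAP in Dremio; hardcoded here for the sketch.
CURRENT_USER_GROUPS = {"east_sales"}

def is_member(group):
    # Emulated group-membership check (Dremio's real function is built in).
    return 1 if group in CURRENT_USER_GROUPS else 0

conn = sqlite3.connect(":memory:")
conn.create_function("is_member", 1, is_member)
conn.executescript("""
    CREATE TABLE orders (region TEXT, rep TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('east', 'ann', 100.0),
        ('west', 'bob', 200.0);
""")

# The "virtual dataset": rows are filtered by group membership, and the
# rep column is masked unless the caller is in the matching group.
rows = conn.execute("""
    SELECT region,
           CASE WHEN is_member('east_sales') THEN rep
                ELSE '***' END AS rep,
           amount
    FROM orders
    WHERE (region = 'east' AND is_member('east_sales'))
       OR (region = 'west' AND is_member('west_sales'))
""").fetchall()
print(rows)  # only the east row, with rep visible
```

An end user querying the virtual dataset never sees the CASE logic; they just get the rows and values their groups entitle them to.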

Okay, great. So now let's look at ... let's see what we have after this. So those are the key things in 1.4. Anything that we forgot to talk about that you want to mention before we look a little bit into the future?

Can Efeoglu:

There are a few other smaller enhancements that are very useful in the context of data cleansing, for example. One of the things that we were getting a lot of requests for, from the community side, where multiple users approached us in the forums, as well as on the customer side, is how Dremio deals with non-UTF-8 characters.


And one of the things we have done is we went through a variety of functions and ways for users to easily work with non-UTF-8 characters and replace them as a part of their workflow. So we're giving them flexibility on whether they wanna ignore that character or replace it with some other character, basically allowing them to keep working without having to stop or deal with their ETL pipeline. So now you can do that in Dremio without having to change any upstream or downstream operation.
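The two cleansing behaviors described, ignore the bad bytes or replace them, are the same choices Python's own decoders offer, which makes for a quick illustration of the idea (this is not Dremio's implementation, just the concept):

```python
# A byte string with a byte that is invalid as UTF-8:
# 0xe9 is Latin-1 'é', not a valid UTF-8 sequence on its own.
raw = b"caf\xe9 latte"

ignored = raw.decode("utf-8", errors="ignore")    # drop the bad byte
replaced = raw.decode("utf-8", errors="replace")  # substitute U+FFFD

print(ignored)   # 'caf latte'
print(replaced)  # 'caf\ufffd latte' (shows as 'caf� latte')
```

Either way, the workflow keeps moving instead of failing on the first malformed field.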

Kelly Stirman:

The incredible variety of character sets. I remember the day I learned that UTF-8 supports Klingon, among many other things.

Alright. So let's talk a little bit about the future. No commitments here on timing, but just to give you a heads up about what we're working on, looking a little bit into the future. Let's talk a little bit about external reflections. First, let's remind people: what is a data reflection in Dremio?

Can Efeoglu:

So data reflections in Dremio are alternative representations of your data, often physical representations optimized for certain types of analysis, that are invisible to the end user. What I mean is, let's say for a given dataset, you have a dashboard with a range lookup query that scans the raw data for a given set of filters. Then you have maybe an aggregation based on, say, sales by zip code, and you might have some other aggregation based on rep ID, giving you again the overall sales for the last month.

And all of these are certain questions that you're asking of that dataset. So what Dremio's data reflections enable you to do is optimize for a variety of workloads and queries on a given dataset, and still have your end users hit that dataset while Dremio intelligently handles how it routes the query to the different alternative physical representations, which elevates your interactive response times on that dataset.
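The routing idea can be sketched in a few lines. This is a toy model, not Dremio's actual planner: each reflection advertises which dimensions it covers, and the router picks the cheapest representation that can answer the query, falling back to raw data when nothing else covers it. All names and costs are invented.

```python
# Hypothetical catalog of physical representations for one dataset.
reflections = {
    "raw":          {"dims": {"zip", "rep_id", "date"}, "cost": 100},
    "sales_by_zip": {"dims": {"zip"},                   "cost": 5},
    "sales_by_rep": {"dims": {"rep_id"},                "cost": 5},
}

def route(query_dims):
    # Keep only reflections whose dimensions cover the query,
    # then pick the cheapest one.
    candidates = [
        (r["cost"], name)
        for name, r in reflections.items()
        if query_dims <= r["dims"]
    ]
    return min(candidates)[1]

print(route({"zip"}))          # served by the zip aggregation
print(route({"zip", "date"}))  # no aggregation covers it; falls back to raw
```

The end user's query never changes; only the representation chosen to answer it does.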


Kelly Stirman:

So in that sense it's sorta like an index in a database. A user sends their SQL to a database and it's kinda slow. A system administrator could go in and add an index. The user doesn't change their query, but now the query planner can be much more efficient in how it generates the query plan, and potentially cover the query from the index directly without going to the live data.

So these reflections today, which we've had since 1.0: they are raw reflections and aggregation reflections, and they live, as we talked about earlier, in a file system, like S3, or in HDFS, or MapR-FS, or [crosstalk 00:18:54], or ADLS, Azure Data Lake Store. So with all that as a little bit of background, what is an external reflection?

Can Efeoglu:

So this has been one of the features that we have been working closely with customers on implementing. What we're seeing in many deployments is that our users already have an alternative representation of their data, in different partitioning schemes or different sorting schemes, in Elasticsearch in a certain configuration, or in Oracle, and they don't wanna necessarily recreate that within Dremio. They want to be able to say, hey Dremio, this representation of my dataset matches this dataset that I have in Dremio; use that in your logic as a reflection.

And now Dremio, regardless of where this dataset lives, can understand what it is and use it in planning as one of the options it can substitute to make your queries go faster. This gives a variety of benefits. First of all, again, you don't have to recreate in Dremio a certain representation that you might already have in another system. Second, you can leverage more specific or niche processing capabilities.

For example, Elasticsearch is great at certain types of analysis, and you might wanna push down certain types of analysis to Oracle. Now you can do that for reflections on top of your regular work. So it basically gives you the ability to use the best technology for a specific use case, and Dremio still picks the most optimal representation for the query, so you don't have to worry about which one you're gonna route the query to.

You can pick an external reflection, or you can pick a system-maintained reflection. It all again goes through Dremio's query optimization process.

Kelly Stirman:

Alright I have a surprise question for you.

Can Efeoglu:

Go for it.

Kelly Stirman:

Could I make an Excel spreadsheet an external reflection?

Can Efeoglu:

Yep. As long as it actually lives on a file system of any sort, you can. It should be easily done. This whole system is flexible enough that you can map any dataset that you have, in any of the Dremio-supported sources, as a reflection.

Kelly Stirman:

Amazing. So even my kids could make an external reflection. Okay so external reflections not in 1.4 but something we're working on and should be available soon. I know that's true 'cause I've actually played with 'em so it's coming soon.

Now let's talk just briefly a couple more things. Interactive SQL. What is this all about?

Can Efeoglu:

So one of the main areas we wanna focus on is how we improve the experience of the analysts and how we make them more productive in Dremio. And a part of this is having a more interactive SQL experience, and that comes from multiple pieces of functionality, actually. One of the bigger things for that is enabling a better SQL autocomplete experience, where as you're working and writing your SQL query, for those power users of Dremio, Dremio will automatically suggest tables, columns, and datasets, whatever needs to be there, and give you the context around those.


On top of this, another big thing that we're thinking about here is guided function completion, where when you actually go in and start writing a function, Dremio will first show you the description of the function, present you with an example, validate your use of it as you're typing, and suggest what you could put for certain arguments of that function, without you having to figure it out or go consult some other location in the product.

So it's all the way around: how do we make sure that you write it faster, and once you've finished writing, it just works for you? As opposed to you having to run and test it, and all of the complications that go along with writing SQL.

Kelly Stirman:

Okay. And this is really all part of our bigger vision of having a truly world class experience. You don't need to know SQL to use Dremio but for SQL jockeys, we want the SQL console to be really world class and to intelligently help you craft and maintain your SQL expression.

Can Efeoglu:

You got it.

Kelly Stirman:

Okay. Anything else here? So cloud. What else is coming in cloud?

Can Efeoglu:

So we already talked about how we now support Azure Data Lake Store, and that's on Azure. On the Azure side, we will also be integrating Azure Blob storage in an upcoming release, because we also see that storage goes through many mediums in Azure. And on the Google Cloud side, we're also planning on having an integration for Google's blob store. So just like Azure now has parity with AWS, our plan is to also have parity on Google Cloud with what we have in AWS, complementing that storage and usage for our users.


Kelly Stirman:

So we really want it so that, effectively, we're an equal-opportunity cloud platform. Whichever cloud you wanna use, there's no compromise. It's the same sort of capabilities across the board.

Can Efeoglu:

Exactly.

Kelly Stirman:

Okay, that's it. Thank you for going through that with me, Can. Let's get to some of the questions. Again, if you have a question, please add it to the Q and A. We're happy to talk through these. I have several in the queue here that we'll start to work through over the next few minutes. Let's keep this interesting and fun for everybody.

First question here is ... it's sort of a foundational question, which is: what does a production deployment look like at a high level? Specifically, maybe we could start with an on-prem deployment and then talk about what things would look like in the cloud. Dremio is software you manage yourself, so you can run it wherever you like. It's not a SaaS product, at least not yet; that's something we're working on that will definitely come a little bit further down the road.

Let's talk about if you're gonna download Dremio. First of all, you can download and run it on your laptop. It's not a production environment, but we want people to be able to easily try it out and have a great experience in the first few minutes. But if you're thinking about a production deployment, what does a production deployment typically look like on-prem, and then let's talk about cloud.

Can Efeoglu:

Yeah, great question. So one of the first things to mention here is Dremio doesn't rely on any other piece of software for you to deploy. It can be deployed standalone if you don't have, for example, let's say an existing Hadoop infrastructure, or it can play along with the Hadoop infrastructure you might have. So depending on your preferred method of deploying, you could use our YARN integration on the Hadoop side to automatically deploy and scale your infrastructure up and down.

So in a typical Dremio deployment ... let me actually roll back a bit and explain the node types that exist in Dremio today. A Dremio cluster consists of one master; multiple coordinator nodes, which basically handle query planning, serving the web UI, and coordinating the ODBC and JDBC client connections; and a set of execution nodes, which handle query execution and returning the results to the client at the end of the day.

And the way to think about these different types of nodes is you basically scale out your coordinator side to handle more concurrency and more users and deliver faster query planning times, and you can independently scale your execution side to handle larger, more intense workloads, depending on what your needs are. So that's basically the on-prem side of things.

In a cloud scenario, just as easily, you can leverage EC2 nodes, Azure nodes, or Google compute nodes, which we actually all use in our internal QA today, all three clouds. And again, deploy your coordinator nodes depending on the concurrency needs you have, and deploy your execution nodes depending on the workloads you have.

Kelly Stirman:

So do I need some piece of software to do this provisioning of nodes, or how am I gonna do the provisioning part?

Can Efeoglu:

So again, in a Hadoop scenario, we usually see our clients wanting to utilize their existing YARN configuration, so basically YARN manages how much resources Dremio gets and makes sure that it's deployed within that limit. And in any other scenario, there could be many alternatives that people could use. Internally, for example, we actually use Mesos to deploy Dremio, and we've actually seen a community Docker image that people are using to deploy Dremio.

We also see our existing users have their own mechanisms for deploying in such environments where they manage how Dremio is deployed and undeployed or upgraded.

Kelly Stirman:

Okay, so I have some number of coordinator nodes, which is probably smaller than the number of executor nodes, and a typical cluster is gonna be five, ten, twenty nodes. It sort of depends on the scale. We have customers running hundreds of nodes. We have people looking at over a thousand nodes. So it's definitely designed to scale, and scale using commodity infrastructure.

But most people are gonna start with a relatively small cluster, a dozen nodes or so. Again, you can try that on your laptop to get a sense for how these things work. So, some number of coordinator nodes, some number of executor nodes, and then the reflections, which are optional. I don't necessarily have to use reflections, but when I use reflections, they are going to live in a file system, which could be a NAS. It could be HDFS, it could be MapR-FS, it could be S3, it could be ADLS. All of those are viable options.

And then there is this metadata component that we talked a little bit about earlier that is going to live on a highly available file system. Okay, that's at a high level what Dremio looks like, and for those of you who want more information on that, if you go to our website and go to the resources section, there is an architecture guide that gets into the details of the architecture and the different pieces of the puzzle.

Okay, so the next question is: if someone asks you to describe the difference between Dremio and data virtualization, how would you respond? I have some thoughts on that, but Can, do you wanna take a stab at it first?

Can Efeoglu:

Yeah, sounds good. So that's a good question. We actually get that from a variety of prospects that we talk to. So at a high level, I would say data virtualization and data access are just one of the components of Dremio, in a sense. The way we see the self-service problem in the world is, you need to give users the capability to access any data, whether that's a SQL-based system or a non-SQL-based system, or whether that's an Excel file, basically, right?

That's the virtualization piece. Then you need to give them the context to understand what that dataset is and the capability to quickly do discovery on it. So these are things like: hey, do I have a description for this? Do I have an understanding of what these columns mean? Can I profile the data very easily? Can I understand who accessed this before, or what types of analysis are being done on it? So that's the data catalog portion of it.

And then you need to also have the capability to accelerate any type of workload that goes on it. So that means, basically, regardless of what that data size is or however many users are using the system, users still need to have an interactive response as they deal with the data, and that's a key part of what's perceived as self service. If you have to wait one hour for everything that you do and then you have to come back again, that's not an ideal experience for a data consumer or data analyst. It's about turning around results in a shorter time period than the hours or days or weeks of traditional analysis.

And the last piece of that puzzle is having this experience all work seamlessly, as opposed to having different pieces of software that you have to file tickets for or work with data engineering on, and a whole long process that might take two weeks, three weeks, four weeks end to end to get a result. So we think that having all of this functionality packaged together adds to the overall value holistically and makes sure that the end user experience is better.

Kelly Stirman:

I think that's great. Self service; data virtualization done much better for modern sources like NoSQL and Hadoop; the ability to accelerate in a scalable fashion in a way that data virtualization seems not to; data curation capabilities to reshape the data, integrated into the product; searchable metadata catalogs; lineage capabilities in our enterprise products; and of course, last but not least, this embracing of open source. So you don't really see an alternative out there that is as sophisticated and mature as Dremio that's open source.

So that's that at a high level. The last question here that I see so far is: how does this work in a Windows infrastructure? So today, if you wanna try this on your laptop on Windows, it works great; I've done it several times myself. But Windows is not supported for production deployments. It's really just a way for you to experiment and get a sense for the product. If you're going to deploy Dremio in production, it's supported on many flavors of Linux.

Let's see if there are other questions out there. If not, we'll wrap up and give you folks a few extra minutes. I don't see anything else, so I'll say thank you for joining. We appreciate your participation. Feel free to pose additional questions on community.dremio.com. We'd love to hear back from you with comments and feedback. Questions, we're here to help. Look for another release coming soon, and another one of these thrilling webinars as we talk about the capabilities in our next release.

Thanks again. Have a great day. Take care.

Can Efeoglu:

Thank you.