49 minute read · February 17, 2019
Dremio 3.1 – Technical Deep Dive
Webinar Transcript
Kelly:
Thanks for joining us for today's talk on Dremio 3.1, a bit of a deep dive on the release. I'm joined here by my two friends, and partners in crime, John and Jeff. We're going to talk through the release and answer your questions at the end.
So, first order of business, if we advance the slide here: how you'll ask questions. In Zoom, you have a button you can click to ask a question. Please enter them through this interface because it lets us triage the questions and mark some as answered, which just makes life a lot easier for us. So we'd appreciate it if you use that feature to ask your questions today. We'll have plenty of time to get to those at the end of the talk.
So, where are we? Well, we're in February, and a couple of weeks ago we launched Dremio 3.1. Just to give you some perspective on our release cadence, we came out of stealth less than two years ago, and unlike some software products, these dot releases can include new features and functionality. So 3.1.1, 3.1.2, 3.1.3: those furthest-to-the-right numbers do not introduce new features, but when you see a .0, .1, .2, .3, .4, it's almost always the case that we're adding new features in those releases.
So we're moving quickly, and we'll try to cover what's in 3.1 today, but by all means, if you have a question about something that came out in 3.0, feel free to ask and we'll do our best to answer.
One quick plug, because I think a lot of you on this call today will be interested: yesterday we launched Dremio University. We have basically two or three classes you can register for currently; it's a university program and we'll continue to roll out new classes throughout the year. We started with some Dremio fundamentals, covering the basics of how you use the product.
What's really cool about this experience is that when you log in, we will provision an instance of Dremio Enterprise Edition that is private to you, that you can do whatever you'd like with. Of course, we encourage you to follow along with the course, but it's yours to use for 48 hours and enjoy. So it's a great way to get experience with the product without worrying about where to install and run Dremio in the first place. That's Dremio University; we'll talk a little more about it at the very end.
Getting into 3.1, what are we going to talk about today? We have three key areas we want to cover, and I'll help facilitate the conversation, but these are features that Jeff and John are going to dig into. Again, please enter your questions through the Q&A interface. We're going to start with query planning and execution, with some updates there around workload management and some other topics.
Then we'll talk a little bit about data reflections, one of the most interesting features in Dremio and something that we developed a course for in Dremio University, because this is a really important and large topic. So we'll talk about some enhancements in 3.1 related to data reflections. And then finally, connectivity and our ARP framework.
So, let's get into these topics. First, multi-tenant workload controls, one of the big features we introduced in 3.0; there are some changes here in 3.1 that we want to get into. John, do you want to talk us through what workload management is in the first place? Why on Earth do you need this?
John:
Yeah, exactly. So as Dremio clusters grow larger and you have many teams working on them at the same time, some of the questions are: how do I isolate certain groups of users? How do I provide fair sharing of the cluster? And how do I prevent bad actors from taking over my cluster?
As we were thinking through all these problems and opportunities where we can help our customers, we started working on the workload management project, which is basically about giving you enough control to define resource usage and caps per user or per group, and to easily integrate that with the model companies already have internally.
That was the motivation for us to start on workload management. And how we get there is by introducing the concept of queues and rules. The way this works is, you define a set of resource queues to help manage things like concurrency, CPU priority, memory allocation, as well as time limits. So you can say, for example, "Hey, I have a marketing team queue that has high CPU priority, capped at 10 concurrent queries at a given time. And just to make sure one query doesn't take over, I'm going to limit the maximum amount of memory that a single query can take to, let's say, 30 gigabytes, so I have resources left for other users that come in."
On top of this, in many scenarios what you also want to protect yourself from is runaway queries. You intend a queue to have a certain capacity, but you don't want it to have queries that run for, say, 30 minutes. So you want to cancel queries that run longer than some limit, which you can also handle with queues.
Once I have these queues built for different purposes in my system, the second question becomes: how do I route my jobs into individual queues? We are introducing a smart rule-based routing mechanism for this, where, as a user or as an administrator, I can come in and say, "Hey, if the user is John, Kelly or Jeff, or they're in, let's say, the Credit Risk group, put them in this queue." Or, "Hey, this JDBC query coming from a client costs more than this threshold; actually reject that query, because I don't want that kind to overtake my cluster." Or things like, "Hey, if it comes in between 9 AM and noon, put it in this queue or that other queue."
Basically we're giving you enough of these targeting capabilities for you to match your existing setup and to offer a best-in-class workload management capability. So you have this set of rules that you define after you set up your initial resource queues, and in tandem they give you a world-class workload management system.
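To make the routing idea concrete, here is a minimal sketch of the kinds of boolean conditions John is describing, written as SQL-style expressions. The function and field names used here (is_member, user_name, query_type, query_cost) are illustrative assumptions rather than the exact Dremio rule grammar; check the workload management documentation for the real syntax.

-- Rule 1: route the Credit Risk team to a high-priority queue
is_member('CreditRisk') OR user_name IN ('john', 'kelly', 'jeff')

-- Rule 2: reject very expensive queries arriving over JDBC or ODBC
query_type IN ('JDBC', 'ODBC') AND query_cost > 30000000

-- Rule 3: send anything submitted between 9 AM and noon to a business-hours queue
EXTRACT(HOUR FROM CURRENT_TIME) BETWEEN 9 AND 11

The idea is that each rule routes matching jobs to a target queue, or rejects them outright, as John describes.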
Kelly:
Okay, super interesting. So, some of the things you have here are ways to construct these rules.
John:
You got it.
Kelly:
You have users? So I can identify a particular user, and also a group that they're a member of?
John:
Yeah. We typically connect to identity providers, like Active Directory or LDAP, and you can use the user and group definitions from there in your workload management.
Kelly:
So if you're using Enterprise Edition and you're integrated with LDAP, you can take advantage of user and group identity in these rules. You can also take advantage of the type of query, so some of those are ODBC, JDBC, or through the web interface of Dremio.
John:
Yeah, you may want to budget for your reflections in a separate queue, or have a shared queue for them across teams. You can target reflections, for example, to a specific queue separate from your ad hoc queries. One big pattern is to have an ad hoc queue as well as a reflection or longer-running job queue, so that those jobs don't take away from your short-lived, interactive workloads.
Kelly:
Okay, and then you've got this cost thing. I see a big number there; how should I think about cost, how does cost get measured?
John:
One thing I want to point out is that cost is an abstract concept from the Dremio planner's standpoint, but the beautiful thing is that every job profile in the Dremio UI already has cost information. So I can go and look at my Dremio query history and get a sense of where my cost percentiles are, and make decisions based on that if I want that separation of big queries from small queries, and also make things faster.
Kelly:
Okay, cool. So, do I have to use workload management? Is it on by default?
John:
It is on by default, and we ship a set of queues and rules that we think work best in an out-of-the-box deployment. It's up to the user to take it to the next step and customize it, creating their own custom queues, limits and rules, but out of the box it's enabled and you get some of the benefits of the prioritizations we determined. For example, by default we create a separate queue for reflections and a separate queue for other queries, and within those queues you also have a specific section for things like UI queries, which are more time sensitive, so we want to give those high priority, whereas background reflection jobs get lower priority. Those are just suggestions to get folks started, and it will typically vary from environment to environment, but out of the box it gives you a starting point.
Kelly:
So, we introduced this in 3.0. What's changed in 3.1?
John:
In 3.0 this was a preview feature, turned off by default, and there were some limitations because it was a preview feature. For example, you could not change your queue definitions; you would have to remove a queue and add a new one to change anything. Those preview limitations are no longer there, and this is enabled by default, so it's production ready. We're looking forward to our customers using this.
Kelly:
Okay, great. But it's not an enterprise feature? Or is it?
John:
So advanced workload management, being able to define your own custom queues as we've been discussing, is an enterprise feature.
Kelly:
Right, okay.
John:
The community edition has preset queues that you can customize.
Kelly:
Great. Okay, again, if you have questions about this please add them through the Zoom interface and we'll be sure to get to them at the end. Next, you had a little bit more here in terms of, I think you could point out-
John:
Yeah, we mentioned this verbally, but each of the query profiles now has some more detail. In the high-level profile you can now see the time a job spent in the queue, separated out from the rest of the time, so you can understand if you're waiting in the queue for too long. Then at the detail level, which is the screenshot above, you can get details on the cost of that query and why a certain query was placed in a particular queue, for routing purposes.
Kelly:
Okay. The next topic we want to talk about is something near and dear to my heart, enhanced previews, and John can also talk to us about this. I'm sure everyone has experienced the issue we're trying to solve with this.
John:
Yeah, one of the biggest experiences in the Dremio UI is being able to work with your data as you're going through your analysis. One of the things we strive for is to give you a preview for each of the steps in your transformation journey, so that you get a better understanding of what's going on as you go, as opposed to doing a bunch of steps and only checking at the end whether it works; you validate the preview as you go along, basically. But in some cases, especially where you have a slow or large data source and previews take a few seconds, you're moving through your analysis much faster than that. You want to do three consecutive steps and then check something, or do five consecutive actions and then check something. In previous versions that would not be a quick experience, because after each one of your operations you would wait for a preview.
The way we've enhanced this is that previews now run in the background, in a less synchronous manner, so you can keep working on your query or your transformation without having to wait for the result of the preview. This gives you much more speed and flexibility, especially in cases where you know the set of things you want to do and you're just going through them bam bam bam and then getting the result, rather than doing an exploratory type of analysis.
Kelly:
Great, so the spinning wheel is not going to block me anymore?
John:
You got it. You can look at the catalog, the graph, the SQL editor, or the visual transformations without having to wait for your slow source.
Kelly:
Great. Let's go on to the next topic, which is enhancements to Gandiva. Jeff, talk us through this Gandiva name; where does that come from?
Jeff:
Yeah, we mentioned this in previous webinars. Gandiva is the mythical bow from Indian folklore that could accelerate an arrow with a thousand times the power, so it made the archer much, much stronger. We're taking that same concept here. One of the things we strive for at Dremio, and we have an internal goal for this, is to improve performance by 10x every two years. This year we're doing that with Gandiva.
Jeff:
As you know, previous to this there was the Arrow initiative, which was an effort to ensure we used in-memory columnar buffers for all the data. Gandiva is all about trying to get performance as close to hardware execution speed as possible. By combining Gandiva, specifically its LLVM-based execution kernel, with Arrow, we can perform really low-level, super fast operations for things like sorts, filters and projections. The result is that some customers actually see up to 40x, 70x, even 100x better performance with the operators that we've "Gandivaized" up to this point. Again, we're taking advantage of state-of-the-art CPUs and GPUs to accelerate performance even further.
In this release, as with the previous one, it's still in a preview mode, so if this is something you want to experiment with, play with, we've got a lot of folks that have been doing some experimentation and side-by-side comparisons. Drop us an email; it's on this slide here, sorry, the previous one, the slides must be out of order. There's an email address, [email protected], that you can contact if you want to have this enabled.
Kelly:
If you want to enable it in your own installation.
Jeff:
Yeah.
Kelly:
If Gandiva makes things better, is there anything I have to do? Do I need to change my queries, or reconnect to my data sources, or rebuild my reflections, or anything like that to take advantage of Gandiva?
Jeff:
No, there's nothing you need to do. In 3.0, where we first introduced Gandiva, if you wanted to take advantage of it your query had to contain nothing but Gandiva-supported operators and functions, otherwise we'd fall back to the Java way of doing things. In this release, 3.1, you can have a mix within your query, and we will switch between the two execution paths dynamically. There's nothing you need to worry about. As we go forward, we're just going to continue to add more operators and functions to Gandiva, things like aggregations and joins, and it'll just get faster and faster with every subsequent release.
Kelly:
It looks like this is only available on Linux. Is that correct?
Jeff:
Actually, we do have it on Windows now, that's the latest, and we had a community member actually do some work on that, so there's a community port available.
John:
Actually, it was one of the founders of the Arrow project that did a port to Windows. I don't know if it's finalized yet, but I think it's nearly there. Pretty exciting.
Jeff:
If you go back a slide, it doesn't cover it there, but there's support for OS X as well.
Kelly:
Excellent, so everybody benefits? Nobody has to do anything to take advantage of it, but it's not on by default? I have to ask for the key, but once I have it I can start using it and I'm off to the races.
Jeff:
Yeah, and it's a pretty simple process to set it up. All we ask, if you're willing, is that you share what kind of results you're seeing, and also let us know if you're running into any issues; we certainly want to hear about that.
John:
Jeff, it sounds like this is just the beginning. Should we expect more performance work from this going forward, like an ongoing project?
Jeff:
Yeah, like I said, we have plans to add other operators, aggregations, joins, whatnot to the list. The other thing too is, there are a number of things we want to do, so if folks have ideas about which operators they want to see Gandivaized, we can certainly prioritize according to that. We're open to your opinions and suggestions.
Kelly:
One little thing I saw in the slide. You mentioned UDFs; it says that you can do UDFs in Gandiva. Is there a tutorial or a blog post or something like that that tells me how to do that?
Jeff:
Yep, there is. There's a blog on our website that walks you through that. You can just search for it or we can provide the link.
Kelly:
There are probably not a whole lot of web pages that mention Gandiva and UDFs, I'm guessing, so that one probably ranks pretty high.
Jeff:
In fact, it's the first.
Kelly:
It is.
Jeff:
In fact it is the first.
Kelly:
Just mad SEO skills. Thank you, thank you Jeff for that. Let's move on to the next topic. We have Gnarly in a hard hat, so what's going on here? Improved data reflection matching.
John:
It's working hard to simplify life for you, Kelly, so that you don't have to do the work.
Kelly:
I like that very much.
John:
Just to give a high-level view before we dig deeper into this: as you know, data reflections is a technology built into Dremio for providing interactive speed regardless of what data sources you're interacting with. One of the beautiful things about reflections is that as a user you do not have to say "use this reflection" or figure out which reflection makes the most sense to get the speed you want. It all happens under the covers and is handled by Dremio's reflection optimizer by matching, that's what we call it, matching reflections to queries. Just to give you a sense of this process: once a user submits a query, two paths happen.
John:
We first generate the query plan for how to best execute that query, many plans in fact. Then we also look at all the candidate data reflections which we think may apply to that user query. We figure out whether each reflection covers the query; basically, can it answer this question? Then we also generate a bunch of plans using these reflections. Then we look at all of these plans, the ones containing reflections and the ones that don't, do a cost-based analysis, and run the query in the way that makes the most sense. That process of evaluating a reflection for coverage, basically answering the question "can Dremio use this reflection to answer this query?", is called matching.
We basically rebuilt our data reflection matching algorithm, and it provides an amazing amount of benefits. One of the first things you'll see is an increased number of matches, even in cases where you have thousands of virtual data sets in a complex semantic layer. We work with customers who have tens of thousands of virtual data sets with many chains, and in those situations we were observing that sometimes reflections wouldn't match as expected. We've revamped the way we look at this, so now, regardless of the complexity of the structure of your data sets, as well as the complexity of your expressions, you should expect reflections to match in a much more reliable and as-expected manner.
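To give a rough sense of what a "match" looks like in SQL terms (the table, column, and reflection names below are hypothetical, and the rewrite is conceptual rather than something you write yourself), imagine a query against raw sales data and an aggregation reflection that has already rolled that data up by region and day:

-- The user's query, written against the virtual data set:
SELECT region, SUM(amount) AS total_sales
FROM sales.transactions
WHERE sale_date >= DATE '2019-01-01'
GROUP BY region;

-- An aggregation reflection with dimensions (region, sale_date) and the
-- measure SUM(amount) conceptually materializes a rollup shaped like:
--   region | sale_date | sum_amount
-- If it covers the query, the optimizer can substitute it and simply
-- re-aggregate the much smaller rollup:
SELECT region, SUM(sum_amount) AS total_sales
FROM reflection_store.sales_agg      -- hypothetical materialization
WHERE sale_date >= DATE '2019-01-01'
GROUP BY region;

Dremio does this substitution automatically during planning; the user's SQL never changes.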
Kelly:
That's pretty cool, John. Does this mainly have to do with our customers creating deeper chains of VDSs and more complex sort of-
John:
Yeah, you got it. Any customer, even the ones that don't have deep chains, will benefit from this, but it will specifically shine in situations where you have a complex transformation graph, or where you have many reflections. Dremio is much more adept now at figuring out how and when to use reflections. Basically, you should see reflections at least appear as matched, either used or considered and costed, as opposed to saying "could not cover." You should expect to see fewer "not covered" results in situations where it actually-
Kelly:
So on the right we have, and to me as a user this is really helpful, a query profile on the acceleration tab. You can see this query was accelerated, and the first section there tells you which reflections were used to accelerate the query; in this case it was just one.
John:
You got it.
Kelly:
There could be lots of reflections used to accelerate a query. What I see below that are the ones that Dremio, I guess, matched, in the sense that they could logically be substituted for the raw physical data. So they were considered in that top branch of the two-path tree that we looked at in the beginning. Of those three, the first one says "too expensive," so I think you're saying that means it covered the query, but it was deemed more expensive to use that reflection than the one that was used for acceleration.
John:
Yeah.
Kelly:
Which, in this case, the one that was more expensive was a raw reflection versus the aggregation reflection. Assuming that's the same data, that aggregation reflection is probably a lot smaller, so it's going to be cheaper to scan.
John:
Yeah, it's all about shortcuts. As opposed to scanning a billion rows of raw data, Dremio is basically saying, hey, I already have this rolled up, so let me use that; it's going to cost me much less.
Kelly:
So it's never going to say "matched"; it's either going to be the one that was used to accelerate, or it will say "too expensive." Then the other two didn't; they didn't cover the query, so maybe they were missing some columns or missing some rows, even though they were on the right data set.
John:
Or maybe, for example, that reflection is for the west coast sales and the question was about east coast sales, so it can't really answer that question from the reflection because the reflection just doesn't cover it. But you got it.
Kelly:
Cool, okay. On the next slide, reflections are a big topic, and we have some resources for you. There's a relatively new white paper that gets into a lot of detail around best practices for data reflections.
Jeff:
There's also a new course on Dremio University specifically about data reflections. That'll go live next week, but it's something you can register for today. And there's a tutorial if you're just getting started; if you've never used data reflections it's a good tutorial to walk through for some examples. Those are three resources available to you. Let's move on to the next topic, the new "ARP" connectors. ARP.
John:
Before we move on, could you explain the history of this Gnarly with the hat, Kelly? Is there a specific story behind this one?
Kelly:
Is he a gambler?
Jeff:
He looks like he's got one of those poker hats on or something.
John:
Is that a bookie?
Kelly:
Have you ever seen a gambler wear that hat? It's a bookie hat, it's not a- I don't know, I wish you didn't put me on the spot like that, I just don't know.
Kelly:
Jeff, tell us, what does ARP stand for?
Jeff:
It's Advanced Relational Pushdown, which is a fancy way of saying the way we create connectors is much more structured, much more configuration driven, less prone to error, and allows us to create connectors faster with better pushdowns, more quality, yada yada yada. We're also going to tease people a bit: we're thinking about extending this and building an SDK based off it that we can provide back to the community, so folks can build their own connectors at some point. More to come on that in the future; we just wanted to throw that out there.
Jeff:
In this latest release it was Oracle, MySQL, and AWS Redshift that we added, or rather, we always had these connectors but we "ARP-ified" them, we re-developed them. Then of course Teradata, which is brand new. Any new RDBMS connectors that we come out with are going to be created with the ARP framework. If you take a look here at the bottom, this picture I've got of the connector framework: on the left-hand side is the ARP API that we use to create our RDBMS connectors. On the right-hand side you've got our file connectors: S3, HDFS, ADLS. Those use the DFS API; that's an internal API right now. In the middle we've got our top-level source ones; these are the ones that classically we've always used to create all the connectors, so this is like MongoDB, Elasticsearch, and whatnot. This gives you a view of how we look at the different connectors and the different SDKs involved.
Kelly:
Okay, so ARP: Advanced Relational Pushdown. Not the sound of a narwhal.
Jeff:
It's actually kind of a coincidence, isn't it?
Kelly:
If anyone on the webinar has… this is insider info. On the 404 page on Dremio.com, you can hear the sound of a narwhal.
John:
The crying narwhal?
Kelly:
No, I don't actually know what the narwhal is saying, but it's not a crying narwhal, it's just the sound of a narwhal, so check it out some time-
Jeff:
It's an Easter egg.
Kelly:
So let's go on to the next topic here. I'll say a little about Dremio University. Like I said at the beginning, we just launched this on Wednesday. It's free for everyone to use, and we have Dremio Fundamentals and Data Reflections. Dremio Fundamentals is live, you can start taking it; Data Reflections you can register for; Dremio for Data Consumers you can register for. We'll be rolling out more courses, and we'd love to hear from you if there are particular things you'd like to see us develop. I think this is an important evolution for Dremio: we're trying to make it easier for people to master Dremio and to get more value from their data faster.
While I think Dremio is really easy to use, I think it's also incredibly powerful, and some things really do require some mastery. Hopefully this is a way to get you there faster than trial and error and just reading documentation. If you have questions along the way and you want to ask, we have a topic on the community site, and I hope everyone on the call is registered on community.dremio.com; you can post questions there and feel free to answer other people's questions as well. There are some tests along the way, and if you pass with an overall score above 75% you get this nice certificate on the right that you can post on LinkedIn, show your mom, whatever makes you feel good.
Another way we make this easy is that we provision an instance of Enterprise Edition on your behalf. It's private, just for you, and like I said you can do whatever you want with it. It will self-destruct after 48 hours. So you can go in and provision this instance. You're welcome to follow along on your own laptop or your desktop or your cluster or what have you, but you might not have access to Enterprise Edition, so we wanted to make sure you'd have access to Enterprise Edition and not have to worry about whether it would run or not. If you go away and come back, the state of your instance is preserved during that 48-hour period. These things are not free for us to run, so each instance runs for 48 hours and then self-destructs. If you have any questions, feel free to send us an email at [email protected]. If you have ideas about other things you'd like us to develop courses on, that's another great place to send them.
Jeff:
Pretty cool. How long do you think it takes, roughly, this course-
Kelly:
Good point. Dremio Fundamentals is about two hours.
Jeff:
Okay.
Kelly:
The Data Reflections course has a lot more material, and I think that's probably more like four to six hours. There's a lot of in-depth information and examples that help you think about designing data reflections, how to optimize, the different kinds of trade-offs to consider, and also how to make sense of the query profiles and what they can tell you to help you make better design decisions for your data reflections. I expect we'll add another course or two on data reflections, but this one goes pretty far. These are things that take hours; they don't take days and weeks, but they do take hours of time, and that's why we keep them available and the state of your node is preserved for those 48 hours.
Jeff:
The cool thing is you get Gnarly to sign your certificate if you pass.
Kelly:
It takes some time, because of the fin and the pen, you know; you have to go through a few certificates to get the signature. Moving on, just pointing you at some resources: the release notes for 3.1, a link to Dremio University, and then feel free to download 3.1. So I think we probably have some questions that we want to get to here.
Kelly:
The first one: do you support LDAP and Active Directory, or do you support single sign-on?
Jeff:
John, you want to answer that?
John:
We support LDAP today, and we actually do have plans to support single sign-on on our roadmap for this year. You'll see us add single sign-on support too.
Jeff:
And that's an enterprise feature?
John:
Yeah, both LDAP and AD, as well as the upcoming single sign-on functionality, are going to be in the enterprise edition.
Jeff:
Our philosophy around features and editions is that security and security-related functionality is available in the enterprise edition. In the community edition you have an internal security model, so you can create users, but basically everyone is an administrator. Whereas in the enterprise edition you integrate with an external authentication mechanism, you have admins and non-admin users, and we'll continue to build out security features. We have TLS in the enterprise edition, and there will be more and more to come, but things related to security will be in the enterprise edition.
Jeff:
Next question here: is there anything on the roadmap for future reflections functionality to allow incremental updates, even with restrictions, for example the way SQL Server has indexed views? A long time ago I was a SQL Server DBA, but I don't remember the specifics of that indexed views example. So, John, do you want to talk to us a little bit about incremental updates in data reflections today, as well as things we might do in the future?
John:
Zooming out, when you're working with reflections there are two main patterns of reflection refresh. You can tell it, say, update this reflection every x hours, and it will automatically update the full thing. In scenarios where you have an append-only data set underneath the covers, we can say something like: hey, do a full update first, but then only capture the incremental changes, so I don't have to go and redo the whole reflection. This is especially powerful when you're working with large fact tables that are append-only.
John:
One of the requests we've been getting, which is also the subject of this question, is to allow incremental updates on a larger set of data types. Today we only support incremental updates where you have a BIGINT type, or whatever maps to a BIGINT type, as the increment key. The increment key is the key we use to capture what changed since the last refresh. We do have plans to support a larger set of increment key data types; for example, we're thinking of date/timestamp as well as strings as starting points. That should work regardless of whether you're pointing at a view or a table: as long as you can give us a suitable data type, we'll be able to do an incremental update on it.
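To make the distinction concrete, here is a minimal conceptual sketch of what an incremental refresh saves you (the names are hypothetical, and in practice Dremio manages the materialization itself rather than exposing it as SQL you run):

-- Append-only source keyed by a BIGINT column event_id.
-- Suppose the last refresh saw rows up to event_id = 1048576.
-- An incremental refresh only has to append the rows past that high-water mark:
INSERT INTO reflection_store.events_raw    -- hypothetical materialization
SELECT *
FROM source.events
WHERE event_id > 1048576;

-- A full refresh, by contrast, rebuilds the entire materialization every time.

The planned improvement John describes is about letting that high-water-mark column be a date, timestamp, or string rather than only a BIGINT.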
Jeff:
And that's append-only as well, too?
John:
Yeah, this is all, again, incremental updates, the append-only pattern.
Kelly:
Okay, thank you, John. The next question is actually a very specific question about a query that isn't performing partition pruning as the user expects. We're not going to answer this on this call, but it does raise the topic of partition pruning. In these data reflections, of course, you have the option of specifying a partitioning key, for example on date. When you choose a partitioning key, Dremio will build the data reflection in directories in your reflection store, which could be S3, ADLS, HDFS, or a file system. Then at query time, Dremio can scan just the directories that are relevant for that particular query, if you have a predicate that includes values in the partitioning key. If you query on a particular day, it can scan just that day's worth of data instead of scanning all of your data. It's a common optimization, and it's available to you. This particular question, though, is about a specific query.
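A small sketch of the idea (the directory layout and names here are hypothetical): when a reflection is partitioned on a date column, its files land in per-date directories, and a query whose predicate hits the partition key only has to read the matching directory.

-- Reflection partitioned on order_date; files are laid out roughly like:
--   /reflections/orders_raw/order_date=2019-02-01/...
--   /reflections/orders_raw/order_date=2019-02-02/...

-- This query filters on the partition key, so only the 2019-02-01
-- directory needs to be scanned:
SELECT customer_id, SUM(amount) AS total
FROM sales.orders
WHERE order_date = DATE '2019-02-01'
GROUP BY customer_id;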
John:
The better way to get some feedback on this would be to post it on the community site.
Kelly:
Thank you, John, yes. It's much easier for us to look at there and give you an answer that might be beneficial to other folks, so we'd appreciate it if you'd post it there, anonymous attendee.
The next question is a great question: how would a community member use ARP to build new connectors? Is it available to the community, for example on GitHub?
Jeff:
Yeah, great question. It's not available today, but as I said earlier in the presentation, the plan is to come up with some sample ARP files, some documentation, and a basic "hello world" sort of starting point. Stay tuned and keep an eye out and we'll have something hopefully soon.
Kelly:
If you're still on and you have a particular connector in mind that you'd want to build, we'd love to hear what you had in mind; maybe it's something already in progress, and we're always curious about what people want. There are always esoteric data sources that are interesting to hear about.
Do you have the connector for Teradata?
Jeff:
We do. It is an EE-only, enterprise-only connector. It's currently in preview, so we're providing it to customers on an on-demand basis. Contact your local sales rep or SE if you're an enterprise customer and we can get you set up.
Kelly:
The next question: does Dremio allow you to tag or provide metadata at a column level yet?
John:
Today we don't, but just to go over this at a higher level: today Dremio does automatic metadata discovery from all your sources and brings in your table information as well as column information. With Dremio today you can go and tag data sets, as well as write wiki descriptions at the data set level or for your spaces or your sources. Some of the things we're thinking about to improve this in future releases are column tagging as well as column descriptions, with all of this usable within a search and audit context for our enterprise users. Today you can do tags and wikis on data sets.
Kelly:
So you could describe a column on your wiki page?
John:
Of course, which is pretty common, by the way.
Kelly:
But it's not going to travel with that column into other virtual data sets?
John:
Yeah.
Kelly:
So it's something we want to get to, but it's not there yet.
John:
Yeah, the holy grail would be propagating that across the data graph, with search integration and all of that in one place, so you do it once and then it's available for all downstream data sets.
Kelly:
Here is an interesting question, maybe a puzzle: "If a data analyst creates a virtual data set from a data set they upload to their home area," and as a reminder to folks, you can upload things like spreadsheets, JSON files and other things into your home space, but they're private to your home space, nobody can see them but you, "can it be migrated to a virtual data set that everyone can access, based on the same data, if that data set is moved to be a source and not just in your home space?"
Just to rephrase that: I have this really important piece of data, I upload it to my home space, and now I want to make it available to everyone. I go to IT or whatever, and I put it somewhere on a file system or in a database where everyone can access it. Is there some way that any virtual data set I built on the data when it was in my home space can be reliably migrated to point at the data that is now available as a public source?
I think I know the answer to this, but I'm not the expert, so I'll try answering it. If you look at your virtual data set, it is defined by a SQL expression. In that SQL expression you have your SELECT, your FROM, your WHERE and everything else, so if you change the FROM to point at the source that is public to everyone, instead of whatever the source was in your home space, and assuming the column names are the same, you should mostly be able to just make that edit and have things work. I say that because I've done it a few times. It's something that's pretty easy for you to try out, and maybe there's some fine tuning you have to do, but that's one of the virtues of having these virtual data sets based on SQL: it's something you have access to, something you can edit, including pointing it at a new source when appropriate.
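Here is a sketch of the edit Kelly is describing, with hypothetical space, source, and column names; assuming the column names line up, the only change is the FROM clause:

-- Original virtual data set, built on a file uploaded to a home space:
SELECT customer_id, region, SUM(amount) AS total
FROM "@kelly"."important_data"
GROUP BY customer_id, region;

-- The same virtual data set after the data has been promoted to a shared source;
-- only the FROM clause changes:
SELECT customer_id, region, SUM(amount) AS total
FROM warehouse.public."important_data"
GROUP BY customer_id, region;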
John:
Building on that point, by the way, we actually have some customers who build layers of data sets, basically a semantic layer, on top of, let's say, a data warehouse. When they're migrating some of the workloads into a data lake, they can just swap the tables underneath but keep the whole semantic layer, the layers and definitions, the same, and get to the same place without having to redo a bunch of things. So it's very flexible in that area.
Jeff:
I've actually done this with CTAS: I CTAS something off to somewhere else and then used the same query, in another place, to get to it.
Kelly:
Last question we have right now, but feel free to ask a few others here before we wrap up. The question is, "Can Dremio import data from SAP and SFDC (Salesforce) to do some data wrangling?" What I would say is, if you look at the data sources that Dremio supports today, those are databases and file systems. The idea of Dremio supporting an application as a data source is something that we're excited about, and part of embracing this ARP framework is that most things like Salesforce and SAP have ODBC and JDBC drivers that make it possible for different tools to access the data in those systems. Of course those systems run on a database, but it's a lot easier for you to query through APIs that simplify how you access the data.
There tend to be lots of drivers out in the market, so the idea is that ARP would allow us to take advantage of those existing drivers much more easily, but nothing has been done yet. We've done some tinkering, and in our internal builds we've played around with Salesforce and some other things, but nothing is out there. One of the things Jeff is really excited about is getting an SDK out so people in the community could build some of these connectors and share them with others. Good idea. But let's just say you had that connector there: you would be able to write queries and have those queries execute against the source, or you could build reflections on the data and have Dremio run the queries. There's a lot you could do with that in place.
Another question popped up here: "Do you have plans for supporting native SQL syntax in virtual data sets to make use of database or user functions in the underlying DB?" The reality is every database has a different flavor of SQL that it supports, and specialized operators that may be specific to that particular database technology. The question here is, can I just pass through some SQL syntax that I don't need Dremio to interpret; I just want it to push this SQL down into the underlying source. I think that's the question. Do either of you guys want to take a stab at answering that?
John:
This is something that has actually come up in multiple contexts, like folks asking, can I enter my Elasticsearch queries as a virtual data set definition, or, for example, with SQL Server, hey, I have a table-generating function, how do I use that with Dremio? We're still looking at our options, and our current thinking is to stick with ANSI SQL syntax and translate as much as possible, while considering ways for folks to leverage some of these native capabilities in the sources somehow in those queries, as opposed to completely switching the way we do SQL. That's something we're thinking about, and it has definitely come up more than once, but I don't think we've arrived at a solution just yet, because we want to preserve the simplicity of our current approach but also give you the flexibility. That's the trade-off we're thinking about.
Kelly:
Another question came in, and this is an interesting one: "How does Dremio handle JSON documents with different fields, fields with inconsistent data types, invalid dates, different nested structures, arrays?" Once you're in the world of JSON, it's the wild west sometimes, and there is no schema. None of these systems have a catalog schema that you can go to; in a relational database you have system tables, but the same thing doesn't really exist in Elasticsearch or Mongo. What do we do? For example, if you have a field in JSON, and 100 documents have it as a string and the next 100 documents have it as a date, what would happen in Dremio?
John:
The way we handle this for something like Elasticsearch versus a JSON document on a file system is a bit different. Let's say you're working on S3 and you have some documents you're analyzing. In that scenario, once Dremio detects that a certain column has more than one data type, let's say some values are strings and some are numbers, we recognize that field as a mixed-type field. Within Dremio's UI, or through our APIs, you then have some workable options. You can say, hey, keep only the strings; you can say, keep only the other data type; or you can say, convert all of them to the same data type, whichever that is, and mark invalid ones as null, or empty strings, or a default value that you choose. We give you the flexibility for dealing with that without having to write scripts or some other job outside of Dremio.
When you're talking about a system like Elasticsearch, which tells you the type it has, it's a bit different. We try to coerce the values into the type we get from the Elasticsearch mapping, and if that doesn't work you'll either see an error in some cases, or the value will be empty in some cases, depending on how we're handling it. We do provide a bunch of different ways that you can deal with that within Dremio.
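As a rough illustration of the "convert everything to one type and mark invalid values as null" option: the source path, column names, and the regex helper below are assumptions for the sketch (treating the mixed field as text), and Dremio's UI will generate a similar transformation for you, so you would not normally write this by hand.

-- Coercing a mixed string/number field by hand, keeping only values that parse:
SELECT
  CASE
    WHEN regexp_like(raw_value, '^[0-9]+$') THEN CAST(raw_value AS INTEGER)
    ELSE NULL                               -- mark values that do not parse as null
  END AS cleaned_value
FROM s3.landing."events.json";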
Kelly:
And if you have nested structures, like arrays or arrays of sub-documents, Dremio gives you syntax to extract those into individual columns.
John:
Yeah, we have a few things. You can literally go JSONdoc.a.b.c.e to access a specific field, or you may choose to, for example, flatten all the list items so that you can do further analysis on them. Like these, you have a bunch of different ways that you can wrangle and work with complex data, either using SQL or using a UI-based transformation capability.
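For example (the source, table, and field names here are hypothetical), dotted paths pull a nested field out into its own column, and FLATTEN unnests a list so each element becomes a row:

-- Reach into a nested structure with a dotted path:
SELECT t.customer.address.city AS city
FROM mongo.shop.orders AS t;

-- Unnest a list column so each item becomes its own row:
SELECT t.order_id, FLATTEN(t.items) AS item
FROM mongo.shop.orders AS t;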
Kelly:
To me, one thing that's pretty different is that if you connect to a table in a relational database, you're good to go, whereas with these NoSQL sources you may have some work to do to get the data structured in a way that's useful. Maybe you don't care about the field that has an array and you just ignore it, or maybe you need to unwind it so that you can deal with the values in the lists of your array. There's a little bit more to do before you're off to the races when it's a JSON source.
We have another question that just popped up. It's a comment, but maybe we have something to say about it: "Connectivity with a SAS client was not as good as connectivity with Tableau over ODBC. Big difference in the amount of time queries take to run." First of all, I think this is another great thing to post as an example on the community site and see if there's any feedback about what's going on. We could look at the query profile to see what may be causing such a difference in the time to process the query. Certainly SAS and Tableau generate very different SQL for asking the same question, but we'd have to look at a profile to have a better sense, and the community is a great place to post that.
John:
One quick suspicion is that SAS may be pulling the raw data set, or a larger portion of the data set, as opposed to Tableau, which pushes most of the work down to Dremio so you don't get bottlenecked in the transport layer.
Kelly:
SAS definitely has a pattern of data access of pulling the whole data set into memory and then working on different intermediate representations of the data, versus other tools, and Tableau is an example, that push lots of queries down into the source. I think it's a good hypothesis that one query is probably just moving a lot more data than the other, and the queries are actually not equivalent.
So that's all of the questions. We really appreciate the participation and all the enthusiasm about Dremio, and we will be back in a few months with the next release. We'd love to hear feedback, so keep those questions and letters coming on the community site, and we look forward to seeing everyone with their own certificate of completion from your favorite Dremio University course.
Thanks everyone, take care, buh-bye.