Gnarly Data Waves
Episode 42
|
December 19, 2023
What’s new in Dremio: New Gen-AI capabilities, advances for 100% query success, plus now on Azure
Learn what’s new in Dremio - and how you can accelerate self-service analytics at scale - including new Gen AI capabilities, Dremio Cloud SaaS on Microsoft Azure, advances to ensure 100% query reliability, and expanded Apache Iceberg capabilities to streamline Iceberg adoption and improve performance.
Dremio delivers no compromise lakehouse analytics for all of your data – and recent launches are making Dremio faster, more reliable, and more flexible than ever.
Learn what’s new in Dremio:
- New Gen-AI capabilities for automated data descriptions and labeling
- Dremio Cloud SaaS service now available on Microsoft Azure
- Advances to ensure 100% query reliability with no memory failures
- Expanded Apache Iceberg capabilities to streamline Iceberg adoption and improve performance
Speakers
Colleen Quinn
Colleen Quinn is the Principal Product Marketing Manager for Dremio Sonar. She’s spent more than 15 years working across the analytics lifecycle, from data warehousing and big data to data lakes, cloud analytics, and now the data lakehouse.
Mark Shainman
Mark Shainman is a Principal Product Marketing Manager for Dremio. He has spent more than 20 years working in both the analytics and the privacy, governance, and security spaces. He has worked with numerous data products and on numerous initiatives, including database migrations, data warehousing, big data, SQL on Hadoop, data lakes, federated query access, data cataloging, privacy compliance, and now the data lakehouse.
Transcript
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Opening
Alex Merced:
Well, with no further ado, let's begin with today's adventure, and we're gonna be talking about what's new in Dremio, which includes new generative AI capabilities, advances for 100% query success, plus again, Dremio is now on Azure, specifically Dremio Cloud. But with no further ado, I want to introduce Colleen Quinn, product marketing for Dremio Sonar, and Mark Shainman, [from] Dremio's product marketing [team]. Colleen, Mark, this stage is yours.
Mark Shainman:
Thank you, everybody, for attending our webinar today. My name is Mark Shainman, and I cover marketing for our Dremio Cloud product as well as our partner marketing. I'd also like to introduce Colleen Quinn.
Colleen Quinn:
Hi, everyone. I'm Colleen. I also lead product marketing here at Dremio, focused on our SQL query engine and unified analytics. Thanks for joining, and I'll be back with you guys in a minute.
What’s New in Dremio
Agenda
Mark Shainman:
Well, our agenda today [is] to touch base briefly on an overview of Dremio and give you some insight into our overall philosophy. [Then] we're going to dive into what's new in Dremio, and how you can specifically get started with the Dremio solution set.
Data Lifecycle Remains Complex, Brittle, and Expensive
Mark Shainman:
When looking out in the marketplace today, we interact with a lot of companies, and what we see is the data life cycle remains extremely complex, and that complexity causes a lot of expense and brittleness within their analytical ecosystem. We see a problem where you have your operational systems and those data sources, and there are tons of spaghetti ETL pipelines feeding that data into data lakes. Those can be data lakes in the Cloud, those can be on-premise data lakes, or even ones that are in a hybrid Cloud environment, and from there, then organizations in many cases push the data in ETL pipelines to their data warehouses.
Those can be on-prem, Cloud, or even multiple data marts, so there's a continual duplication of data, and numerous ETL pipelines that [need] to be created to move all that data around. At the same time, you have your clients who need to access the data. In a lot of cases, organizations are even creating extracts or cubes so their users can have fast interaction with that data and get their business insight. This remains an extremely complex environment, and the numerous duplicate copies of the data add to this complexity, as well as the problem of dark data within an organization: they have places where data exists, and they don't even know it's there. That causes problems around governance, privacy, as well as security.
Enterprises Are Moving to a Lakehouse to Simplify
Mark Shainman:
So what we see is organizations wanting to simplify their environment, decrease costs, as well as increase their analytical insight. Enterprises are moving to lakehouses to simplify that environment.
What they're doing is shifting left into an environment where you eliminate a lot of that spaghetti mass of ETL processes and duplication of data. You're taking the data and moving it in an ELT manner into the data lake, so you're reducing the complexity, reducing all those transformation pipelines, [and] moving towards SQL-based transformation, which is much more effective and much more performant.
Then that whole transformation life cycle will live in the data lake, reducing [the] overall complexity in the environment. The lakehouse has a huge advantage over the traditional architecture that I showed you previously, with those multiple steps and all that spaghetti ETL having to move data into a data warehouse. You're bringing that warehouse functionality on top of the lakehouse, so you have an environment with open data and open table formats. That gives organizations the ability to leverage the tool they want for the workload they need. They're never locked into a specific vendor.
Of course, the advantage of separating compute and storage [is that] you're able to scale your storage or your compute based on the specific needs and workloads that you have within your organization. As I mentioned earlier, it's this no-copy architecture, meaning that I do not have to create multiple copies of data and move them around my environment, with the problems that brings: not only multiple copies, data inconsistencies, and data governance issues, but also the dark data that exists within environments as well.
Shifting Left Reduces MTTI and Shortens ETL Pipelines
Mark Shainman:
So this whole idea is shifting left, moving the users closer to the data. When you look at an environment where you're trying to shift left and simplify your overall analytical ecosystem, you need three main components. One is, [that] you need an intuitive self-service experience where you're enabling your users to easily access that data wherever it lives within your organization. You need a powerful, intelligent query engine that gives sub-second response times for those SQL queries, so your business users can have rapid insight into the data that they need to analyze. And of course, you need next-generation data-ops capabilities [that simplify] the overall management of that analytical ecosystem. You need those three core pieces of innovation, and that's what the Dremio Lakehouse brings to the table.
Dremio Cloud
Now Available: Dremio Cloud Data Lakehouse on Azure
Mark Shainman:
Now let's jump into Dremio Cloud, what we're bringing out, and some of the new components that we have within our solution set. What we've announced is that the Dremio Cloud Data Lakehouse is now available on Azure for public preview. Of course, one of the great things about this solution set is we provide a full, unified semantic layer that allows organizations to leverage Dremio Cloud and access the data wherever it lives within their environment, so that data can be in ADLS storage, it could be in other databases, or even in other NoSQL platforms as well.
We bring to the table a highly performant SQL engine, which, as I mentioned earlier, allows organizations and users to get that sub-second response time to their queries and [the] questions that they're asking, and to gain that business insight in a very short time. Another key component that Dremio Cloud on Azure brings to the table is lakehouse management. You have this robust management capability that provides things such as git-for-data, so you're able to create branches of data [and] do experimentation and testing without having to copy data, while at the same time not impacting your production environment. It provides a modern data catalog and built-in data optimization to optimize data that exists in Iceberg tables within your lakehouse environment.
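As a minimal sketch of what git-for-data looks like in SQL, assuming a hypothetical catalog named `catalog` with a `sales` table and a hypothetical `staging.new_sales` source (the exact branching syntax is documented in Dremio's catalog reference):

```sql
-- Create an isolated branch for experimentation (no data is copied)
CREATE BRANCH etl_test IN catalog;

-- Write and validate changes on the branch without touching production
INSERT INTO catalog.sales AT BRANCH etl_test
SELECT * FROM staging.new_sales;

-- Inspect the result on the branch only
SELECT COUNT(*) FROM catalog.sales AT BRANCH etl_test;

-- Once validated, merge the branch back into main
MERGE BRANCH etl_test INTO main IN catalog;
```

The point of the design is that a branch is just metadata: experimentation happens against the same underlying files, so there is no duplicate copy of the data to create, govern, or clean up.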
Dremio Cloud in Azure: Eliminates the Pain of Managing Infrastructure
Mark Shainman:
So Dremio Cloud brings a lot of benefits to organizations who are looking for a robust, analytical lakehouse environment within Microsoft Azure. With Dremio Cloud on Azure, you're always getting the latest functionality that's coming out within Dremio. There's no management overhead: all the tasks such as DevOps to manage infrastructure, manually upgrading infrastructure, and creating and managing certificates and multiple URLs for connections––all that management overhead does not exist within the Dremio Cloud on Azure environment.
Of course, there's zero downtime, and all the upgrades are automatic. Scalability is [also] something you don't have to worry about: the environment dynamically scales based on the workloads that you have, so whether you have peak workloads during the day or seasonal spikes, it dynamically scales and contracts based on your needs and the performance needs of your workloads. The Dremio environment offers end-to-end encryption, and we offer flexible capacity and pricing as well within our Cloud environment. So there are huge benefits for organizations who are looking for an effective analytical infrastructure in the Cloud and leveraging Dremio Cloud within Azure.
Dremio Generative AI
GenAI to Simplify Data Curation and Analytics
Mark Shainman:
One of the other key components that we are releasing as well is our generative AI capabilities. Within the market, we see a lot of effort and a lot of organizations who are looking at the potential of what generative AI can bring to the table. There [are] numerous use cases for generative AI to advance analytical and BI workloads––things such as preparing data, making cataloging recommendations based on usage of the data, masking or redacting data, analyzing data, generating dashboards or predictions, sentiment analysis, and natural language processing––really enabling the non-technical user to have easy access and be able to ask questions within analytical environments, doing things such as text-to-data, and presenting data stories in an easy, readable fashion to your business users.
New: Easy Data Curation with GenAI
Mark Shainman:
What we're bringing to the table right now is new generative-AI-driven easy data curation. What you can do within Dremio Cloud is automatically generate wikis: it leverages generative AI to automatically generate dataset descriptions and SQL examples within your Dremio environment as well. It can automatically generate labels and tagging, so it actually will go out, look at the semantics of the data, and create that label, create that metadata around the data that exists within your lakehouse environment. So it's highly effective, and it streamlines this management process within a lakehouse environment, leveraging generative AI for this type of workload.
Look for the GenAI Symbol
Mark Shainman:
So when you're actually in Dremio Cloud, you can look for an AI symbol, this one up over on the right-hand side. All you simply do is click on that symbol, and it will show you which specific tasks you can leverage generative AI for, such as populating the wiki or giving columns specific metadata information.
GenAI for Data Engineers and Analysts
Mark Shainman:
One of the other things that we brought out earlier this year, that I wanted to touch base on and remind people of, [specifically for] analysts and data engineers, is our text-to-SQL capabilities. You're able to simply write things in natural language, as you would talk with a friend, and it dynamically generates SQL. That, then, can be leveraged to query data that exists within your lakehouse environment. Now, I'd like to pass it on to Colleen, where she is going to talk about a unified path to Apache Iceberg.
A Unified Path to Apache Iceberg
Apache Iceberg: An Open Table Format for Enterprise Data Lakes
Colleen Quinn:
Awesome, thanks, Mark. I am super excited to be talking about our unified path to Apache Iceberg. As I think many of you know, Dremio is Apache Iceberg native. Apache Iceberg is an open table format that is being rapidly adopted across the tech sector. We recently completed a state-of-the-lakehouse survey that showed that Apache Iceberg has the fastest rate of adoption across the tech sector. Netflix created the Apache Iceberg table format to address the performance and usability challenges inherent in Apache Hive tables in large, demanding data lake environments.

So as we start to see this acceleration of Iceberg adoption across the tech sector, Dremio has doubled down on Apache Iceberg as our primary table format, although if you're familiar with us, you also know that we support a myriad of table formats, including Delta Lake. We believe in meeting our customers where they are, and giving you the flexibility to use the table formats that make the most sense to you. We're excited to help bring our customers along with a unified path to Iceberg. Adopting a new table format can be difficult, and there are complexities associated with that, whether you're migrating from an existing table format or you're storing all of your data in raw file formats already in the data lake.
New: Unified Path to Iceberg
Colleen Quinn:
Dremio had previously supported this conversion process for CSV and JSON files. We're excited to announce that we've expanded that support to include Parquet as well, so now, if you're storing any of your data in Parquet, CSV, or JSON, we have a one-click process to move your raw file formats into Apache Iceberg to accelerate your analytics. Let me show you a little bit of how you do that. Let's say you have a folder of CSV files in your S3 bucket. Dremio previously wouldn't auto-recognize that as a table; before, you'd have to manually promote those CSV files so Dremio [would] be able to recognize them as a table. There were [many] manual and time-consuming steps that we had already solved for with CSV and JSON––and remember, we're now adding Parquet to that mix––and we're allowing you to do it with one simple function called ‘COPY INTO.’
Two-step conversion (and some housekeeping!)
Colleen Quinn:
What you're seeing here on this slide is this two-step conversion process, and some extra housekeeping, to make sure that your data lake is as performant as possible. Step one, you just create your table using a ‘CREATE TABLE’ function. The reason I say conversion to Iceberg is one step is because if that table already exists––if you've already created that table with that schema format––you don't need to take that step again. So you create your table using that ‘CREATE TABLE’ function, or ‘CT,’ as we like to call it at Dremio. You can also add a partition to the end of that CREATE TABLE statement if you want to add hidden partitioning, which makes queries on the table even faster.
So that's step one. Step two––‘COPY INTO.’ You copy your data into that Iceberg table just by running that COPY INTO query on the raw data––again, whether that's CSV, JSON, or now Parquet––and you can do up to 1,000 files in a single batch. It's fast and you're done! You've now converted that raw data into Iceberg. It's just that simple, and you can see the SQL code snippet here. That took two minutes for our folks to write.
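The snippet itself is on the slide; as a rough sketch of the same two-step flow, assuming a hypothetical S3 source named `s3_lake` and a `sales` table (see the Dremio docs for the full CREATE TABLE and COPY INTO options):

```sql
-- Step 1: create the Iceberg table, with optional hidden partitioning
CREATE TABLE s3_lake.sales (
  sale_id   INT,
  region    VARCHAR,
  amount    DOUBLE,
  sale_date DATE
)
PARTITION BY (month(sale_date));  -- hidden partitioning speeds up queries

-- Step 2: copy the raw files into the table
-- (CSV shown here; JSON and Parquet work the same way)
COPY INTO s3_lake.sales
FROM '@s3_lake/raw/sales/'
FILE_FORMAT 'csv';
```

If the table already exists, step one is skipped and the conversion really is just the single COPY INTO statement.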
Now, there's some housekeeping that's optional, but we recommend it, just [to] ensure that your analytics are moving as fast as they can. These are not new functions; these have long been available in Dremio. Step three––you can optimize your tables. Again, it's not required, but the ‘OPTIMIZE TABLE’ command ensures that your table contains the optimal file sizes, so that you have the most performant queries.
And then step four––you just clean up your tables using the ‘VACUUM CATALOG’ function. Again, you're already done––you've already converted your raw data into Iceberg––but this step performs table cleanup for all the tables in your Dremio catalog by expiring snapshots and deleting unused data files, so it helps improve storage utilization for the lakehouse. And if you're already using Dremio Cloud and our lakehouse management capabilities, previously called Arctic, that optimization and cleanup runs in the background on all of your data once it's in Iceberg.
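Continuing the same hypothetical example, the optional housekeeping steps look roughly like this (the catalog name is illustrative):

```sql
-- Step 3 (optional): rewrite small files into optimally sized ones
OPTIMIZE TABLE s3_lake.sales;

-- Step 4 (optional): expire old snapshots and delete unused files
-- for every table in a catalog (catalog name is hypothetical)
VACUUM CATALOG my_catalog;
```

OPTIMIZE addresses the small-files problem that raw ingestion tends to create, while VACUUM reclaims the storage held by snapshots and files no table still references.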
Ingest and Optimize Data Automatically
Colleen Quinn:
That's one of the major benefits of our lakehouse management capabilities: once you've ingested that data, that table optimization and that garbage collection happen automatically, because we're so tightly integrated into these table optimization capabilities. So we'll automatically vacuum those tables to remove unused manifest files, manifest lists, [and] data files, and that cleanup runs in the background to ensure efficient storage and performance.
So now: COPY INTO is step one, that table optimization and garbage collection then run automatically in the background, and your data lakehouse will be super performant. We'll get you off of those old raw file formats and table formats so that you can take advantage of Apache Iceberg and all of the flexibility that it enables.
Expanded SQL Functions
New SQL Array Functions Available Now!
Colleen Quinn:
Moving on a little bit, I wanna talk about how we're continuing to expand our SQL functions and the scope of SQL in Dremio. Dremio is already an incredibly powerful SQL engine––the only one in the industry that was built to run at scale on the data lake––and we're always continuing to expand our SQL coverage. So in this release, for both Dremio self-managed software and Dremio Cloud, we've added an additional seven SQL array functions that you can see here, and we'll continue to expand these. We added about the same number in our previous launch, and we have a whole host of these coming in the future as well. We've got a new blog post on how to use the SQL array functions, too, so if you want to take a look at that, it's also available on the Dremio website.
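The specific seven functions are listed on the slide and in the blog post. As a hedged illustration of the kind of query array functions enable––the function names and the `orders` table with its `tags` array column are assumptions here, so check Dremio's SQL function reference for the exact names and signatures:

```sql
-- Illustrative only: ARRAY_CONTAINS and ARRAY_SIZE stand in for
-- Dremio's array-function family; verify exact names in the docs.
SELECT order_id,
       ARRAY_SIZE(tags) AS tag_count        -- number of elements in the array
FROM   orders
WHERE  ARRAY_CONTAINS(tags, 'priority');    -- rows whose tags include 'priority'
```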
Faster and More Performant
Colleen Quinn:
As always, I want to talk about how we are continuing to make Dremio even faster. With every release, our goal is to ensure that we're launching enhancements that make Dremio faster and more performant, and that's for both Dremio Cloud and Dremio self-managed software. In this most recent release, we have a handful of capabilities or enhancements that we're putting into Dremio to make that happen.
Let's start with our SQL query engine, which is at the heart of everything we do. For query execution, we have improved query performance and system resource utilization by about 15% on the TPC-DS benchmark, which is the gold standard for performance measurement. With our query planner, we're now using Iceberg statistics and optimization for superior out-of-the-box query performance––again, tapping into that deep, native Iceberg integration that is so prominent for us here at Dremio. In terms of general performance, if you are a Parquet 2.0 user, we now support Parquet 2.0 using a vectorized reader, and this improves Parquet 2.0 reading performance by a staggering 70%. And if you're a Tableau person like I am, we've always had an integrated native Tableau connector––you can launch Tableau directly from the Dremio UI––but we're excited to share that we have updated the Tableau connector using Flight JDBC. This is a built-in Arrow Flight connector, and it enables about a 30% improvement in Tableau performance when you're using Tableau against Dremio.
Dremio Self-Managed Software
Dremio Self Managed: Kubernetes Elasticity
Colleen Quinn:
Now I wanna give [our] self-managed [software] a little bit of a call-out. We spent a little bit of time earlier in this presentation talking about Dremio Cloud, and how we're so excited that the Dremio Cloud managed service is now multi-cloud, in Azure and also in AWS. But for self-managed software customers, we want you to know that we're always continuing to make Dremio even easier to use. And so in the 24.3 launch, which is available now, we're excited to introduce that Dremio self-managed now includes Kubernetes elasticity. You've got that K8s-based elasticity, so the infrastructure- and concurrency-based elasticity rules run automatically, and it works with our workload management, query routing, and concurrency. And so you've got all of these benefits now with that elastic scale for Kubernetes.
Ready to Get Started?
Colleen Quinn:
Let's talk for a quick second about how you can get started. If you're already a Dremio software customer, you do need to go and download the latest launch update, 24.3. You can do that at our support portal, and you'll be able to take advantage of all the capabilities in Dremio software that we spoke about earlier in this presentation.
If you're a current Dremio Cloud customer, it's live. One of the beauties of running in the Cloud is the speediness with which you have access to all of the new features and functions in Dremio. Those features are available now, they're live, and you can go start using those generative AI capabilities that Mark was talking about, or running Dremio on Azure, today.
If you're new to Dremio, we're excited to see you, and we'd love for you to try it free. You can do that using either Dremio Cloud, or a free version of our self-managed community edition. So go ahead and go to the website, and take a look, and get started with that free version. As always, you guys, we’re excited to have you with us, and we're so excited to see all of these innovations coming in Dremio. Mark, thanks for letting me join you, and we hope to see you the next time.
Q&A
Colleen Quinn:
Okay, Mark, so I wanted to see if anyone in the audience had any questions for us. We did have one come in, and that question, I think, is best answered by you, so the question was: Is Azure the only Cloud platform on which Dremio Cloud is available?
Mark Shainman:
No, we have Dremio Cloud in both AWS and Azure, so we cover both of those Cloud providers.
Colleen Quinn:
Perfect, perfect. And then there was another one, and I'll take this one. The second question, which was submitted earlier during this presentation, was about table formats. I had spoken a lot about Iceberg, and there was a question about other table formats that we support. I want to be clear: we also support Delta Lake––we do read on Delta––so wherever your data [is], in Delta or Iceberg, we can absolutely read that. And as I think we made clear, we also support DML for Iceberg, so [you get] all of that data warehouse functionality on the data lake using Apache Iceberg.
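To make "data warehouse functionality on the data lake" concrete, here is a minimal sketch of DML against an Iceberg table, reusing the hypothetical `s3_lake.sales` and `staging.new_sales` names from earlier (see Dremio's SQL reference for the exact UPDATE, DELETE, and MERGE options):

```sql
-- Warehouse-style DML running directly against an Iceberg table
UPDATE s3_lake.sales
SET    amount = amount * 1.05
WHERE  region = 'EMEA';

DELETE FROM s3_lake.sales
WHERE  sale_date < DATE '2020-01-01';

-- Upsert new records from a (hypothetical) staging table
MERGE INTO s3_lake.sales AS t
USING staging.new_sales AS s
  ON t.sale_id = s.sale_id
WHEN MATCHED THEN
  UPDATE SET amount = s.amount
WHEN NOT MATCHED THEN
  INSERT VALUES (s.sale_id, s.region, s.amount, s.sale_date);
```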
I think it looks like we just got another question here in the chat. Let me take a look at that one. Does Dremio support a connector to PowerBI in Azure?
Yes, the answer to that is yes––[there are] native integrations for PowerBI and Tableau in Dremio. Sorry, Mark, I didn't mean to jump all over you there. Tableau and PowerBI are actually native in the UI, so you can launch Tableau or PowerBI from the Dremio user interface. However, you can use any tool that has an ODBC or JDBC connection, so [it's] super flexible, and the number of clients we support [is] much more extensive than just those two. Again, anything with an ODBC or JDBC connector is fully supported, and as I mentioned during my portion of the presentation today, we also now have that upgraded Tableau connector that uses Arrow Flight and is about 30% faster, which is super significant. If you're using that version of Tableau, you'll see those performance gains straight away.
Let's see. I think we answered that question. Are there any other questions today from this group that we wanted to address while we're all here? [I] just keep seeing the same one…I think…so, okay, it doesn't look like it. I wanted, one more time, to thank everyone for their time, and to encourage you again: if you are a Dremio customer on self-managed software, go ahead and download that upgrade from the support portal. It's super easy to do––just go to the support portal, and there's an option for downloads; it's in there in the 24.X folder, so you can see that it's the latest one, which is 24.3. If you're a Dremio Cloud user, congratulations, you don't have to do anything––it's there for you already. And if you don't know us yet, as Alex said at the beginning of our conversation today, please go ahead and give the free trial a try. You can just go to the dremio.com website and get started there with either Cloud or software, and we'll see you next time. Thank you so much, everyone.