
Recognizing A New Tier

After nearly two years of R&D, I’m excited to announce the launch of Dremio today. Being able to share what we’ve been working on gives me a great sense of pride in what this team has been able to build.

We started Dremio to create a new product with a simple goal: make data more accessible. We’ve talked to hundreds of people about data problems over the last two years, from Fortune 100 CIOs to in-the-trenches data and infrastructure engineers. Through many iterations, we’ve created a product that we believe can establish self-service data as a new paradigm for the demands of modern data.

2005-2012: The Rise of Self-Service BI

Early self-service BI trailblazers like Qlik and Tableau looked at traditional enterprise IT and said there must be a better way. Getting to a report shouldn’t require a month-long requirements process. People need to be self-sufficient. On this simple idea, they built great products and were rewarded with passionate and appreciative user bases. After the BI renaissance of the 2000s, business users found a better existence.

2010-Present: Developer as Data Decider and Database Specialization

Over the last decade, the same thing that happened for data analysts in the 2000s came to developers. Developer-focused data systems became mainstream. What was once a rigid, IT-driven set of processes around schema management and entity modeling became a fluid and flexible set of technologies. Technologies like MongoDB, Cassandra, S3 and Hadoop all catered to a common, developer-driven design and implementation paradigm. Apps could be built in days instead of months. Schema and storage became afterthoughts, relegated to sit behind decisions related to scale, latency and operational efficiency. These changes needed to happen. The volume of data, combined with constant changes in requirements, demanded a new model.

All this flexibility made the engineers developing applications far more productive, but it rolled back the self-sufficiency of data analysts and data consumers. They still depended on access to data, but the data was no longer in the shapes they needed. In many ways, we’ve regressed. Many business users and data analysts are just as reliant on data engineering today as they were on central IT and data warehousing teams in the early 2000s.

Moving Forward: Rethink and Re-tier

At its core, this problem is a simple one: there is too much coupling between the physical nature of data and the business-focused requirements of end users. We need to be able to reclaim the self-service BI golden years while also supporting a highly flexible, developer-focused, and constantly changing DBMS landscape. In short, why can’t we have both highly productive developers and highly productive data consumers?

Engineers know that a simple solution to too much coupling is adding a level of indirection. That is what Dremio is at its core: a level of indirection that provides a data playground – an analysis and collaboration space for data consumers – while abstracting away the painful and complex physical realities of many data infrastructure products. This layer ensures that neither data consumers nor developers are bound up in solving for changing business requirements, massive data scales, and emerging DBMS and file system technologies.

Dremio: A New Vision for a New Category

Dremio is a free, open source product designed to be inserted between your existing data infrastructure and your business users. It works with your existing investments in both data infrastructure and BI tooling. Just as VMware created a new software layer between the physical and logical concepts of compute, Dremio creates a software tier between the physical data layer and the logical business layer. We call it a self-service data platform. This new tier will redefine how data organizations operate. We think business users and IT will be happier and more productive.

Consumer, not Techie

The rise of big data has brought us lots of very cool technologies. The problem is that most people need solutions: ways to solve problems that don’t require heavy technical expertise and investment. At Dremio, we invested heavily in a friendly consumer UI that we’ve open-sourced, a pleasant place to collaborate, shape, blend and understand data. This UI is a core part of the technology and works hand-in-hand with a massively distributed computation platform. Connect to multiple data systems and start working with data in a few minutes with no programming or database skills. Perform transformations such as complex object unrolling, field extraction and pattern canonicalization, and then share those results with others by simply pointing and clicking. It’s like Google Docs for your data.

Blending of Logical and Physical

To redefine data consumption, users need to be able to share and collaborate around data. They should be able to build on each other’s ideas and not worry about details outside their business context. Foundational data tiers (physical and basic cleansed data) can be built by data engineering and central IT. Analysts can then blend this data, bring it closer to the business, and present it to non-technical business users in a business-specific tier. Whatever the organizational need, logical tiering can ensure minimal copies, clear separation of responsibilities, and a clear provenance of data assets.
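As a rough illustration of the tiering idea above, here is a minimal Python sketch. It is not Dremio’s API: the `CATALOG` dictionary, the dataset names, and the `provenance` helper are all hypothetical. Each dataset records only which dataset it was derived from, so no data is copied and the lineage of any business-facing dataset can be traced back to its physical source.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical catalog: dataset name -> (owning team, parent dataset or None).
# A parent of None marks a physical dataset at the foundation of the tiers.
CATALOG: Dict[str, Tuple[str, Optional[str]]] = {
    "raw.orders": ("data engineering", None),
    "cleansed.orders": ("data engineering", "raw.orders"),
    "sales.orders_by_region": ("analysts", "cleansed.orders"),
    "exec.regional_summary": ("business users", "sales.orders_by_region"),
}


def provenance(name: str) -> List[str]:
    """Walk from a dataset back to its physical source, nearest tier first."""
    chain: List[str] = []
    current: Optional[str] = name
    while current is not None:
        chain.append(current)
        _, current = CATALOG[current]
    return chain


def owners(name: str) -> List[str]:
    """List the teams responsible for each tier a dataset depends on."""
    return [CATALOG[step][0] for step in provenance(name)]
```

Calling `provenance("exec.regional_summary")` walks the chain from the business tier down to `raw.orders`, which is the kind of lineage question logical tiering is meant to make easy to answer.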

Managing the Land of 10,000 Data Lakes

Since the best data infrastructure technologies today will be different tomorrow (or later today), a self-service data platform should treat physical and logical representations of data the same. In Dremio everything is simply a dataset. You can analyze, shape, combine and export a dataset. Today some of these datasets may be coming from a combination of Postgres, Oracle, MongoDB, and Hadoop. Tomorrow, there will be different developer requirements. Developers should be able to move to new tech without impacting data consumers.
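To make the "everything is a dataset" idea concrete, here is a small Python sketch under stated assumptions. The class names (`Dataset`, `PhysicalDataset`, `VirtualDataset`) and the in-memory loaders standing in for Postgres and MongoDB are invented for illustration and do not reflect how Dremio is implemented. The point is only that a consumer works against one interface, so the physical source can change underneath without the consumer noticing.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


class Dataset:
    """Common interface: consumers only ever ask a dataset for rows."""

    def rows(self) -> List[Row]:
        raise NotImplementedError


@dataclass
class PhysicalDataset(Dataset):
    """Backed directly by a source system (hypothetical loader callable)."""
    source: str
    loader: Callable[[], Iterable[Row]]

    def rows(self) -> List[Row]:
        return list(self.loader())


@dataclass
class VirtualDataset(Dataset):
    """Defined as a transformation over another dataset; stores no data."""
    parent: Dataset
    transform: Callable[[List[Row]], List[Row]]

    def rows(self) -> List[Row]:
        return self.transform(self.parent.rows())


# Two stand-in sources holding the same logical data.
SAMPLE = [{"id": 1, "amount": 50}, {"id": 2, "amount": 250}]
orders_pg = PhysicalDataset("postgres", lambda: SAMPLE)
orders_mongo = PhysicalDataset("mongodb", lambda: [dict(r) for r in SAMPLE])

# What the analyst actually uses: a virtual dataset over the orders.
big_orders = VirtualDataset(
    parent=orders_pg,
    transform=lambda rows: [r for r in rows if r["amount"] > 100],
)
before = big_orders.rows()

# Developers migrate storage; the analyst's dataset definition is unchanged.
big_orders.parent = orders_mongo
after = big_orders.rows()
```

In this sketch `before` and `after` are identical, which is the property the paragraph above asks for: developers move to new technology without impacting data consumers.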

Make Big Data Feel Small

Today’s data scale is massive. Analysts need to make fast decisions and iterate. Analysis isn’t just about looking at one report; it’s about drilling in, slicing and predicting. In many cases, the self-service BI promise fails due to the latency of analysis on terabyte- and petabyte-scale datasets. To deliver true self-sufficiency, a self-service data platform should be expected to deliver data faster than the underlying infrastructure. It must understand how to cache various representations of the data in analytically optimized formats and pick the right representations based on freshness expectations and performance requirements. And it must do all of this in a smart way, without relying on explicit knowledge management and sharing. Dremio Reflections are a sophisticated way to cache representations of data across many sources, applying multiple techniques to optimize performance and resource consumption. A user’s interaction with any dataset (virtual or physical) can then be routed autonomously, through sophisticated algorithms, to the right representation.
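The selection step described above can be sketched in a few lines of Python. This is a toy model, not how Dremio Reflections actually work: the `CachedRepresentation` fields, the cost numbers, and the `pick_representation` function are assumptions made for illustration. It picks the cheapest cached representation that covers the columns a query needs and is fresh enough, and returns `None` when the query has to fall back to the underlying source.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class CachedRepresentation:
    """A hypothetical cached copy of a dataset in an optimized format."""
    name: str
    columns: FrozenSet[str]   # columns this representation can serve
    age_seconds: int          # time since it was last refreshed
    scan_cost: float          # relative cost of answering from it


def pick_representation(
    candidates: Iterable[CachedRepresentation],
    needed_columns: FrozenSet[str],
    max_age_seconds: int,
) -> Optional[CachedRepresentation]:
    """Return the cheapest usable representation, or None to hit the source."""
    usable = [
        c for c in candidates
        if needed_columns <= c.columns          # covers every needed column
        and c.age_seconds <= max_age_seconds    # meets the freshness bound
    ]
    return min(usable, key=lambda c: c.scan_cost, default=None)


CANDIDATES = [
    CachedRepresentation(
        "raw_copy", frozenset({"region", "amount", "customer"}), 60, 100.0
    ),
    CachedRepresentation(
        "by_region", frozenset({"region", "amount"}), 600, 5.0
    ),
]

query_columns = frozenset({"region", "amount"})
relaxed = pick_representation(CANDIDATES, query_columns, max_age_seconds=3600)
strict = pick_representation(CANDIDATES, query_columns, max_age_seconds=300)
uncovered = pick_representation(CANDIDATES, frozenset({"sku"}), 3600)
```

With a relaxed freshness bound the cheaper `by_region` representation wins; with a strict bound only `raw_copy` qualifies; and a query on a column no representation holds gets `None`. The user never names a representation, which is the "without explicit knowledge" property the paragraph calls for.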

Open Core and Built on Industry Standard OSS

Investing in and relying on a new tier in your data stack is necessary. You need a solution that has an open source core and that is itself built on industry-standard open source technologies. Dremio houses a powerful execution and persistence layer built upon Apache Arrow, Apache Calcite, and Apache Parquet, three key pillars of the next generation of data platforms.

Conclusion

We are at the birth of a new era in which infrastructure and analysis can both be flexible, fast, and adaptable in a collaborative way. I am proud to be part of the group that is pushing this new vision of self-service and self-sufficiency.

From smart substitution to native pushdowns, vectorized computation to physical data control and asynchronous execution, Dremio is pushing innovation in a number of different ways. In the coming weeks, I’ll cover more.

In the meantime, check out our GitHub, join the discussion on our community site, download Dremio, and let us know what you think!

See you soon, Jacques