13 minute read · March 17, 2022

Operationalizing the Data Lake: Highlights From Subsurface LIVE Winter 2022

Sanjeev Mohan · Principal, SanjMo

Dremio hosted Subsurface, the Cloud Data Lake conference, from March 2 to 3, 2022, with a record-breaking audience and two days of informative content. The star of the event was… wait for it… Apache Iceberg. More on this later.

It seems like just yesterday that I hosted an open data panel at Subsurface in July 2021. Like last year’s event, this one was virtual, although its host, Dremio, announced that it is bringing the event in person to select cities later in the year.

Day 1: Career Advice, Dremio Cloud, and Bill Inmon

The event opened with the Dremio CEO, Billy Bosworth, bringing his college football player-coach panache to reassure us that now is a time for the “great investment” in our careers and to always be learning. He exhorted audience members to distinguish themselves by becoming better at their craft, and to actively take part in the community and networking events. It was refreshing to see a leader of a technology company address the people-side of the industry, and to do so with empathy for the individual who is grappling with a pivotal moment in their career. He reminded us to not bemoan the “Great Resignation,” but to seize the moment and take advantage of the unprecedented innovation around us.

Tomer Shiran, Dremio Founder and Chief Product Officer, made the big reveal of the conference. Dremio Cloud, a managed service which comprises two products, Dremio Sonar and Dremio Arctic, is now generally available on AWS. Moreover, Dremio Cloud features a “forever-free” tier, along with a commercial option that provides advanced security integrations and support. Dremio Cloud reduces the complexity of running SQL workloads, while also automating data management operations.

Dremio Sonar is a SQL engine that leverages open source Apache Arrow, a columnar in-memory format, to deliver millisecond query performance on data in the lakehouse. Sonar enables you to meet your analytic requirements while bringing the engine to your data as it is, rather than needing to copy your data into an engine’s proprietary format.

Dremio Arctic is an open-standard metastore. It leverages Apache Iceberg, another open source technology that provides transactional capabilities like insert, update, and delete at the record level for Apache Parquet files on object stores. Arctic enables data lakes to get the functionality users expect from their data warehouses, with the advantage that users continue to use open standards. It also gives users the flexibility to use the compute engine of their choice: Spark, Flink, Sonar, Trino, Presto, etc.

Besides Iceberg, Dremio Arctic includes Project Nessie, an open-source metastore that replaces Apache Hive’s metastore, which Tomer says is the very last vestige of the Hadoop ecosystem. Arctic gives developers Git-like functionality for their data. Instead of making copies of data to test new experiments, Arctic allows developers to branch and merge changes without impacting production workloads. This enables many use cases that have previously required a data warehouse, such as multi-statement transactions, and use cases that have not been possible in a data warehouse, like lightweight and safe experimentation.
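To make the Git-like idea concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the Nessie or Arctic API: the `ToyCatalog` class, its method names, and the table and snapshot names are all hypothetical, and the merge is deliberately naive (a real catalog would detect conflicts). The point it illustrates is that a branch only copies pointers to table snapshots, so an experiment can change several tables without touching production until it is merged.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyCatalog:
    """Hypothetical catalog: each branch maps table names to a snapshot id."""

    branches: dict[str, dict[str, str]] = field(
        default_factory=lambda: {"main": {}}
    )

    def create_branch(self, name: str, source: str = "main") -> None:
        # Branching copies only the pointers (table -> snapshot), not the data.
        self.branches[name] = dict(self.branches[source])

    def commit(self, branch: str, table: str, snapshot_id: str) -> None:
        # A commit moves one table pointer on one branch only.
        self.branches[branch][table] = snapshot_id

    def merge(self, source: str, target: str = "main") -> None:
        # Naive merge: the source branch's pointers win. No conflict detection.
        self.branches[target].update(self.branches[source])


catalog = ToyCatalog()
catalog.commit("main", "sales", "snap-1")

# Experiment on a branch; production ("main") is untouched until the merge.
catalog.create_branch("experiment")
catalog.commit("experiment", "sales", "snap-2")
catalog.commit("experiment", "returns", "snap-3")

before_merge = dict(catalog.branches["main"])
catalog.merge("experiment")  # both table changes become visible together
after_merge = dict(catalog.branches["main"])
```

Because both table changes land on `main` in a single merge, the same mechanism gives a rough intuition for how a branch can stand in for a multi-statement transaction.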

When the father speaks, the children listen attentively. This is what happened to the vigorous online chat when Bill Inmon, the father of data warehousing, came on to present his keynote address. The chat went silent as we listened, enraptured, to Bill. In the typical Inmon style, he walked us through his four decades in the industry and how data and analytics have progressed. The punchline of the keynote was that the future of analytics is the cloud data lake, built on open standards.

The rest of the day was dedicated to educational sessions by end users, such as Mercedes-Benz, Fannie Mae, and several others, plus software vendors, including Snowflake, Acceldata, and others.

The final highlight of the day was the Women in Data panel led by Dremio’s Deepa Sankar. This panel brought four inspiring speakers who shared their thoughts on what it takes to break through gender stereotypes in a heavily male-dominated field and have a successful career in the fields of data science, analytics, and mathematics.

Day 2: Founder’s Panel, AWS, Microsoft, Cal Newport, and More

As day 2 unfolded, the chat was alive again with the audience sharing their wish list of topics. Apart from the usual technical topics like Apache Airflow, I was surprised to see persistent requests for data strategy and governance. It shows how we haven’t yet fully addressed some basic security and governance subjects. Another area that seems to beg for more attention is ESG.

The Founder’s Panel was a lively start to the day. Scott Gay from Dremio moderated the panel that comprised Tomer Shiran, Max Beauchemin, Tristan Handy, and Ryan Blue, all founders and stakeholders in data and analytics companies who have embraced open source technology to some degree. A key message was that although proprietary technologies still predominate, “open-source is eating software.” Tristan had, as always, some of the best nuggets to share. Not a coder by profession, he is firmly in the SQL camp. He says the mentality has to shift from the developer experience to the analyst experience. At one point, he started telling us about impostor syndrome, but luckily, the topic changed before the hard truth hit home for many of us practitioners!

The panel mused, “Will SQL ever die?” All agreed that SQL got one thing right: its declarative nature. End users should not have to worry about parallelism, partitioning, etc., as simplicity is key. Some panelists pointed out that SQL was not designed to express complex, dynamic, and object-oriented logic.
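A small sketch of what “declarative” means in practice, using Python’s built-in `sqlite3` module purely as a stand-in engine (the `orders` table and its values are made up for illustration). The SQL version states only the result wanted; the loop below it spells out how to compute the same thing step by step.

```python
import sqlite3

# In-memory database with a hypothetical table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("EU", 10.0), ("US", 25.0), ("EU", 5.0)],
)

# Declarative: say WHAT is wanted; the engine decides how to compute it.
declarative = dict(
    conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
)

# Imperative: spell out HOW, row by row.
imperative = {}
for region, amount in conn.execute("SELECT region, amount FROM orders"):
    imperative[region] = imperative.get(region, 0.0) + amount

conn.close()
```

Both approaches produce the same totals, but only the first leaves the engine free to choose its own execution strategy, which is the property the panelists were praising.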

The BI tools already provide some level of abstraction, like aggregate awareness, but what we need is a “transformation-aware database.” Tomer mentioned that Dremio’s engineering team is working to automate its materialized views, which are called Reflections.

Partner Fireside Chats

AWS and Microsoft both took turns at their respective fireside chats. Mark Lyons, Dremio’s VP of Product Management, interviewed Paul Meighan, Head of Product Management for Amazon S3. Paul gave a thrilling view of how S3 has risen to take on the mantle of data lake persistence to deliver modern analytical needs. Amazon S3 was one of the first AWS services launched over fifteen years ago, yet it remained “eventually consistent” for the first 14 years of its existence! Its three areas of focus lately have been: 1. Strong consistency, 2. Intelligent storage tiering to optimize cost performance, and 3. Least privilege security posture. 

The Microsoft fireside chat had Jürgen Willis, VP responsible for storage, and Arun Ulagaratchagan, Corporate VP for the Intelligence Platform, including Power BI. Tomer was the host and we got a fascinating view into the phenomenal growth of both Azure’s various storage products and Power BI. Power BI, in its seven years of existence, now has over 300,000 customers and users in 187 countries. Its most downloaded artifact is its Power BI security white paper. Arun says Power BI is “PowerPoint for data.” Enough said…

Cal Newport and Deep Work

The conference ended on the same high note as it started. Once again, the focus shifted to the people-side of our lives through the importance of performing “deep work.” Cal Newport, Georgetown University Computer Science professor and author of several books, including “Deep Work,” “Digital Minimalism,” “So Good They Can’t Ignore You,” and “A World without Email,” delivered the closing keynote. Unfortunately, this session was only available live. The center point of his talk was that one must set aside time that is free of distractions and work with intense concentration. This is what Cal calls deep work. His studies have shown that people who practice this “skill” produce twice as much as their peers.

Through vivid illustrations and examples that included Carl Jung’s home in Bollingen, Switzerland, Cal encouraged listeners to go beyond their comfort zones and eschew distractions. This is the only way, he says, that knowledge workers like us can keep up with the avalanche of recent developments around us and produce work that is high in both quality and quantity. He quoted research from others in the space of psychology that shows that “cognitive detour” leaves behind “attention residue.” Every time we switch to check emails, or Slack, we pay a “cognitive tax” of at least five minutes. Cal gave practical tips on how to prioritize deep work.

Breakout Sessions 

The conference comprised about 45 breakout sessions, with speakers ranging from Dremio customers and partners to technology experts and open source innovators. Dremio provided deeper insights into its two major launches, Sonar and Arctic. Sonar enhances the developer experience across everything from mission-critical SQL workloads to ad hoc analysis. It includes SQL Runner, an IDE, and SQL Profiler, a visual query profiler. SQL Runner supports auto-complete, multi-statement execution, and sharing of SQL statements. SQL Profiler can be used to tune query performance.

Here are a few quick hits from the breakout sessions that I feel are worth a mention:

  • Douglas GmbH, a German perfume and cosmetics retailer, took 3 days on average to build pipelines to make data available to customers. By accessing data directly on the data lake with Dremio, they claimed that they reduced the “time to data” to just minutes.
  • There were a number of sessions focused on Apache Iceberg. In one such session, PMC Member of the Apache Software Foundation Russell Spitzer shared how Iceberg maintains table history, and explained how to use data optimizations to manage files in Iceberg.
  • Mercedes-Benz shared their multi-cloud solution for bringing together AWS and Azure to foster data sharing and discovery.
  • Philip Portnoy from Wayfair shared how they created a real-time hybrid cloud data streaming pipeline using Apache Beam, Google Cloud Platform, and Dataflow.
  • One of the most popular breakout sessions was focused on data observability. Data pipelines are increasingly becoming open source and code, and Tristan Spaulding, Acceldata Head of Product, discussed why observability is so critical in this evolving architecture. 
  • Yours truly moderated a Birds of a Feather session on the esoteric topic of data de-identification and soft deletes, with the founder of Okera, Nong Li. The punchline was that average software stacks have many dependencies, and each subsystem is constantly updating with different versions, leading to a higher surface area for failures and security breaches.

There were so many sessions, with a great mix of technical deep dives and how-to’s, customer stories, and partner sessions with technologies that integrated with Dremio. I, for one, will be diving back into many of the sessions I was unable to attend due to overlap.

The Road Ahead for Dremio and the Next Subsurface

With that, we came to the end of another action-packed Subsurface. On reflection, we got to see a more confident Dremio with a very clear game plan. It recently raised a $160M Series E, bringing its total funding to $400M, at a valuation of $2B. Enterprises have used Dremio’s well-engineered products for a long time, but its story is now more complete and cohesive. Its focus on open standards and the arrival of perpetually free Dremio Cloud will serve its customers for many years to come.

With 45 breakout sessions, this was the biggest Subsurface event yet. Until the next one…

How to Watch Subsurface On-Demand

If you missed the live event, you can still register until the end of March and watch sessions on-demand through your event portal. After March 29, 2022, sessions and presentations will be available to all in the Subsurface Community Resource Library.
