
Subsurface LIVE Summer 2020

Five Data Trends You Should Know

Session Abstract

Expand your technical knowledge and hear from your peers and industry experts about cloud data lake use cases and architectures at Subsurface™, where we explore what’s below the surface of the data lake. Hear firsthand from open source and technology leaders at companies about their experiences spearheading open source projects and building modern data lakes. Explore real-world use cases, from data warehousing and BI to data science and advanced analytics.

Presented By

Tomasz Tunguz, Managing Director, Redpoint Ventures

Tomasz is a managing director at Redpoint. He is an active blogger at tomtunguz.com and is co-author of Winning with Data, which explores the cultural changes big data brings to business and shows you how to adapt your organization to leverage data to maximum effect. Before joining Redpoint, Tomasz was the product manager for Google’s AdSense social-media products and AdSense internationalization.

Tomasz attended Dartmouth College, where he rowed on the crew team (Go Green!) and graduated as a George Revitz Fellow with a BA in mechanical engineering, a BE in machine learning, and a master’s degree in engineering management.

Webinar Transcript


Ladies and gentlemen, please welcome to the stage managing director at Redpoint Ventures, Tomasz Tunguz.

Tomasz Tunguz

Thank you for the introduction, Jason. I'm thrilled to be here. My name's Tomasz Tunguz, and I'm a managing director at Redpoint Ventures. I write a blog at tomtunguz.com, and it's a data-infused collection of posts about startups. But I'm here today to talk to you about five data trends you should know, trends that we've been observing as venture capitalists.

Just to kick things off, let me tell you a little bit about Redpoint Ventures. We're a venture capital firm based in Silicon Valley. We invest anywhere from $1 million to $15 million, primarily in companies in the US, and we are a group of founders and operators who've founded startups, operated hyper-growth companies and helped startups scale to terrific heights. These are some of the companies that we've worked with. We've had the privilege and the fortune to work with 26 unicorns over the last 10 years, including some of the most iconic companies in software. In aggregate, they represent more than $25 billion in market cap.

And these companies include Stripe, HashiCorp, Twilio, Duo Security and Zendesk. We also have deep domain experience in data. We were early investors in Snowflake, Looker and Dremio. To give you a sense, we evaluate something like 7,000 investment opportunities annually, and this presentation is really meant to distill some of the trends that we see in the market. I'm really passionate about data. I was first exposed to the power of data studying machine learning in school. Then I went to Google, and I saw there how large-scale volumes of data can build incredible businesses. I'm so passionate about data that I ended up co-authoring a book called Winning with Data with the founder and CEO of Looker. In this book, we researched the challenges that modern organizations face with data and how the best companies in the world mitigate those challenges and transform them into long-term competitive advantage.

I'm here today to talk to you about five trends, but underpinning all five of those trends there's one mega trend. And that mega trend is the rise of data engineering as a craft. The term data engineering is new, and the idea is important. We think data engineers will define the next decade. Ten years ago, the people working with data, moving it, shaping it, slicing it, came to it from many different backgrounds. Some came from finance; they were analysts. Some had statistics backgrounds. Others came from customer support, like me, and they all found themselves in data roles. The convergence across all these disciplines occurred because data has become a critical part of every modern company's technology stack. Data has become essential. And so now is the time that companies are investing in specialized people, processes and systems to maximize the benefit they get from data.

And the reason data engineering is so important is that data has become ubiquitous. Data is everywhere. The reason data has become so ubiquitous is that it costs much less to store data than it did 20 years ago. Twenty years ago, you would take your data, you would filter it, and you would push it into an Oracle database. And the more data you put in, the more expensive it was, because you had to buy Oracle licenses. So we filtered the data aggressively. But today we store exabytes of data in files on S3 because we can afford it. For the price of two oat milk macchiatos at Blue Bottle, I can store half a terabyte of data on Amazon S3 for a month, so everybody can afford it. And we store buckets of it, reams of it, mountains of that data. Since we have all that data at hand, we decided to use it. Makes sense.
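
As a rough check on the storage arithmetic above, here is a minimal sketch in Python. The price per gigabyte-month is an assumption chosen for illustration (it is not a figure from the talk), and real pricing varies by region, storage class and volume tier. The function name `monthly_storage_cost` is hypothetical.

```python
# Assumed list price for standard object storage, in USD per GB-month.
# This number is an illustrative assumption, not a quote from the talk.
ASSUMED_PRICE_PER_GB_MONTH = 0.023


def monthly_storage_cost(gigabytes: float,
                         price_per_gb_month: float = ASSUMED_PRICE_PER_GB_MONTH) -> float:
    """Return the storage-only cost in USD for keeping the data one month."""
    return gigabytes * price_per_gb_month


half_terabyte_gb = 500  # half a terabyte, taking 1 TB = 1000 GB
cost = monthly_storage_cost(half_terabyte_gb)
print(f"{half_terabyte_gb} GB for one month: about ${cost:.2f}")
```

Under that assumed price, half a terabyte lands in the low double digits of dollars per month, which is in the range of a couple of specialty coffees. Request and data transfer charges are left out of this sketch.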

Twenty years ago, it was IT buying the systems to extract value from the data. IT procured the systems, then installed and managed them. But starting 10 years ago, forward-thinking teams decided to do it for themselves, because IT was too slow. A modern marketing team couldn't wait three to six months to get the answers to their questions. They would be toast in the market. So they ended up buying their own systems, and that marketing team created their own data products. What are these data products? At first, they were dashboards. How many new clicks did we generate? How many new leads, how many customers, how much ad spend? Then the marketing operations teams were hired and they became more sophisticated. They started to run different scenarios to test different ideas and experiment with new techniques. Today the modern marketing team has a panoply of machine learning systems stuffed to the gills with first- and third-party data.

It's basically a quantitative hedge fund for buying online ads. And that transformation happened in less than 15 years. Those predictive systems end up creating data of their own, and that data is also stored and processed. So this process, which starts with data costing less to store and goes all the way through data product creation, is a flywheel. It's a flywheel that goes faster and faster and faster. It's in fact a massive digital boulder of ones and zeros coming down the hill at top speed. And the problem for most companies is that this boulder isn't just in marketing, it's in every department.

Let me explain. So 20 years ago, this is how the data world looked at the highest level: systems produced data, whether it was logs, transactions, or customer actions on websites. The data was filtered, put into an Oracle database or an SAP database, and then pumped into a legacy output system like Cognos or Tableau. This worked for small data volumes, but it was expensive, inflexible and closed. Pop quiz: how long does it take to update a marketing report in Cognos? Answer: too long. Your business is dead. But at that time, that was state of the art. And when the exec team got one, everybody wanted one. Each team manager saw others having success with data. People wanted the authority, the command of the business, the ideas that flowed from the data in each of their teams. And so each team decided to buy their own data systems.

IT couldn't keep up, and so the consumerization of IT was born, particularly in data. To give you a sense of the scale, for every $1 spent on technology by IT, about 47 cents was spent by these individual teams during the consumerization of IT movement. At the beginning these departments built small systems, but over time they ended up hiring operations teams, which is doublespeak for data analysts and data engineers, to help them understand what was going on within their departments. As a result, a thousand digital flowers bloomed, and they grew and grew and grew, but pretty quickly that garden became overrun with complexity. There were leaves and thorns everywhere. The marketing team decided they needed access to data not just from marketing systems, but from other systems. They needed access to the CRM database from the sales team to understand customer value. They needed customer support data to understand customer life cycles, the billing data from the finance team, and a bit of product data, those web analytics measures that inform customer conversion. And it wasn't just marketing that was taking data from other teams.

Each department needed data from every other department, and that created a completely new concept. This concept is called a data mesh, and a data mesh is a network of data producers and consumers within an organization. The idea here is that each team is responsible for producing its own data and publishing that data via an API or a common format. It's responsible for documenting that data, explaining the lineage, and keeping it up to date so other teams can use it and rely on it to make decisions. In exchange, every other team within the company does the same. And this is what creates the mesh. It's what allows the organization to use the data, create those APIs and develop increasingly sophisticated data products at scale.
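
To make the producer and consumer contract concrete, here is a minimal sketch in Python of what a data mesh registry might track. Everything in it is hypothetical and chosen for illustration: the `DataProduct` and `DataMesh` classes, the field names, and the example team and table names do not come from the talk or from any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """One published dataset, with the documentation a consumer needs."""
    name: str
    owner_team: str
    description: str
    schema: dict[str, str]                              # column name -> type
    upstream: list[str] = field(default_factory=list)   # products it is built from


class DataMesh:
    """A registry where every team publishes and discovers data products."""

    def __init__(self) -> None:
        self._products: dict[str, DataProduct] = {}

    def publish(self, product: DataProduct) -> None:
        # Each team must document what it publishes.
        if not product.description:
            raise ValueError(f"{product.name} must be documented before publishing")
        # Lineage can only point at products that already exist in the mesh.
        missing = [u for u in product.upstream if u not in self._products]
        if missing:
            raise ValueError(f"{product.name} depends on unpublished data: {missing}")
        self._products[product.name] = product

    def owner(self, name: str) -> str:
        return self._products[name].owner_team

    def lineage(self, name: str) -> list[str]:
        """Return every upstream product, nearest dependencies first."""
        seen: list[str] = []
        queue = list(self._products[name].upstream)
        while queue:
            current = queue.pop(0)
            if current not in seen:
                seen.append(current)
                queue.extend(self._products[current].upstream)
        return seen


mesh = DataMesh()
mesh.publish(DataProduct("crm_accounts", "sales", "One row per customer account.",
                         {"account_id": "int", "segment": "str"}))
mesh.publish(DataProduct("billing_invoices", "finance", "One row per invoice.",
                         {"account_id": "int", "amount": "float"}))
mesh.publish(DataProduct("customer_value", "marketing",
                         "Lifetime value per account, built from CRM and billing data.",
                         {"account_id": "int", "lifetime_value": "float"},
                         upstream=["crm_accounts", "billing_invoices"]))

print(mesh.lineage("customer_value"))
print(mesh.owner("billing_invoices"))
```

The design choice worth noting is that documentation and lineage are enforced at publish time, so a consumer in another department can always find the owner and the upstream sources of any dataset.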

And there's an important next step, which is that this data mesh actually moved to the cloud. Modern companies moved it all to the cloud because in the cloud data is elastic, it's cheaply maintained by somebody else, and it's accessible by everyone who needs to access it, with the right IAM permissions of course. More importantly, teams stored data in these cloud data lakes in open source formats like Parquet and Arrow, the ones that Tomer told you about this morning. These formats have lots of benefits. They accelerate queries, and they create a single standard that makes it easy for every tool you might have within your organization today to talk to and use the data. That's the vision. That's where we're all going. But all of us are in different states of getting there. And the reality, of course, is much more complicated than these beautiful diagrams. Without the right tooling, you don't have a data mesh, you have a data mess.

Each team has their own tools, data storage depots, and infrastructure. It's basically a big bucket of Legos. Systems don't talk to each other, there's confusion about the three different definitions of revenue across three different departments, and nobody can find the data: where's the customer support data table? And so you have problems, and these are the four consistent problems we saw in companies when we were researching this area. The first is data breadlines: "I have a question about the business, let me go and ask the engineer I met at lunch. She'll do me the favor of pulling the data." The problem with data breadlines is that the line is invisible, and you have lots of people waiting around for data to answer their questions. The second is data obscurity, or rogue databases.

I operated a rogue database when I was young. I asked an engineer at Google to run a MapReduce job to give me a subset of the Google search queries. And then we actually built a competitive dashboard on it to compete with Yahoo. No one knew the database was sitting underneath my desk on a server, and no one validated the data, but we made lots of meaningful and important decisions based on it. The third problem is data fragmentation: figuring out where your data is. You see the dashboard in front of you, and you know the data is stored somewhere in the company, but where is it, and who owns it? And then the last and the most colorful are data brawls. Those are the fights between teams about the definition of a metric like revenue or payback period.

And when we researched the market, we found all these problems consistently, even in very advanced companies. But the ultimate vision, as it's always been with data systems, is to put all this together and develop a breathtaking machine that enables a company to grow faster. I can tell you from working with some of the most innovative companies here, when you do achieve this vision it's completely transformational. It enables companies to move faster, grow faster, and outperform the competition.

Getting there and building that machine is not easy. So the question that comes to us is, when you put up the bat signal to solve these problems, who is going to come to save the day? There's a simple answer, and it's the data engineer. That's why this role has evolved: the complexity of these data systems has gotten to a point where we need specialized people to manage this infrastructure and empower everyone within a company to use data effectively. I believe that data engineering is the customer success of this decade. It's a new role, it's critically important, and it's going to be the discipline of the next 10 years. Although I can't see you, I'm confident many of you in the audience are exactly that superhero, maybe minus the Batmobile. What is the data engineer? Data engineers are the people who move, shape and transform data from where it's produced to where it's needed, consistently, efficiently, scalably, accurately, and compliantly.

They've got many different skills. But as I said before, they're going to be an absolutely essential function. And what they really are is software engineers deep in data; they're software engineers in disguise. When we were researching the space, we had this insight from talking to different software engineers: software engineers have decades of experience writing software and building tooling and patterns for writing code. One of the things that we saw most frequently is this, which is the Cloud Native Computing Foundation software development life cycle. It's an ouroboros, a snake eating its own tail. It's this infinite cycle.

And it's a consistent process for how to manage modern software releases. Vendors within the ecosystem use this in their sales pitches, and managers actually use this to manage their teams and understand exactly what's going on. Pretty quickly, I'll just run through it: first you plan the software you want to build, then you code it, you build it, you package it, you ship it, you test it with a testing harness, you release the software, and you deploy it across your cloud. You operate it, monitor it, and you repeat. So the question is, what is the data engineering equivalent of the software development life cycle? Most companies, in fact all of the companies we talked to, don't have this notion yet, but we have an idea.

This is what we observed in the market: the data engineering life cycle, a six-step cycle. First you ingest data from whatever data producers exist, and you store it in something like Amazon S3 in a cloud data lake. Then you plan what it is that you want to build. You then use a query engine to run queries across all of that data. You model the data, which is the work of defining a metric once in a central place so that everyone within a company can use it. You then develop the data product, which can be an analysis, a BI dashboard, a machine learning API, or even something like a recommendation system that might be embedded in your product. And then you monitor and test it to make sure that the data continues to flow normally and there are no abnormalities. And as I said before, this cycle actually creates more and more data, which is then saved, ingested, and used again.

Each step of this data engineering life cycle needs new tools. And these are the five trends that I'm going to be sharing with you today. These are the five different steps of the data engineering life cycle that we think are going to be important. The first is data pipelines; these are the water mains of data. Data pipelines have been around forever, but the main advance we've seen is that the modern ones use modern computing languages like Python. The second innovation is that they create high levels of abstraction that enable engineers to reuse code across pipelines. All of us who operate data can remember creating one-off scripts that break and are super brittle, and the vision behind these companies is really to eliminate that. The third thing they do is monitor these data pipelines. And the fourth is that they help you visualize DAGs, or directed acyclic graphs, which basically show you all the different steps involved in taking data from the source to the sink.
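
The idea of a pipeline as a directed acyclic graph can be sketched with Python's standard library alone. This is not how any particular pipeline product works internally; the step names and the `run_order` helper are hypothetical, and the only real API used is `graphlib.TopologicalSorter` (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Each key is a pipeline step; its value is the set of steps it depends on.
# The step names are made up for illustration.
pipeline = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_orders_customers": {"extract_orders", "extract_customers"},
    "aggregate_revenue": {"join_orders_customers"},
    "load_dashboard_table": {"aggregate_revenue"},
}


def run_order(dag: dict[str, set[str]]) -> list[str]:
    """Return one valid execution order, dependencies always first.

    Raises graphlib.CycleError (a ValueError) if the graph has a cycle,
    which is exactly what makes it "acyclic".
    """
    return list(TopologicalSorter(dag).static_order())


order = run_order(pipeline)
for step_number, step in enumerate(order, start=1):
    print(step_number, step)
```

Every source step runs before the join, and the load into the dashboard table (the sink) always runs last. A scheduler built on this idea can also run independent steps, such as the two extracts, in parallel.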

The innovators in this category include Airflow, which came out of Airbnb, and Elementl. You may have heard Julien speaking about Marquez earlier today. These are the people who are pushing this forward. To give you a sense of what this looks like, this is the Prefect UI. On the left-hand side is a visualization of a data pipeline, or a DAG. And on the right-hand side you see the monitoring, the systems that are used to help you understand the health of those data pipelines.

The second major trend that we see is compute engines. Compute engines query data within the cloud without having to move it. This enables teams to get access to all the information they need from a single place in a cost-effective, compliant and fast way. Even better, most of the time the data is stored in open source formats, which basically allows you to future-proof your data.

These compute engines are the execution layer that sits on top of all these open format files. They accelerate queries, and they make them fast not just for a single user, but for everybody who wants to access that data. They reduce cost because you don't have to move data around, and they eliminate data lock-in. And as Tomer talked to you about this morning, we're seeing this trend of compute engines across literally every single kind of data query. The third major trend is data modeling, and data modeling means defining a metric once so that the sales teams and the marketing teams both agree on the same definition of revenue and don't get into data brawls with each other. The goal of data modeling is to ensure that the entire company is aligned on a single number. I'm sure we've all lived through a meeting where we were arguing about a topic.

And we each got a different number for revenue or lead count or payback period. Modeling is all about creating an owner of a metric, explaining to everybody else what that metric is, and describing the lineage of that metric so that everybody knows exactly where it's coming from and how it's calculated. As a result of that, you can make the best decision. One of the leaders in this category is called Transform Data. To give you a sense of what this looks like, on the right-hand side you can see a YAML file that basically defines three different dimensions: the ID, the inventory item, and the order ID. Here they're defined once, encoded in code, and committed to GitHub. And then you have your data definition.
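
The "define a metric once" idea can be sketched in a few lines of Python. This is not the YAML format shown in the talk, which is not reproduced in the transcript. The `Metric` class, the `define_metric` and `evaluate` helpers, and the sample orders are all hypothetical, and the revenue definition used here (order amounts excluding refunds) is just one assumed definition.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Metric:
    """A single, owned, documented definition of a business number."""
    name: str
    owner: str
    description: str
    source_tables: tuple[str, ...]          # lineage: where the number comes from
    compute: Callable[[list[dict]], float]  # how the number is calculated


METRICS: dict[str, Metric] = {}


def define_metric(metric: Metric) -> None:
    # A second, competing definition is rejected instead of silently accepted.
    if metric.name in METRICS:
        raise ValueError(f"metric {metric.name!r} is already defined by {METRICS[metric.name].owner}")
    METRICS[metric.name] = metric


def evaluate(name: str, rows: list[dict]) -> float:
    return METRICS[name].compute(rows)


define_metric(Metric(
    name="revenue",
    owner="finance",
    description="Sum of order amounts, excluding refunded orders.",
    source_tables=("orders",),
    compute=lambda rows: sum(r["amount"] for r in rows if not r["refunded"]),
))

orders = [
    {"order_id": 1, "amount": 120.0, "refunded": False},
    {"order_id": 2, "amount": 80.0, "refunded": True},
    {"order_id": 3, "amount": 50.0, "refunded": False},
]

# Sales and marketing both call the same definition, so they get the same number.
sales_number = evaluate("revenue", orders)
marketing_number = evaluate("revenue", orders)
print(sales_number, marketing_number, METRICS["revenue"].owner)
```

Because the definition lives in one place with a named owner, a disagreement about the number becomes a question for that owner rather than a brawl between two dashboards.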

The fourth category is data products. Data products are the insights, the analytics and the software built using data within a company. And there are two big buckets of them that I'll talk about today. The first is the next-generation data visualization companies. The one that stands out is Preset, and you may have heard Max, who's the founder of Preset, speak earlier today. Preset enables teams to visualize trends within their data, share the insights with others, and then publish them on an ongoing basis to key stakeholders. Preset is a company that's commercializing open source software called Superset, which was created at Airbnb. And it adopts many of the open principles that we've talked about with the rest of the ecosystem. In addition to the business intelligence world, there's a parallel world of machine learning and machine learning tooling. To give you a sense, there are hundreds and hundreds of tools, and we spent many months grappling with a number of tools that are coming out of some of the most innovative companies in the world. Two stand out that I'll walk you through today.

The first is Streamlit, which enables machine learning engineers to share their models with non-technical users, either for direct consumption, like a recommendation system for a customer support tool that recommends different email responses, or for training those models, in the case of, say, an autonomous vehicle company that wants to refine a particular model. To give you a sense of what these products look like, this is a visualization that Preset output with its mapping capability for a data set in San Francisco. On the left-hand side you're basically able to choose and filter, and on the right-hand side you've got a beautiful map. This is an example of what Streamlit does. On the left-hand side you've got Python code, again an open, modern language, that allows you to take a machine learning model and, on the right-hand side, quickly create a web UI.

And in this case, what's going on is this is actually a use case for autonomous vehicles that allows an end user to figure out what the right parameters should be to tune that machine learning model to its optimum state. And then the last category is data quality. This is the fifth major trend. Data quality was a wave in the late 1990s. IBM bought one or two companies, but then it disappeared for about 20 years. If you search for data quality, you won't find a company that's been in this category for the last two decades. And you won't find it in modern data stacks either, until about the last two years.

The idea behind data quality is, again, to draw a parallel from software engineering. Software engineering has many different systems to ensure that new code operates well. There's a battery of performance tests and load tests. There are functional tests, unit tests, and even concepts of test coverage: what fraction of my code is actually covered with tests? In addition to those tests, there are monitoring tools and anomaly detection tools. When I push something into production, is it behaving the way that it used to? Is it behaving in a meaningfully different way? If you look for the analogy within the data world, you won't find one. It doesn't exist. And the issue with that is it manifests itself in the worst way. I'm sure you've all been in this situation where you're presenting data either to your CEO or a customer, and the data is wrong, and you don't see it but your CEO sees it right away. At that point you've basically instantly lost credibility for a quarter or two. Data quality is meant to solve those issues and basically eliminate those conversations.

Some of the innovators: you may have heard Barr from Monte Carlo speaking. Great Expectations is an open source Python library. Soda Data and Data Gravity are two others in the space. To give you a sense, there are basically two different places where you can implement these data quality measures. The first is basically pre-production code. This is a snippet from Great Expectations, which is written in Python, and it says, "I expect the values of this column, room temperature, to be between 60 and 75 degrees, with 95% confidence." This type of data integrity testing is like functional testing in software. If engineers know what to expect, it's an effective tool. It does require writing a huge battery of tests and having a test coverage metric that's similar to what we use in software, but if you've got a really good sense or an expectation of what the data should look like over time, then this is a fantastic tool.
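
The snippet shown on the slide is not reproduced in the transcript, so what follows is a plain-Python sketch of the same kind of check rather than the Great Expectations API. The function `expect_values_between`, its `mostly` parameter, and the sample readings are all assumptions for illustration. In particular, the sketch reads the "95%" in the talk as "at least 95% of the rows must fall in the range".

```python
def expect_values_between(values, min_value, max_value, mostly=1.0):
    """Check that at least a `mostly` fraction of values lies in [min_value, max_value].

    Returns a small result dictionary so a test harness can report details.
    An empty column passes vacuously in this sketch.
    """
    if not values:
        return {"success": True, "fraction_in_range": None, "unexpected": []}
    unexpected = [v for v in values if not (min_value <= v <= max_value)]
    fraction = (len(values) - len(unexpected)) / len(values)
    return {
        "success": fraction >= mostly,
        "fraction_in_range": fraction,
        "unexpected": unexpected,
    }


# Hypothetical room temperature readings in degrees Fahrenheit, with two bad values.
room_temperature = [68, 70, 71, 69, 72, 66, 74, 70, 69, 71,
                    70, 68, 73, 67, 70, 71, 69, 72, 120, 95]

result = expect_values_between(room_temperature, 60, 75, mostly=0.95)
print(result["success"], result["fraction_in_range"], result["unexpected"])
```

Here 18 of the 20 readings are in range, which is below the 95% bar, so the expectation fails and the two out-of-range readings are reported. A battery of checks like this one plays the role that functional tests play for code.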

And then there's another approach that uses machine learning. These are screen grabs from Soda Data. Companies like Soda Data and Monte Carlo use machine learning to understand data patterns and then discover anomalies. These anomalies might be differences in data volume: if a data stream suddenly falls, you want to know about it. Or maybe there's a change in the distribution of the data: it used to be a Gaussian distribution and now it's a Zipf, and that's got massive implications for your downstream machine learning models. The machine learning approach actually borrows lots of different techniques from security systems. And the benefit of this is that the system is basically autonomous.

You install an agent, and it learns a bit about what's going on in your data stream. Then it makes lots of predictions and detects anomalies. The challenge, of course, in any machine learning system, whether in data quality, in security, or in any other kind of anomaly detection, is that you really want to make sure the signal-to-noise ratio is strong and meaningful. Otherwise users sort of tune out the results. But the initial indications suggest that this is an absolutely fabulous technology.
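
As a much simpler stand-in for the machine learning systems described above, here is a sketch of a volume anomaly check based on a z-score. It is a basic statistical rule, not a learned model and not how any named vendor works. The function `is_volume_anomaly`, the threshold of 3 standard deviations, and the daily row counts are all assumptions for illustration.

```python
from statistics import mean, stdev


def is_volume_anomaly(history, latest, threshold=3.0):
    """Flag `latest` if it is more than `threshold` standard deviations
    away from the mean of the historical values."""
    if len(history) < 2:
        raise ValueError("need at least two historical points")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is unusual.
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Hypothetical number of rows arriving per day in one data stream.
daily_rows = [10_120, 9_980, 10_050, 10_210, 9_940, 10_075, 10_130]

print(is_volume_anomaly(daily_rows, 10_090))  # a normal day
print(is_volume_anomaly(daily_rows, 2_300))   # the stream suddenly falls
```

The threshold is the knob that controls the signal-to-noise trade-off mentioned above: set it too low and users tune out the alerts, set it too high and a real drop in volume goes unnoticed.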

So in summary, these are the five data trends that we're paying meaningful attention to. The first is data pipelines: moving data with modern code and modern monitoring, so that you can guarantee that everything is flowing smoothly. The second is compute engines: using open source technologies and standard file formats to query cloud data without having to move it, in secure ways that are much more cost-effective than an enterprise data warehouse would be.

The third is data modeling, which is the elimination of the data brawl. Let's define a metric once for the entire company. Let's put it in a place where everybody can access it and understand the lineage, the components, and the owner, so that I can go and escalate if I have a question. The fourth is new tools around data products, all the outputs of these pipelines. We want to squeeze insight from data, and that's going to come in two different forms. It's going to come in the form of modern BI platforms that help you present reports on an ongoing basis to key stakeholders within the company, and in the form of tools that generate machine learning products, whether they are used for internal purposes or external consumption. And then the last is data quality, which is the processes and practices of ensuring that the data is arriving from its source to its sink exactly the way that you expect, so that you're always making decisions with the right data.

It's really early in this decade of data engineering. I see it as we're six months into a 10-year-long movement. And the future depends on you. We need data engineers to weave together all of these different novel technologies into a beautiful data tapestry. These are not easy problems, and the landscape underneath you is changing all the time. There are new software tools to learn, there are legacy applications to wrangle, and there are lots of different demands from everybody within the company to get them exactly the data they need exactly when they want it, which was yesterday. At Redpoint, we believe that this decade is the decade of the data engineer. We believe that it's an entirely new role that specializes in the critically important function of getting data from the places where it's generated to the places that can create insights and unlock powerful decision-making ability within the business. The future depends on you, but I think it's a very bright future. Thank you very much.