May 3, 2024

The Future of Data Engineering in a Post-AI World

What does data engineering look like in a post-AI world? Will we even need data engineers? What will they do with their time if we can automate ETL? What is the role of data pipelines and data warehouses in a world of LLMs and vector databases? How can data engineering leverage this newfound intelligence to become better, faster, and smarter?

Learn this and more in this talk by eBay Distinguished MTS Michelle Ufford. We’ll look at the evolution of data engineering, delve into major new data technologies and architectural patterns enabling AI, and evaluate ways data teams can leverage AI to grow their business impact. This talk is ideal for data engineers wanting to future-proof their skills and data leaders needing to support and leverage AI.

Topics Covered

AI & Data Science
DataOps and ELT/ETL
Performance and Cost Optimization


Transcript

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Michelle Winters:

My friends, welcome to the future. We entered the era of AI a little over a year ago with the launch of ChatGPT, and then Copilot and other similar AI tools. That was the year that AI went mainstream. And I've been doing technology for a minute now; I've never seen anything take off like I've seen AI take off. Already we have seen 2% of the workforce displaced and 38% of jobs disrupted, fundamentally changed, by AI technologies, according to recent data from SHRM. So AI is here. It's already here. And before we can talk about the future and what's coming next, we really should take a look at where we've been, because this is not the first time in history that we've experienced this kind of disruption.

The History of Innovation Cycles

So we have this beautiful visual from the Edison Institute that charts the innovation cycles of recent history, starting back in 1785 with the introduction of water power and our first factories, which gave rise to mass production and textiles. Unfortunately, we don't have time to go through this in detail, so we're going to have to skip forward to the 90s, which is when we saw the rise of software and our digital networks, which paved the way for where we are today. Many of us are still grappling with digital transformation, so this has been underway for a minute. And I'm guessing if you're listening to this talk, you are probably of an age where you remember a time before smartphones. But smartphones have only been around for 17 years. Seventeen years. Think about all of the ways in which they have changed our lives: the ways that we communicate, that we work, that we shop, that we find information. So we are about to experience an even bigger shift in the way that we live, the way that we work, and the way we interact with each other with the introduction of the sixth wave, which began just over a year ago.

So the sixth wave comes with AI and our Internet of Things, robotics and drones, and my personal favorite, clean tech and clean energy. And we're seeing that every single one of these waves has happened in less and less time, and has had a greater and greater impact on society. AI is going to be no different.

So who am I? It's a bit of an existential question. But for the purposes of this talk, my name is Michelle Winters, and I've been working with data and nothing but data for the past 25 years: big data, small data, and everything in between. I got my start at GoDaddy as a transactional DBA before moving over into the analytics side of the house, and I haven't looked back since. I ended up leading their data engineering, data management, and data platform architecture teams. I really loved my time there and did some really cool things before I went on to Netflix, where I got to lead the data engineering core and then the big data tools team. I cannot say enough good things about Netflix. I loved my time there, and I loved the stunning colleagues I got to work with. I did some really cool things there before I decided to leave and start my first startup, Noteable. I founded Noteable because I believed that we needed to make the same sorts of big data tools available to everybody, even the companies, especially the companies, that didn't have the big tech nine-figure data budgets. And so Noteable was created to make data collaboration tools based around Jupyter notebooks available to everybody. That company was acquired at the end of last year, and now I'm at eBay, where I'm working on their end-to-end data modernization efforts. I'm at eBay because I believe with every fiber of my being in the need, in the absolute necessity, of a more sustainable future. So that's why I'm there. And before I go any further, I have to give the disclaimer that the views expressed in this talk are mine and mine alone and do not necessarily represent the views of my past or present employers.

A New Frontier of Analytics Maturity

So with that out of the way, let's talk about data. We've been doing this for a minute, right? We've been working with data for a while. If we go way back to the recent past, with the introduction of databases, what they allowed us to do was have all this rich business information co-located in a single spot, so we could start to create a view of the world. It was a digital representation of our physical business. And so for the first time, we had data available and we could start to ask questions of that data. This gave rise to our ODS systems and our ability to answer the basic question of what happened, what happened in the business. We found this to be very powerful, and we wanted to continue to invest. The next logical question we had was, why did it happen? That gave rise to our business intelligence teams and our diagnostic analytics, being able to answer why something is happening. Once we got through the what and the why, the next logical question was, if we understand where we've been, how can we predict where we're going? That gave rise to our predictive analytics and our data science teams. And from there, once we understood where we're headed, we wanted to start being able to influence where we're headed. We wanted to be able to make the next best recommendation or the next best action, and that gave rise to machine learning and prescriptive analytics. And that brings us to where we are today, at the very dawn of cognitive analytics, where we can now pull all of these what, why, and how questions together into an intelligent agent that's sitting beside us and really helping guide us: in understanding the data, in understanding what's happening, how to interpret the results, what to do next. And then my favorite, let me help you with that.

So if you look, as we rise through these levels of the pyramid, the value to the business increases substantially, right? But you can't skip over one of these phases to get to the next one. You can't predict what's going to happen if you don't have a good basis in understanding what's happened and why it's happened. And you can't really have a cognitive agent assisting you and giving you answers if you don't have accuracy in all of the preceding levels. I mean, you can, but nobody's going to use it. Nobody's going to trust it. It's not going to have the impact you want. So you can't really skip over the phases of maturity. And as you get to cognitive analytics, that's not even the end, right? Because we go back to the very beginning, and for our descriptive analytics we want to understand not only the reactive view of what's happened in the past; we now want to understand what's happening right now in the business, and why it's happening. If you can imagine some sort of production issue: what's the real-time impact of this, and how do we fix it? And then, again, my favorite, let me help you with that.

“New” Data Technologies Powering AI

So to make all of this possible, there are three key technologies that are enabling this, and I have "new" in quotes because these have been around a minute, but they've been very niche solutions, and now they're really coming to the forefront in the enterprise. The first one is knowledge graphs, which allow us to represent the relationships between entities with much greater complexity and flexibility than what we can express in our relational databases. The second technology is vector databases, which allow us to take rich images or documents, or even our knowledge graphs themselves, and flatten them into an array of numeric values called vectors. Once we flatten them into these numeric arrays, we can start to apply math to them, and we can start doing similarity searches between them. So we can actually say, show me the images that are most like this other image that I provided. And then the third major enabler is Kappa architecture. We started off with batch analytics, which allowed us to do some really great things, but it was slow. So we moved into Lambda architecture, which unified our streaming and our batch processing. This was very powerful, but it was also very complex. And so now we're seeing a movement towards more and more Kappa architecture, which moves straight into the streaming world. If you need to reprocess historic data, you go back to that event stream, which is immutable and persistent. So it simplifies things and allows us to get more real-time.
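To make the vector idea concrete, here is a minimal sketch in Python of the "flatten to numbers, then apply math" step described above. The random vectors are stand-ins for what a real embedding model would produce, and a production vector database would use an approximate nearest-neighbor index rather than this brute-force scan; the dimensions and names are illustrative, not from the talk.

```python
# A toy version of "flatten to vectors, then apply math": rank a
# catalog of embeddings by cosine similarity to a query embedding.
import numpy as np

rng = np.random.default_rng(seed=42)
catalog = rng.normal(size=(1000, 128))   # 1,000 items, 128-dim embeddings
query = rng.normal(size=128)             # "show me items most like this one"

# Cosine similarity = dot product of L2-normalized vectors.
catalog_norm = catalog / np.linalg.norm(catalog, axis=1, keepdims=True)
query_norm = query / np.linalg.norm(query)
scores = catalog_norm @ query_norm

top5 = np.argsort(scores)[::-1][:5]      # indices of the 5 most similar items
print("most similar items:", top5.tolist())
```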

So how does this relate to AI? Well, if you imagine an LLM, a large language model, we are leveraging the knowledge graphs to fine-tune the LLM with great content, and then we are conversely using the LLM to enrich our knowledge graphs. So it's this beautiful symbiotic relationship between the two. We're using our vector databases for retrieval-augmented generation, or RAG for short. The challenge with LLMs is that they require a lot of computational processing, and they quickly get out of date, or they don't necessarily have all of the context to answer the question in the most accurate way. So we want to ground them in the most relevant, most recent context possible to get to the highest quality answer. To do that, we take the prompt that is provided to the LLM and use an embedding model to get to our vector. Once we have our vector array, we can do a similarity search, using algorithms such as k-nearest neighbors, or k-NN for short, to provide the content that's most relevant to the prompt to the LLM as prompt context, getting higher quality answers as a result. And then for our Kappa architecture, we're not fine-tuning the models in real time, but we do want real-time monitoring of the models, to understand how they are performing. So if you suddenly see a whole bunch of thumbs-downs on a model, we want somebody to go investigate and see, is there some reason for the sudden model degradation? Because there's real-time business impact, right? So we want to have that visibility. And then, as I mentioned, with our retrieval-augmented generation, we want to ground it in the most recent information available. So as new data is coming in, we want to both update our knowledge graphs and do real-time embeddings into our vector databases, so that the LLM always has the most recent, most accurate body of knowledge to work from. These are some of the key technologies that are enabling AI from a data engineering perspective.
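Here is a hedged sketch of that RAG retrieval flow in Python: embed the prompt, run a k-NN similarity search over pre-embedded documents, and prepend the winners as context. The `embed()` function is a toy stand-in for a real embedding model, and the final LLM call is a hypothetical placeholder; none of these names come from the talk.

```python
# A sketch of RAG retrieval: embed the prompt, find the k nearest
# documents, and prepend them to the prompt as grounding context.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy embedding: a unit vector seeded from the text's hash."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

documents = [
    "Kappa architecture replays history from an immutable event stream.",
    "Knowledge graphs model relationships between business entities.",
    "Vector databases support similarity search over embeddings.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(prompt: str, k: int = 2) -> list[str]:
    """k-NN retrieval: rank documents by similarity to the prompt."""
    scores = doc_vectors @ embed(prompt)  # unit vectors, so dot = cosine
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

prompt = "How do we reprocess historical data in a streaming system?"
augmented = "Context:\n" + "\n".join(retrieve(prompt)) + f"\n\nQuestion: {prompt}"
# answer = llm(augmented)  # hypothetical LLM call with the grounded prompt
print(augmented)
```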

Relational Databases

What does this mean for relational databases? I have been hearing for the last 15 to 20 years that relational is dead, and rather than argue about it, I figured we should just look at some data. There's a website called DB-Engines, which is great: it gives you a view of the popularity of various database engines, as well as how they're trending over time. We can just take a snapshot of the top 20 most popular database engines in the world, and we can see that relational is still very much alive. In fact, rather than being killed, relational has become multi-model, where we've now brought our documents and our graph data and our vector data into the relational system. And so this is our preference: whenever it's available to us, whenever scale permits, we want to co-locate the data for greater efficiency; it's easier to work from.

And so whenever scale permits, we will co-locate the data. Scale does not always allow for that, so sometimes we have multiple engines, and that's okay. But the trend is that relational is not going anywhere. It's still our transactional systems, it's still our point of sale; it still very much has a place in the enterprise.

Something else I want to call out here is that there are 418 database engines today, and this number is only growing. That's because there are a lot of niche uses and different use cases that we want to support. So throughout this talk, and in general, you'll never hear me talking about one technology to rule them all, or which technology is the best, because I truly believe the answer is always: it depends. It depends on the industry you work in, the use cases and the constraints you have, the data access patterns, and the scalability that you're working with. All of these things matter and help us get to the right technology for your use cases.

The Modern Data Stack

Okay, so let's look at the modern data stack, where we are today. The first thing is that we continue to see the need to support a broader and broader set of users with our data stack. We've now pushed more heavily into the rest of the business operations teams, and our product engineering teams need to work with analytics, so we want the same sort of experience to support them. That same experience also supports our AI teams.

We have been moving to the cloud for a minute. We're actually starting to see some people come back on-prem because we found out the cloud's pretty expensive, so you're still seeing hybrid exist. We're also seeing that we're multi-cloud, right? We don't want to be vendor-locked. But the big movement here is really around edge computing. This is where you're seeing your IoT, this is where you're seeing your CDNs. If you've got autonomous vehicles, if you're working on smart homes, you've got a lot of edge computing there.

Compute and storage continue to be the biggest constraints we face, and to help with that, we are seeing a strong movement towards decoupling the two. Data volumes are not going to slow down anytime soon, so we're seeing even vendors have to decouple their storage and compute, because we want to store data once and serve it through a variety of different computational engines depending on the use cases that we have.

Data lakes, data marts, lakehouses, warehouses: one of the questions I get a lot is, are data marts going away? The answer is no. Especially as we move to concepts like data mesh, which is an organizational decentralization approach, I think data marts are actually increasing, and our data warehouses and our lakehouses are there to unify the views across all of those different data marts. So all of this continues to be very relevant. I already touched on graphs, but one thing I do want to touch on is micro databases. I just heard about this recently, and I love the concept. It's essentially a data lake of one, which allows you to have a single data lake for each customer or for each product. You can imagine if you have a smart home, you need someplace to store all of your account information, all of your application configurations, and all of your sensor data, right? You want to have some place to be able to serve analytics to the customer. This is a beautiful way of making sure that they have high performance while also preserving privacy and governance. Should they cancel or discontinue the account, you can either drop the whole thing or decouple the account information from the rest of the data. It makes it easy to manage.
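As a concrete illustration, here is a minimal sketch of the "data lake of one" idea in Python, using SQLite as a stand-in engine: one small database per customer, so offboarding is as simple as deleting that customer's file. The paths, schema, and function names are invented for the sketch, not drawn from any particular product.

```python
# One micro database per customer: isolation, performance, and easy
# privacy offboarding, since the whole "lake" is a single file.
import sqlite3
from pathlib import Path

DATA_DIR = Path("micro_dbs")
DATA_DIR.mkdir(exist_ok=True)

def db_for(customer_id: str) -> sqlite3.Connection:
    """Open (or create) the one-customer database."""
    conn = sqlite3.connect(DATA_DIR / f"{customer_id}.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sensor_readings ("
        "  ts TEXT, sensor TEXT, value REAL)"
    )
    return conn

def delete_customer(customer_id: str) -> None:
    """Account cancelled: drop the customer's entire data lake."""
    (DATA_DIR / f"{customer_id}.db").unlink(missing_ok=True)

with db_for("customer_42") as conn:
    conn.execute("INSERT INTO sensor_readings VALUES (?, ?, ?)",
                 ("2024-05-03T10:00:00", "thermostat", 21.5))
```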

Data products. We are thinking more and more in terms of data products. We used to have dashboards and reports, and now we’re seeing our data products can actually be the data itself, right? And we’re also seeing that models are data products, right? So all of these are really encapsulated under this idea of data products, and the data products are really the way that all of these teams to the left interface with the rest of the business. So data fabric, hopefully you have something like this in your enterprise today. 

And so data fabric is really, I think, the technological component that makes something like data mesh, decentralization, actually doable, because you need to have some cohesion between all of the different components in your data stack. I'm assuming that you already have this; if not, the first step is to get to a data fabric. Once you have it, the next phase is to bring intelligence to it. This starts with data security. Data security has always been important and will continue to be even more important, especially as our bad actors get even more intelligent with the assistance of AI. So we need to get smarter too, and we need to really think in terms of integrated security at every layer. And we need to think about how we can build our own anomaly detection models. How can we look for those bad actors and prevent them from actually gaining access?

Data governance. We really need to continue to think about governance as a first-class citizen and bake it into every phase of this entire system, because you shouldn't have a "whoopsie, secure data got out." You should really have privacy built in as part of the process. So whether it's tokenization or whether it's asset classification, we really want to make that intelligent and easy to do. Metadata management. This is incredibly important. We want to have, if not a co-located, then at least a federated view of all of the metadata in the company related to all of our data assets: everything from what assets exist, what their classifications are, and what the meaning or annotations on them are, all the way to what the actual utilization is, who's using it, and what the cost of it is, both in terms of kilowatt-hours and CPU as well as dollars. We want to have all of this readily available, because it is going to inform the semantic layer that this AI is really built upon.

Query experience. We want to have data readily available through our data products, but that's not always feasible, and sometimes we have to fall back to queries. When that's happening, we want our data query experience to be very intelligent and very uniform. We want to have a single place to go to serve queries across a variety of different database engines. We want these queries to be intelligent: optimize them for me, and better yet, just write the query for me. That is where we would like to be. We also want to see query federation, or virtualization, so you don't have to move data and have a lot of wasted movement there, whenever scale permits. APIs and SDKs are the way that our engineers interface with the systems, but they're also the way our systems interface with each other. So we want to see increasing uniformity around these, making it easier and easier for other systems to integrate with our data fabric.

And then caching. As I mentioned, we're very resource constrained, so we want to be very computationally efficient, and we want to think about caching. This is an area where I'm especially excited to see what AI can do to help, perhaps with cache invalidation.

All right, pipelines. This is an area that I think is ripe for automation with AI. With data transport, we shouldn't be spending our time moving data from point A to point B; data ingestion should largely be taken care of for us. But we do need to have really rich semantics around this so we can understand it, and we need to establish data contracts for this to be able to happen. When we do write code, we want to think in terms of reusable modules and functions and put those in shareable libraries. We shouldn't spend a lot of time having to write code; we want to move even more towards a no-code solution, or better yet, a natural-language solution.
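To make the data contract idea a bit more concrete, here is a minimal sketch in Python, assuming a hypothetical `OrderEvent` record shape: a declared schema that incoming records must satisfy before they enter the pipeline. A real contract would also cover types, semantics, SLAs, and ownership, not just field presence.

```python
# A toy data contract: producers agree to emit records of this shape,
# and ingestion rejects violations before they propagate downstream.
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class OrderEvent:
    """The contract: the agreed shape of an order event."""
    order_id: str
    amount_usd: float
    created_at: str  # ISO-8601 timestamp

def validate(record: dict) -> OrderEvent:
    """Check field presence against the contract (a fuller contract
    would also check types, ranges, and freshness)."""
    expected = [f.name for f in fields(OrderEvent)]
    missing = set(expected) - record.keys()
    if missing:
        raise ValueError(f"contract violation, missing fields: {missing}")
    return OrderEvent(**{k: record[k] for k in expected})

validate({"order_id": "A-1001", "amount_usd": 25.0,
          "created_at": "2024-05-03T12:00:00Z"})
```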

Lineage. We want to think comprehensively in terms of lineage. This overlaps with metadata management, because we want a unified view of the entire data world. This starts with our applications, goes into our source databases, goes into our data transformations, both the physical ones and any sort of logical abstractions, and then includes all of our consuming systems: all of our reports, our models, and all of the applications that consume those models. We want to see this end to end. We don't want a black box, because it's incredibly important to understand where this data came from and how it evolved before it got to where it's being served.

Quality has been an ongoing challenge, and this is an area where I'm also really excited to see what AI can help with, because it's funny how many people I talk with where I say, what level of your data would you consider high quality? And they're like, ah, maybe 10%. And I'm like, 10% high-quality, trustworthy data in a company is a problem, right? We really should have 100% high-quality, well-understood data, or at least much more than 10%. The problem is that we either have too few quality checks, or we have too many and we're getting lots of false alarms that people ignore; neither one is a good solution. So we really want to see how we can right-size quality checks relative to their impact and improve the accuracy of our detection, so that we're accounting for things like seasonality and, you know, unexpected behaviors.
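As one hedged illustration of right-sizing a check: compare today's volume to the same weekday in prior weeks rather than to a fixed threshold, so ordinary weekly seasonality does not fire false alarms. The counts and the three-sigma cutoff below are invented for the sketch.

```python
# A toy seasonality-aware volume check: baseline today's row count
# against the same weekday in prior weeks, not the overall average,
# so that normally-low weekends are not flagged as anomalies.
from statistics import mean, stdev

# Four weeks of daily row counts (Mon..Sun); weekends are always low.
history = [100, 98, 102, 99, 101, 40, 38,
           97, 103, 100, 98, 100, 42, 37,
           101, 99, 98, 102, 99, 39, 41,
           100, 101, 103, 97, 100, 41, 38]
today_count, weekday = 35, 5  # a Saturday (0 = Monday)

same_weekday = history[weekday::7]          # prior Saturdays: [40, 42, 39, 41]
mu, sigma = mean(same_weekday), stdev(same_weekday)
z = (today_count - mu) / sigma if sigma else 0.0

if abs(z) > 3:
    print(f"ALERT: today's count {today_count} is {z:.1f} sigma off baseline")
else:
    print(f"OK: within the expected range for this weekday (z={z:.1f})")
```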

DataOps. We want to see how we can take the semantic model and all of this information that we now have and really automate our systems, right? How can we simplify the maintenance and management of this entire process? Then data orchestration ties all of these things together, and an asset catalog provides a rich discovery experience on top.

Future of AI in the Workplace

So I'm just scratching the surface on this, because there's more that I want to get to. I really want to talk about where we're headed, right? This is the near future, the type of intelligence that we're working on. And the question that people have is, once we're done, what about our jobs? Is AI going to replace us? This is the question I get more than any other. Is there a role for data engineers in the future? I think the answer is a bit uncomfortable, because the answer is really, it depends. It depends on where you work. It depends on whether you're at a company that focuses on profitability over quality of experience. If you do work at one of those companies, well, they're going to focus on increasing profit every way they can. Do you work at a company that sees data as a cost center? They're going to want to optimize their costs, reduce them every way they can. Are you primarily in a role that is data movement from point A to point B, with no real ownership and no real ability to influence anything? If so, your job might be at risk.

But the reality here is that, as we've seen through the data and through other companies' performance, the best way to get to long-term profitability is sustainable long-term growth. The best way to get there is through a high customer satisfaction level. And the best way to get there is to understand your customers with data and to improve their experiences with data. For every company that sees data as a cost center, there are so many more that see data as a profit center. Netflix is one of those companies. I cannot say enough good things about Netflix and the data-driven culture that they've built. They've invested heavily in data, and they do an incredible job. When I think about why I don't think AI is going to replace our jobs, I actually have a little story from Netflix about Stranger Things to illustrate this.

When they were looking at content, Netflix's North Star was really about customer joy. They want to understand how they can provide high-quality content to serve all of their subscribers, and one of the ways they do this is with various models that look to see whether they have enough content in a certain set of genres. When you looked at Stranger Things and you looked at the data, the data was not clear. It was a bit inconclusive, because it had a lot of mystery, it had a lot of sci-fi, it had enough nostalgia. If you were just relying on the data or the model to make a decision, it would probably not have decided to go forward with Stranger Things. But it wasn't a model that was making the decision. It was a person, and that person knew something that the data did not. He knew how he felt when he read the script. He knew that even though it hit on all these genres, it was really a genre-bending series the likes of which had never been seen before. He made the decision to green-light it, and what a great decision it was, because it went on to be one of their biggest hits. It won tons of awards. It brought in lots of new subscribers. If we had just been looking at the data, we might not have made that decision.

At the end of the day, if you think about what we’re really doing, we are trying to build better products and better experiences for people. I think you will always get better results taking knowledgeable people who are passionate about what they do and having them operate the models with the additional context that only they can provide instead of having the models try to replace those people. I do think that AI is going to change our roles quite a bit. I really do. 

Rise of the Modern Data Team… Powered by AI

One thing I think you're going to see is that there's a model for that: for all of the things we're currently spending our time tasking on, there's going to be a model that largely operates them. We want people to get out of the tasking phase and into more of an ownership model, where they really are going after objectives. I see there being three key roles when I look at data. The first one is the data product owner, who is responsible for data as an asset. They are working with the source teams and the consuming teams in the business. They are ensuring that we understand the value of the data, that we're crafting the data, and that we have what we need to achieve our goals. The second one is our domain-aligned data expert. I see our data engineers today bifurcating into one of two roles, and this is one of them. If you primarily work with SQL today, this might be your role. This person is somebody who's responsible for going end to end with the data and overseeing the overall systems. They're overseeing the pipelines. They're overseeing the quality. They are ensuring that everything's working the way it should. Google has this idea of a single responsible person: you have to have a single responsible person for anything you care about, or things might not go well. And we really, really, really care about the data, because it's the data that's feeding the AI systems. So I see us having these domain data experts to ensure the quality of what we're providing.

And then the data artisan. When you talk to the best software engineers, they will tell you that software engineering is part art, part science, and I think the same is true of data engineering. So if we have the models that are helping us with the science part of things, you're going to need people who are helping with the art. And if you look at all of these different models, they're not just going to automagically happen. You're going to need somebody who understands what's happening underneath the covers and who is actually making it happen. These are going to be people who are improving the models, maybe creating the models themselves, maybe testing out the algorithms that are then leveraged by the models. These would be the people who today are primarily writing Python and Java and Scala. You might need fewer of these people relative to the domain data experts, but the ones you do have are going to be very high impact.

6th Industrial Revolution

OK, so I can see I'm running out of time. Real quick, I want to talk about the sixth industrial revolution, going back to where we began. I am super excited for where we are headed. There are a few key trends that are enabling us. The first one is quantum computing. Quantum computing today is 158 million times faster than today's supercomputers. 158 million times faster. I struggle to comprehend that. I struggle to really understand what's going to be possible when quantum computing becomes mainstream. I'm excited about this. The second one is graphene. My kids are so tired of hearing me talk about graphene, because I am so excited. I think of it as the material of the future. It is made out of carbon, one of the most abundant materials on the planet, so we don't have to go into really unsustainable mining practices in the way that we currently do. At one atom thick, it is a hundred times stronger than steel. It's practically invisible. It's flexible. It's light. It is the most energy-efficient material that we have ever discovered. So this is going to be the material that is powering our data centers, our spacecraft, our robotics. It's even going to be in our clothes, where it will be able to absorb the energy coming off of our bodies and feed that back into our cell phones, right? So this is incredible.

And then we need intelligence tying all of this together, which is where AI comes into play. So I see all of these things coming together, and when I look at our jobs, I do not think our jobs are going anywhere. We're going to have tons of sensor data, and we're going to have more and more need for data. I do think, though, that our jobs are going to be evolving, and we need to be ready to evolve with them.

Before my last thought, one more. Notably, I took off two years to really do some soul searching and reflection, to figure out what comes next. And I'll be honest, I started in kind of a doom-and-gloom headspace. I wasn't really sure where things were headed, and I was really worried about the future. So I educated myself, and I wish I could share with you everything I learned, but I'll summarize it by telling you this: it doesn't matter if you're a billionaire or if you're homeless, we are all essentially saying the same things today, which is that we are not okay, and that what we're doing is not working. But when I look at every single major issue facing society, I think we have the foundation, and I think AI is part of the solution to making those problems better, right? I don't think we get there automatically, though. I think it's everybody listening to this talk, all of the data people, making better decisions about where they spend their time and where they put their efforts. And I'm not just saying this; I did this myself. When I decided to take my next role, I didn't take the sexiest title, the coolest company, or the highest-paying role. I went to the place where I thought I could do the most good, where I was leaning into sustainability, right? So I think if we all do that, especially if you're worried about where we're headed, take a good hard look at where you're working and ask yourself: do they have good employee practices? Do they have good sustainability practices? Are they part of the problem or part of the solution? If they're not part of the solution, maybe look for someplace that is. If we all do that, I think we get to a much better tomorrow.

I'll leave you on this last thought. Imagine a world where we have a little intelligent agent, a little intelligent support system, sitting in our ear. As you're having conversations with your children, with your partner, with your colleagues, it's actually informing your conversation. And it's also informing you. It's saying, hey, maybe you should check your tone there. Hey, maybe we should revisit that comment you just made later tonight and explore some childhood traumas that might be a bit unresolved, right? Think about how we can all be better as a result, how we can be better people, better partners, how we can do better for the planet. That's what is exciting me. I really think that we have a better future ahead of us. Thank you. That's my talk. Thank you for listening.
