March 1, 2023

12:20 pm - 12:50 pm PST

Block-by-Block: Making web3 Data Accessible Using an Open Lakehouse Architecture

Lessons learned from a year of building – a data and AI platform for web and blockchain data leveraging open lakehouse technologies, including Apache Arrow and Iceberg.

Topics Covered

Customer Use Cases
Open Source
Real-world implementation

Sign up to watch all Subsurface 2023 sessions


Note: This transcript was created using speech recognition software. It may contain errors.

Luke Kim:

So hi everyone. My name’s Luke and I’m excited to share with you today our story of growing from zero to one, building a data platform for Web three and blockchain data. So this story is really just the first chapter. We’re a fairly early stage startup. We’ve been building this for about a year. so just like all stories, we’ll have a beginning, a middle, and an end. And so we’ll go through what problem we’re solving here how we did it with Open Lakehouse architecture and some of the learnings that we got from that. So, the beginning, who am I and what, what are we doing? So, Luke Kim founder and CEO of Spice ai. We are a VC back startup here in Seattle and focused on building a platform to help developers build AI driven applications. Before that, I was at Microsoft, co-founder of Azure Incubations, the office cto, worked on a bunch of developer services, GitHub infrastructure all of our major data platforms and Visual Studio as well.

Some of the projects that came out of my teams were dapper, which is quite big over 20,000 stars on GitHub. has open source has startups behind that now as well. and then we also have our own open source projects by AI oss that we are developing as well. So there’s been a per proliferation of AI products out there right now with chat G P T and so forth, and we worked on GitHub. and of course, GitHub has its own GitHub co-pilot, which helps you write code. And so you’ll see things, I think in every aspect, every industry where you have a co-pilot that helps you basically be better. And we think the same concept is going to apply to applications. And so along within your application process, you’ll have a copilot runtime process, and that will take contextual information from your application, application data, environmental data and so forth and utilize AI to provide that application with decision, decision recommendations and experience to help that application make high performance autonomous decisions. And really where we’re going here is we wanna go from just say, recommendations in applications to really autonomous software, right? You can think of an great analogies, self-driving vehicles have autonomous vehicles, and here we’ll have virtual autonomous applications that run throughout cyberspace.

So what really drives AI progress, if you look back in history, it’s always been massive open dataset. So start with meteorological data. It’s really drove a, a bunch of machine learning in where the forecasting and so forth. And then of course, early days of the web really grew this another step change in machine learning for great text data sets across all of the web. And of course, Google is still built amazing business out of that even up until today. Then of course, we had access to maps and geographic data, and that produced, you know, your Ubers and your Lyfts and all these self-driving star cut ups and Tesla and so forth. And again, another step change in AI progress throughout the ages. And of course, here we are today where we have these amazing things like chat, pt, mid journey, stable diffusion, and all of these, again, really that were driven from access to a big massive open dataset, which is rich media, social media bunch of Wikipedia stuff bunch of code from GitHub and so forth, which really wasn’t available previously.

And at Spice, we believe that there’s gonna be another step change, and that is blockchains. Blockchains are essentially the world’s largest distributed open set of compute execution data. So not just in terms of transactions, but when you have blockchains like Ethereum, which have a virtual machine that come along with them, essentially masses, amounts of compute that’s being done in the open in a distributed manner that we can actually access as a training data set. And we believe that this is going to lead to another step change in the ability to create decision making, AI and autonomous applications. So given that thesis, we thought, okay, we’ll go out and get access to all the blockchain data out there in the world and use that along with the other data certs to feed into our training for decision making ai.

And so we come to the middle, and the middle is we build this big data in AI infrastructure for Web three data. But it turns out that ironically, even though blockchain data is open, it’s actually extremely painful to go and use. it’s incredibly slow. I, you know, essentially blockchains are amazing, but they’re also the world’s slowest databases. and so you have to go build and operate these massive blockchain nodes. Ethereum’s about 13 terabytes raw format, solana’s about 150 terabytes in raw format. That’s one node. You have to build and operate all the data infrastructure around those nodes, and you have to un understand really quiet, detailed technical details about smart contracts, execution logs, virtual machines, and so forth. So here is a little bit of a primer on blockchain data. blockchains essentially arranged in these chains and they are ledgers of data.

And so you have essentially this raw state. Now, if we want to go and take that state to train off, we essentially do what a lot of traditional systems will do today. We ingest that data, we etl it into our, into our system, but blockchains also have an additional property, which is non-traditional, and they have these virtual machines. And virtual machines also work off this state. And so as part of our process, we not only have to look at the state, but we have to do function calls against that virtual machine and run actual code in our e TL pipeline and combine that with the state to actually produce our corpus of data. and so that’s actually a quite a significant undertaking. Nothing out there is actually designed natively to do this. and so we knew that we were gonna have to build our own architecture.

And so being an early stage startup, we wanted to utilize as much stuff out there as possible, not rebuild the world. And so we really wanted to take bets on industry adopted big open source projects like Apache, arrow, parquet and so forth. we also wanted to be very high performance. So we’re very interested in building a lot of things in Go. And we looked for go support in each of these projects as well because this is all open a massive, massive dataset across all of these different change. and we knew that we wanted to do both query machine learning and a whole bunch of different type of compute workloads on the same set of data. Open Lakehouse architecture was really important for us. and then also a mesh architecture was important for us because it turns out that, say, Bitcoin and Ethereum and Polygon are completely different and so they have different, you know, maybe different stores that we wanna put them in. and in addition, we wanted to combine our on chain data with off chain data. So things like social media, Twitter, and so forth. And so the ability to take all of these data sets, combine them into one view for both query and training and inferencing was super important to us.

And so here’s what we essentially do today. We take a bunch of blockchain nodes, we index and enrich that data. we make it available in real time and for complete historical query all the way back blockchains are transitive. So essentially if you say, want to get the balance of it, so of a chain, cuz it’s a ledger, you essentially have to go all the way back to the first block, to the current block to actually find what their balance is. So we need to be able to access in real time and full history. And we also knew that we wanted to do full hosting AI ML on this data. we enrich it into things like NFTs and defi and so forth, and make it available over high performance Apache Arrow flight APIs to our customers and, and the ability for them to use the current ecosystem of data science libraries today.

And so again, just to reiterate, the two things that we really cared about here is the ability to have this data mesh across a monitor of different sources. And then in addition, we need us to be super fast because we wanted to do realtime inferencing, realtime query. And so that’s where a tiered cashing model really made sense to us. And that basically made Romeo the natural choice because it’s really aligned with these open technologies like Iceberg and Parquet and Arrow. and then it had both this meshing ability and the ability to do tiered caching. So here is a architectural diagram of what we have today. It’s quite simplified just pulling out the main pieces, but you can see down the bottom here, we access a bunch of blockchain data and we have these little managers for them.

And all of that gets indexed through our own systems. And we drop that in data lakes and data warehouses and post and elastic search and these types of things. and then we use Dremio to really produce a view over the top of them that’s unified and highly, highly performant. and then because we’re an an actual productive within external customers, it’s a little bit different to, you know, many of the use cases here, which are internal systems. it’s really important that we have very high performance and very high stability because other customers of ours are building applications that rely on this. And so if a query goes bad or, or call goes wrong, their application breaks is not just like a dashboard breaks and you have to refresh it. now we also have this dremeo manager here, which I’ll talk about in a little bit. but the point to take away from this is we essentially have built a product, an entire platform around using dremeo as a platform.

So to reiterate this fact we’ll Dr. Dive in a little bit to our API service. And here you can see we have a user. and that could be a dashboard, it could be an application, it could be a notebook. And we have our own flight server and it connects over Apache flight. and we also make a HTP available as well. we have an compo component to do au authentication and rate limiting and billing. And then we passed that through to Dremio on the backend over over flight again. So everything is as flat as much as we can. this really iterates, I think the, the ability that we’re, we are produc, we’re providing all this as a platform to end user customers external to us. and so it’s really, really important that it is highly performing. So because of that, we had some challenges and that, you know, we heard some here that, okay, we’re gonna go build reflections daily or maybe hourly.

We do that, we build reflections all the time in real time. and we have data coming on these blockchains even up to every 150 milliseconds. So we had a bunch of challenges to solve around realtime time series data. It was a massive joins. So these blockchains, they can have billions of logs cuz you’re doing every execution of the trace logs in each one with to, to join those with transactions. So we have to come up with some creative ways to do partitioning and scoping of our queries to be able to make that possible. And then there’s really nothing out there that’s designed for this type of data in this way. And so we had to do a whole bunch of tuning for stability similar to what was said here before in terms of things like dedicated engines and partitioning and, and so forth to, to make it highly.

So that led us to create Dremeo Manager. So Dremeo Manager, we call, we also call it the medic internally. but it is, it allows us to really scale both Dremeo and ourselves cuz we are, again, an early stage startup, fairly small team. We, we now have the ability to do everything in Drio from source code. So we have all of our reflections, all of our data sets the, the scoping for those, the each field that goes into a reflections all defined in yammel in source code, so in source code and through a, a continuous C I C D pipeline. And so that really gives us the ability to have full raise rebuild capabilities. and then we also have 100% automation. In fact, I do not let our enj engineers access Drio front end at all. Everything that we do is completely over APIs.

And so that allows us to have this massive automation and, and really scale the use of drio to a huge number of data sets, huge number of reflections. We also have continuous monitoring and integrity of the, the, the integrity and health of drio the machines, but all the way up into dataset and reflections, and then even the data integrity on those. So if a reflection’s still our system will continually monitor, tell us, and then say, Hey, what’s wrong with that reflection? And automatically restart and rebuild it. redefine it if it, if it has to. it also does a whole bunch of queue management. and again, like everything is defined in source control, including our iceberg skimmers, for example. It’s all deployed out through C I C D. And this has really given us the ability to, to scale our use of drio.

Here’s a little bit of our monitoring. We monitor every little thing that happens in the system from every failure to every reflection update. here on the right we have one of our deployments, we have 143 reflections. And you can see each one of those spikes down is where reflections broke essentially. And our system immediately identified that and rebuilt them or, or, or fixed them as we were required. we also monitor everything from heat usage to job waitings and queues and we’ll will, will take automated action where required. So as I mentioned, different blockchains have different use cases and, and different characteristics. So Bitcoin, you get new data, every 10 minutes slide, you get new data every 150 milliseconds, and they’re actually making that faster. our goal as a platform was to make data available for query within second, ideally subsecond of that data available.

So that means a new block gets minted, all of that data goes through our entire retail pipeline, our processing, everything in drio available for query in a second. and as far as we know, when we talk to some of the people at drio, we’re one of the only people that are doing this amount of data this faster this rate. and so we have to make a high, high, high usage of things like incremental reflections, dedicated engines high level ing and so forth. We currently have 120 terabytes of data. We’re about to add Solana, which is another 150 in raw thermal without even reflections. And so we’ll probably double our usage every six months, three six months. and so things like iceberg is just critical for us because to both offer the performance that our customers expect and the ability to offer at a cost efficiency, we have to be able to read stuff read the least amount of data off as possible in real time.

So that was a quick run through of, of what, where we got to, here are some learnings that we got from it. for drio, the bigger the cash, the better. a lot of problems can just be solved for bigger, bigger machines. We have two terabyte rare machines with 35 terabyte C3 cashes right now. 128 cores. makes everything better faster, it’s great. also we can make use of tiered caching. So we will have reflections next to the machines. and then we actually make use of a shared data lake from a, a bunch of different deployments, a hundred percent automation to, to run at scale. I’ve already mentioned that as as helped us massively. We, we basically have like, we don’t even have like an on-call, like it’s just all automated. safe deployment practices. We, we did a lot of this at Azure when I was in Azure.

You want to deploy out to a deployment, you wanna really monitor it, test it, ensure that it works, and then deploy your next one. and so we now have a complete automated deployment checklist where the CICD system will deploy out a new version to do all the testing. It’ll build reflections, it’ll test reflections, all this type of stuff, and then move on to the next one. we’ve found out that most streamer upgrades are one way. And so it’s really hard to roll back once you do it. And so it’s really important for us to ensure that we use safe deployment practices. performance is very heavily reliant on the job schedule and query planner. So things like number of CPOs really matter. High, high number of burstable. Cause even things like the order of union alls mattered in terms of how it could, how many calls it needed for execution. And again, we mentioned these large joint patterns. so things like range restrictions, scoping really mattered. petitioning really matters here. then also check the results. So our automation has kind of mentioned it would like create a, a reflection, but then it goes, checks it, make sure that reflection actually had the data that we expect and so forth all in real time.

So a bit of the learnings we have, where, where are we going from here? As I kind of mentioned, we both have hybrid on, on-prem and cloud. so we make use of elastic engines at cloud, but these big high performance machines on prem and then have shared data lakes where each one of those deployments can point to them that way. You kind of have this model of tiered data all the way back up to, to the different deployments. we’re deploying a bunch of machine learning infrastructure right now, and so we’re also pretty excited about the potential to use some of that infrastructure for hardware acceleration on queries within drio. and you can see this thing on the right here. We’re already making use of a lot of this data to do time series prediction and forecasting, anomaly detection, this type of things which will help us build applications going forward.

Things we would like to see. As I mentioned, we use during a, as a platform or everything through api. So, and we try to do everything over flight. So even more metadata over flight would be awesome. Richer job APIs and, and and APIs in general. the two things that I mentioned that we really get from varying, one is a mesh. And so really deep integration with sources helps is, is more helpful for us. So for example, Postgre has a raise in it. those arrays we have to translate to text lists in Drio. So if we’re gonna use something as a data mesh, it’s really only as good as all the things it can mesh with, right? And so improvements there would be awesome. All of our dairy is time series, so blockchains are time series. So even better time stories, support would be awesome.

And then in terms of iceberg, we think clustering would really help. So block numbers and timestamps are both unique and incremental and the same for every block. So every block has a block number and a timestamp, and the next block has a block number and timestamp. But today we have to build two reflections, one for block numbers, one for timestamps. Ideally we could build one reflection and the reflection would know, okay, both of these things I can index into and they’re the same. and then web three data use a lot of 2 56 big numbers. So we would love improvements to Jeremy on Sports 1 28 today. just little things like this with we how to work around it would be awesome. So that was our journey from basically Zoom zero. We started building this in January last year, literally first line code built on in complete data platform for all of this with about three people in the first half of the year. and we did a lot of that because we were able to automate a lot of that. and yeah, we’re pretty excited about we can go f forward from here with Drio and this, this open Lakehouse architecture and the ability to apply large scale machine learning to it. So with that, thank you. And we have time for any questions. Happy to answer.


Awesome. Thanks Luke. Yeah. so we can start, see if there’s any questions in the audience. Did you have one? I ? Yeah they can go first. You’re allowed a couple , there are questions online as well. okay. Here I can just give you,

Audience Member 1:

So the first question is about getting the data off the chain. and obviously that can be expensive and time consuming. So how have you guys managed to get the data? And then are you guys building machine learning for each chain individually or are you actually putting that into one overall pool that you’re putting? You know, I don’t know if you’re a, I don’t understand the businesses like selling the algorithms you’re creating or how that works to your customer yet, but are you basically combining the different, like Solana and Ethereum would be a, a good thing? Are you combining those data sets to build models on top of, or you again, just building ’em by chain?

Luke Kim:

Yeah, great question. So think the question was how do we get the data out of blockchain nodes and then are we combining all the data? So the first question is, we run our own nodes. The only way that you can do that to do the the scale is to run your own nodes. We initially started using a hosted provider. We used, yeah, we used, I think all of our paid usage in the first five minutes in our development stamp. so we do millions and millions, if not billions of calls against these nodes. and the model that we’ve used is u if we went back to that architecture slide, we have this little manager. We do everything on the same box as the node, and it’s all Apache Arrow internally, and it’s all as much zero copy.

So we basically get as much stuff into our little process as possible. It’s arrow from there on. we do all our processing so just a lot of hits against the node. and we do that in real time. It’s not so bad, but if we have to have to backfill data, then yeah, we, we can go all the way back to the beginning of the chain. It can take weeks to, to get all that data. in terms of the meshing of the data, we, we have it all available. This was one of the things Drio gives us is all available in one pool. So you can do cross joins against Solana across Bitcoin. You can do, we have people doing NFT analysis or security analysis across say, NFTs across Solana and Polygon and Ethereum. or Unis Swap is a defi protocol that’s on several chains.

And so you might wanna do a query across all of those. right now pretty much any use case around security is really important. securing, securing the system. So monitoring things like liquidity pools in defi or lending pools to ensure that you don’t get, kind of get some of these events that we’ve had last year in the system. Things like FTX and lunar and these types of things. Can we monitor all that data, provide observability and take mitigating action before these things actually get bad? yeah, so I think overall, the, the whole, the whole industry is gonna get a lot safer when we have a lot more of these technologies as and be able to monitor, have this observability around security. Yeah.


Great. Thanks. we had a question here online that basically if I can paraphrase, it’s, it’s kind of, you know, you guys have not just, you’ve built this, all this framework, right? All this automation and not just like in theory, you’ve actually in practice, you have all the crafts charts, shows the ops, right? You’ve actually seen how it works. How did you go about and kind of developing that kind of comprehensive chest framework to know when something does go wrong that you, that you notice it?

Luke Kim:

Yeah. great question. So I was 13 years at Microsoft. The last two years I was in the CTO’s office in Azure and I’ve been building large scale distributed systems for most of my career. And so it took a lot of the lessons from that basically automate everything deploy as fast as possible so every deployment goes through as fast as possible and rely on real time in production monitoring. we would rather have something break in production and fix it really fast than trying to make everything work perfectly beforehand. Because if you get into that model and it is basically a model of resilience, then you you can, cuz something’s always gone wrong, go wrong, you’re always gonna miss something, then you can essentially recover as fast as possible. And so that was the principle really first principles approach to, okay, how are we going to scale and run this at at scale. basically it it mantra it as close as possible and build as much automation in the fixing as much as possible and then be able to deploy it out as fast as possible. If you have those three things then you can make a pretty scalable system.


Yeah. Sounds auto experience. And that could be a whole talk on its own, huh? . any other questions? from the room? Yeah, James?

Audience Member 2:

Yeah. you mentioned the use of heavy use of reflections. I mean, how, how, how are you keeping these reflections fresh? At what inter what intervals in order to, you know, do the reporting that you guys need on the platform?

Luke Kim:

Yeah, so to do the really high performance stuff, we essentially make use of Dremio data meshing capabilities. So we don’t put data immediately in data lakes. We put it in things like postgre and other realtime stores and we join that data together and then we will another property of blockchains is they have these things called reorgs where you can essentially have a data the longest chain wins. So you’ll have a set of data that data could end up being thrown away and it will go completely to another set of data. So we essentially have to have this buffer where we are monitoring the data. If a reorg happens, then we swap over to a new set of data. and then when we are confident that that data is going to be permanently written into the chain, we then take all of that, put it into a data lake, and then run the reflection on that. So we basically have the reflections only on a data lake not on the postgre data, but we join the two using the data mesh. In terms of the, the interval, that can be depending on the chain, it can be anywhere from a minute, less than a minute, all the way up to 30 minutes. An hour. Yeah.


Cool. Thanks. Any other questions here in the room? Cool. All right, well thanks Luke. You’ve definitely got some good feedback online. Great session. This is fantastic. Thanks for the amazing session. So thank you very much. you’re round applause for Luke.

Luke Kim:

Thank you.