Subsurface LIVE Winter 2021
GOing Native with Arrow Flight and Dremio
While some cloud data lake technologies have great integrations in multiple languages, others are actually a bit difficult to use with Golang as your primary language. This session describes how and why FactSet pursued native Golang connectors to Dremio, as well as a native Golang Apache Arrow Flight server and client implementation.
Matt Topol, Principal Software Engineer, FactSet
Matt Topol is a Principal Software Engineer at FactSet, with a large passion for distributed data systems and contributing to the open source community. Since joining FactSet in 2009, Matt has worked on both infrastructure and application development, including leading teams and architecting large-scale distributed financial data systems. As part of building data-driven computational systems, Matt has contributed to several open source projects, including a popular Golang HDFS library and Apache Arrow's C++ and Golang libraries. He has also created open source projects like Quart-OpenAPI for Python, and a driver for Golang's standard database library called go-drill.
[00:03:30] Welcome [00:04:00] everyone. Thank you for joining our session this afternoon. My name is Emily, and I’m from Dremio. I’ll be moderating this presentation today, and I’m glad to welcome Matt, who will be presenting “GOing Native with Arrow Flight and Dremio”. If you have any questions during the presentation, please type them into the chat at any time. We will address these questions live during the Q&A session at the end of today’s presentation. And now I’ll turn it over to our speaker, Matt.
Hey. [00:04:30] There’s my slides, cool. Hi everybody, I am Matt Topol. I work at FactSet Research. If you’re not familiar with FactSet, we provide financial data, and utilities for analyzing that data, to lots of places, including hedge funds and banks. I’m actually going to use some of that as context for the presentation. [00:05:00] One of the recent projects I was working on involves financial transaction deal information. So you have the relationships that you’re dealing with between deals: every deal can have one or more issuers and buyers, the issuers and buyers are companies, and you have all the financial information related to the companies. And then you have even more relationships between the companies and their investments. [00:05:30] Investors are a part of those investments, investors in the companies, and so on, so forth. So it’s very complex, relationship-wise.
You’re dealing with many-to-many-to-many-to-many relationships. And all of this data is stored in different locations, different physical places, different databases. So to address this, we have all this data in different sources, and we need to get it to the users in a way that they can utilize. When you describe the data in that format and [00:06:00] you have all those relationships in those ways, one of the first things that comes to mind is using a graph database. And so of course, we tried that. We tried Neptune, we tried Dgraph, and we benchmarked a couple of different use cases against our current implementation and against utilizing Dremio. And as you can see from the data here, Dremio was far and away the significant [00:06:30] winner in almost all those use cases.
Actually, pretty much all of the use cases we benchmarked. So why did we think of Dremio? Well, it came up because the data currently lives in SQL Server. And even if we’re evaluating that data quickly, ODBC becomes our bottleneck. And that’s what the bulk of this presentation is going to get into: that data transfer piece. [00:07:00] We’re also in the process of moving away from SQL Server. We’re trying to onboard different content with different formats, different locations, and we have those different plugins and communications and deployments. If we’re going to use Dremio as that backend source to collect all the data together and get all that data to the users, we need to build some kind of service to interact with it. And that’s where Go comes in. [00:07:30] Before I get into the specifics, and aside from personal preferences and whatnot, one of the big benefits we got from Go is static compilation instead of dynamic linking.
Now I’m not going to get into the big, full discussion of static linking versus dynamic linking. Suffice it to say, for us, when you’re dealing with a [00:08:00] Heroku-like platform where you’re deploying services via containers, if you have a bunch of binary dependencies and libraries that you’re depending on, deployment on those containers can become very difficult. When you’re building those containers and you have binary dependencies and libraries, you need to ensure you have all of the other dynamic libraries that those dependencies need. And then if your containers aren’t all the same operating system [00:08:30] and platform, maybe some of them are Red Hat, some of them are Ubuntu, you need to make sure that everything is compiled and prebuilt for all the different platforms. So that becomes part of the issue there. Whereas by utilizing Go, the fact that everything is statically compiled, and we get a single binary that we can just deploy, makes life significantly easier and makes deployments significantly easier. Not only that, but I rather like their Gopher mascot, that’s my favorite mascot when it comes [00:09:00] to programming languages currently.
We have our data sources and we have Dremio as our interface there, and we have a Golang service, so we need to communicate between Dremio and Go. Now, the two obvious answers are either using the REST API, which Dremio provides, or using ODBC. Now, there’s pros and cons to both. REST is fairly simple to utilize, [00:09:30] but you’re dealing with very large JSON blobs, which means that back and forth across the network can end up being slow, especially if there’s a lot of latency with where you’re connecting from. On the other side, ODBC can be very fast because of its binary protocol, but as was mentioned in the keynote today, ODBC was originally designed during a time when networks were a lot slower and data was not as large as it is now. So in the case [00:10:00] where we’re pulling data from Dremio, or even from SQL Server, ODBC itself has become a bottleneck for that data transfer.
If we wanted to hit the performance goals we needed to hit, we needed something better than either REST or ODBC. So the first thing we tried takes advantage of the fact that Dremio has an RPC protocol that it provides. In fact, since Dremio is built on top of Apache Drill, they [00:10:30] both implement the same protobuf-based RPC interface, which means that I can communicate with Dremio in a binary protocol, with direct messaging, significantly faster than ODBC. In fact, this is how part of their ODBC driver is actually implemented, which means I can bypass all of the back and forth of the actual ODBC layers.
Now, of course we could have used ODBC, but then we’d be back [00:11:00] in that same binary problem, where we have those binary dependencies that we have to deploy and make sure are built for, and run on, different platforms. And we’d also have to call into them via cgo, which just increases the complexity, as opposed to a pure Go library. So of course, I had the big idea of: let me try to write a pure Go library. Now, to handle all of the encryption and authentication, I needed to deal with GSSAPI and [00:11:30] Kerberos.
The result was a library I like to call go-drill, because it used the Apache Drill native C++ client as a basis, and then I built a native Go implementation from that. In order to build that out, I ended up having to do a lot of researching and reading of the underlying Kerberos code, reading various RFCs, and debugging GSSAPI and SASL, the Simple Authentication and Security Layer, [00:12:00] which is not very, very simple. Because I needed to see exactly what bytes were being sent back, to associate that with all the specs. Now, for the simple use case I was testing it with, we saw a 300% improvement in speed. It was almost entirely based on the data transfer, plus how we handle the data in memory. And as I increased the size of the data I was transferring and dealing with, the improvement got better.
We got even better performance because [00:12:30] of this improved protocol and this improved way of transporting the data. So of course you might be asking, wondering, what is it about that protocol? Well, the protocol is fairly straightforward. You’ve got a varint-encoded length of the whole message. The message itself is a protobuf-encoded message that contains a header. Then it contains raw bytes, which are another protobuf message, which is the actual message being sent, [00:13:00] and then optional raw data at the end. And that optional raw data is the key piece here, because when you make a query and then you get your result set, you have a protobuf message in that actual message, which is the description of the data. And then the raw data comes down in that optional spot with no serialization, no deserialization, just a particular format.
So the [00:13:30] format of that data: if the data contains nullable values, it’s going to be one byte per value, containing a one or a zero for whether it’s null or not. Then if it’s a variable-length type, such as strings or byte arrays, things like that, then you get the number of values plus one offsets. They’re actually [00:14:00] 32-bit, so it’s four bytes per offset. Those tell you at what offset into the data each value starts. And then you just get contiguous bytes of data, period. Which means that, because of the benefits of Go, where you can just create slices that reference particular sections of memory, particular sections of that array, I don’t have to copy it. I can just interpret and utilize the data as-is, right [00:14:30] there. So the data comes across the wire, I have it in memory, and then I don’t have to copy it, I just reference it and deal with it.
It’s very straightforward, very easy, avoids any copying, and all the schema information I need is in that protobuf message that was above it. Now, if you’re using ODBC, your driver might be implemented in this way, or it might not. You don’t know unless you know the implementation of the ODBC driver, and even then you have to get the data back from the driver. Now, [00:15:00] some ODBC drivers do allow you to directly reference the data, but again, unless you know how your ODBC driver is implemented, you won’t know for sure.
This kind of direct referencing of the memory, and the direct handling of it, is one of the main reasons we got that performance increase. Go-drill is open source. Like I said, it works with both Dremio and Apache Drill. And it provides [00:15:30] a direct client; you can see an example on the right side there of how you might use it. You just connect by passing it the options you need and submit a query, and then you can just loop over the batches of records.
It also provides adapters that return the data as Apache Arrow. But even better, it can actually be used with Go’s standard database/sql package. Which means that any of the other packages you might use with Go that are built on the standard database [00:16:00] package that Go has as a language will work using go-drill as the underlying driver. So you can get the benefits of the performance improvements without having to give up whatever adapters you’re using; you can still use them on top of it. And it’s also concurrency-safe, which for Go is a big deal, since with Go you’re typically dealing with a lot of concurrency.
[00:16:30] At this point, we’ve got our go-drill, we’ve got Dremio. Now, we could just put that interface in front of the users, but when you’re building a service like this, and you’re building an application, you don’t want users to have to write raw SQL if your users aren’t the other engineers on the team. There’s a lot of reasons for this. One big reason is that it’s very, very easy to write inefficient SQL. It would also [00:17:00] mean that you’d have to expose your underlying database schema to your users, which means if you ever change your database schema, then you have to have your users change as well. You don’t have any abstraction, you don’t have any interface layer that hides that implementation from the users.
And that’s really what we wanted: we wanted that flexibility so that if we ever needed to change something, we wouldn’t break users. [00:17:30] Given our current dataset and our interfaces and our communications, it’s actually pretty interesting: we ended up going with GraphQL. Now, most people who are already familiar with GraphQL might be surprised to hear that, given that we’re not using JSON as our data transport. To actually utilize GraphQL, all we had to do was convert the GraphQL to SQL queries, and then use the driver.
And given, as [00:18:00] I described, that relationship model of all that deal data and the financial transactions, going to the buyers, to their company financial information, to their investors, to the investments, we need to be able to model all those relationships and pivot easily, and make it easy for consumers to describe what it is they actually want from us. By utilizing GraphQL in this way, we also get schema introspection on the GraphQL schema we expose, [00:18:30] which is going to be different and separate from the underlying database schema. Which gives us this database-agnostic situation, so that consumers can retrieve the data however we want them to retrieve it, and we can change the underlying data model however we need to be most efficient, without potentially breaking anyone. It actually wasn’t too difficult to implement this, given that there’s already a GraphQL package for Go.
GraphQL [00:19:00] is generally intended to be used with JSON. But if you swap your actual schema around so that you’re dealing with columns instead of rows of data, then you can make it very efficient utilizing Apache Arrow. If you’re not familiar with Apache Arrow, I’m not going to go into a lot of detail, but I’ll just give a quick primer so you understand what I’m talking about. Arrow is an in-memory, column-oriented format that’s designed to improve [00:19:30] in-memory analytics and computation, and it also makes the data model on the wire, when you’re transporting it across the network, exactly the same as it is in memory. By having a column-oriented layout, when you’re producing analytics on that data, you can process it by just reading a whole column, and then the next column, and the next column, or skipping columns if you don’t need them. Now, you’ll notice that this is very, very similar to the data format that [00:20:00] I was utilizing with go-drill, because that’s how Dremio’s internal format lays out the memory.
So it wasn’t a long stretch to jump from that data format to utilizing Arrow internally with the GraphQL. We have the GraphQL converted to SQL, we make the request to Dremio, the data comes back, and then I can just continue referencing the data and avoid copying throughout, just keep referencing the same chunk of memory. [00:20:30] To get even one more up on our interactions and our data transfers, we’re utilizing Apache Arrow Flight. Now, Arrow Flight is a protocol framework that’s been mentioned a couple of times, including in the keynote, that is organized around streams of Apache Arrow record batches. It uses gRPC, which is Google’s remote procedure call framework, that is [00:21:00] based on protobuf, which means that it’s very interoperable with tons of languages. But there wasn’t a Go implementation at the time. So I made one.
Then I contributed that implementation to the Apache Arrow repository, so it’s completely open source, under the Apache license, in the Apache Arrow repository with the other implementations, so we and others can utilize Apache Arrow Flight server and client communication. [00:21:30] The basic design of it is very similar to, and based on, the C++ implementation. And it’s pretty straightforward. You can see an example here where all you have to do is make a new Flight server, initialize it on whatever port or IP you want to host it on, and then you just have to provide implementations for each of the routes that you want to serve. Any route you don’t provide an implementation for will just get [00:22:00] returned with a not-implemented error if you request it. If you need to handle special authentication, there’s an interface; as long as you implement those two functions, you can just pass that object in.
In this case, you see that the new Flight server is taking a [inaudible 00:22:15] and did not use the auth handler, but if you create an object that handles authentication and validating the token, you can just pass that in instead of [inaudible 00:22:24] and then it’ll handle whatever authentication you want. And then you can see on the right that the interface for the server is very straightforward. You [00:22:30] just initialize it, it can return its current address, you can tell it to shut itself down based on signals such as SIGINT or interrupts, tell it to serve, you can manually shut it down or just wait for it to shut down, and then you can register the Flight service itself. The actual client is similarly simple. You make a new Flight client, and it takes in its own auth handler interface also.
So as long as it can authenticate [00:23:00] and it can close, you can pass it that auth handler, and that auth handler works the same way. You can even pass it any other gRPC options you need to use, such as passing it SSL contexts for TLS security setups, or whatever you need, wherever your certificates are. And then you just call Authenticate on the client to actually authenticate. And then on the right side, you can see a very straightforward example of how you’d actually fetch the data. You just do a get with your ticket, and then [00:23:30] you just loop over the batches of records, the batches of arrays that come back, and do whatever you need to process or handle the records.
So, as I said, this is open source. You can go to the documentation on pkg.go.dev or godoc.org, if that’s your preference. We’re using it at FactSet for a couple of projects, and we’re looking at even basing an entire data platform on using [00:24:00] Arrow Flight and utilizing Dremio as a backend there. And Dremio even added recent enhancements for basic auth and bearer token authentication, and contributed implementations for Java, C++, and Python. And I’m going to be updating my Go implementation to maintain parity there in the Apache Arrow repository. And I’m also going to be contributing, relatively soon, an example to Dremio’s repository, [00:24:30] utilizing Go to communicate with Dremio via Arrow Flight.
We have GraphQL, and we have Arrow Flight for how we’re going to send the data back. Well, we take this one step further and use Arrow Flight in between Dremio and Go in this case, because that way the data we get back from Dremio is an Arrow record batch that we can then forward or process however we need, [00:25:00] and then return to the users. At no point in the system are we going to have to pay those serialization or deserialization costs, because since we’re using Arrow, the data is the same on the wire as it is in memory, like I said. Which means that we can just pass the data all the way through to the end. So we get the speed benefits, the performance benefits, and the other integration utilities and benefits of not having to expose our entire database schema to the users, [00:25:30] by providing them that GraphQL interface. When they call, the ticket we get is essentially just the string of the GraphQL request they want.
So I’m just going to leave you guys with the links. You can go straight to the documentation for go-drill. You can go to the documentation for Arrow Flight. The third link there is the actual Arrow Flight spec on apache.org; you can see how it’s defined in the proto file. And then [00:26:00] there are links for Dremio’s site and FactSet on page two. Now come say hi, come see us. Yeah. That’s the end of that one. I’m going to take questions now.
Great. Thanks so much, Matt. Just a reminder for the audience that if you have any questions, put them into the chat section to the right. Additionally, we have a few asks of you. Before you leave today, if you could go into the tab called Slido, it’s right above your chat window, [00:26:30] there’s a three-question survey in there. So if you could complete that about today’s session, that would be fantastic. Additionally, Matt will be in the Slack community after the session for about an hour. If you have any questions or if you want to dive a little bit deeper in conversation with him, he’ll be there. So Matt, question for you: what is, again, the advantage of using GraphQL instead of just SQL?
[00:27:00] A couple of advantages there. One is that the users that are actually communicating with the service aren’t other engineers, they’re actually the end users, the end consumers. And it’s very easy to write inefficient SQL when you’re dealing with very complex queries, and we wanted to make sure that we could control the performance and the efficiency of the queries for our users. The other benefit [00:27:30] is that if your users have to write raw SQL queries, that means that they have to know your SQL database schema, which means that if you change your schema, all of your users have to update. Whereas if I use GraphQL as an interface between the SQL and the user, that means that I don’t have to expose the whole database schema to the user.
Great. [00:28:00] Matt, did you see any reduction in end-to-end communication latency, or improvement in throughput, as a result of using Flight?
Yeah. Well, we saw a reduction in latency in some ways, but most definitely we saw improvements in the throughput. Flight in and of itself is not necessarily… Our record batches are not necessarily small, but what we get out of it, [00:28:30] because we are using gRPC’s streaming protocols, is that the time to first byte is really short. Because we can stream the results as we process and get them out of Dremio, as we process and get them out of our service, and stream them to the user. So the user [inaudible 00:28:46] the service, and then the user can get that result data as fast as we can process it and stream them those batches, rather than having to wait for the whole result set to come back to the service and then the whole result set to go [inaudible 00:29:00].
[00:29:00] Got it. Paula said, “Put slightly differently, GraphQL hides implementation details of the database: table joins, column names, et cetera.”
Exactly. That’s exactly the way of looking at it. And also, we’re utilizing Dremio’s virtual datasets to optimize those joins and that logic, so that when we’re dealing with that communication between the GraphQL and the SQL, the implementation details [00:29:30] are further hidden, to make the GraphQL-to-SQL conversion very easy.
Got it. So Roger asks, “Was use of Go a language preference matter or was there some intrinsic advantage per the use case scenario?”
So, I mean, toward the beginning of the presentation, if you joined late, you might have missed it. Internally at FactSet, we have a Heroku-like platform as a service that deploys via containers. And [00:30:00] one of the benefits of Go in this case is that it has static compilation. And by using Go libraries all the way down, we get a statically compiled binary that we can just deploy with our containers. If we were using something like C++, or other languages that use dynamic linking, with libraries dynamically linking to the ODBC driver, then we’d have to make sure that we have compiled versions of those dependencies. [00:30:30] Yes, technically you can statically compile them all together, but then you have all the potential drawbacks of larger binaries. And I wasn’t going to get into the full discussion on static linking versus dynamic linking. But aside from personal preference, there were also those other benefits that we got. But personal preference was a part of it.
Got it. It looks like that’s all the questions [00:31:00] we have for now. So just a reminder that Matt is available in the Slack community. Additionally, I also posted a link to Dremio’s virtual lab that’s available during Subsurface. This is an opportunity to spin up Dremio really, really quickly, it should take less than a minute, and you can get 30 minutes of lab time in there. You can play around with the app. So a lot of people really loved this session, Matt. So thank you so much, we really appreciate it. Thank you to everyone who participated, who asked questions. We really appreciate [00:31:30] it. We hope you enjoy the rest of your sessions today.
Thanks everybody. I hope to talk to some of you guys on the Slack.
Thanks Matt. Bye.