Subsurface LIVE Winter 2021
High-Performance Big Data Analytics Processing Using Hardware Acceleration
In order to address the challenge of increasingly expensive and time-consuming big data analytics pipelines, hardware accelerators such as GPUs and FPGAs are increasingly being used to reduce the overhead associated with data processing, and improve the utilization as well as the cost and power efficiency of compute infrastructure. These systems are being integrated in various cloud services such as Amazon and Nimbix, and have become a prime feature of the Microsoft Azure offering.
This talk will give an introduction to FPGAs and discuss their advantages and challenges in the context of big data analytics. We’ll also discuss Fletcher, an open source platform to integrate FPGA accelerators with big data analytics frameworks efficiently. Based on Apache Arrow, Fletcher is intended to tackle the challenges of long development times and poor cross-platform support, and FPGA components are easily integrated into Arrow pipelines. We will present several high-throughput applications where FPGA accelerators are integrated into big data analytics pipelines. This includes regular expression matching achieving up to 60x acceleration, Parquet decompression and Arrow conversion at 3x acceleration allowing real-time Parquet data ingest, and ultra-low latency JSON to Arrow conversion.
Finally, we will demonstrate FPGA integration into Dremio, allowing for the transparent acceleration of SQL queries on high-performance accelerators.
Zaid Al-Ars, Associate Professor, Delft University of Technology
Zaid Al-Ars is an associate professor at Delft University of Technology, where he leads the Accelerated Big Data Systems group, focusing on developing computing infrastructures for efficient processing of big data analytics applications. Zaid is also co-founder of a couple of big data companies specializing in high-performance analytics solutions and AI, and serves on the advisory board of a number of high-tech startups.
Hello, and welcome to our session today, “High-Performance Big Data Analytics Processing Using Hardware Acceleration,” presented by Zaid Al-Ars, associate professor at Delft University of Technology. We’re very excited to have Zaid here today. It’s a little late in his time zone, as he’s based in the Netherlands, but we appreciate him staying up late to present to all of you today. Couple [00:00:30] of housekeeping notes, just to remind everyone how the Q&A works at the end of the session. You’ll see a button at the top right-hand corner of your screen that says, ask to share audio and video. If you do that at the end of the session, I can go ahead and let you on camera and you can ask your question live and in person.
If you prefer not to be on camera, you can go ahead and put your questions for Zaid in the chat and then I will go ahead and moderate that for him, and we can ask those questions at the end. [00:01:00] Also, if you look on the right-hand side of your screen, you’ll see a little tab that says Slido. If you click that, it will pop up a real fast session survey for us. Those session surveys are really important, so we can continue to improve the content that we bring to you at Subsurface.
So if you could, after the session, just take a couple seconds to go in and click those star ratings before you leave the room. And then you’ll also see that I’ve pinned a [00:01:30] URL at the top of the chat for the session; that is a link directly into Zaid’s Subsurface Slack channel. He will be in there after the session to answer any questions that we don’t get to. So with that, I’m going to go ahead and turn it over to Zaid.
Zaid, thank you again for being here and I will see you at the end of your presentation.
Thanks a lot, Melissa.
Talk to you soon.
See you later. Talk to you soon. Thanks to everybody for [00:02:00] being here in this talk. Today, we’re going to discuss how we can use high-performance computing technologies to address two of the most important big data analytics challenges: on the one hand, the long time it takes to run some of these big data analytics algorithms, and on the other hand, the high cost of running these algorithms in practice.
My name is Zaid Al-Ars. I’m [00:02:30] a co-founder of Teratide, a startup specializing in bringing these technologies to big data analytics, and a professor at Delft University of Technology in the Netherlands. First, I’d like to present our company. We are a high-tech spin-off from Delft University of Technology, focused on the intersection between high-performance computing and big data analytics, bringing solutions that achieve high throughput, [00:03:00] low latency, and low power consumption for big data analytics.
These solutions are getting more and more important for the big data analytics world as the volume of data increases and as the time it takes to analyze this data grows very fast. We have a team of researchers and industry veterans with many years of industry experience, and we collaborate with [00:03:30] a number of tech industry heavyweights like IBM, Intel, AMD, and Xilinx to bring their latest and greatest technologies to the big data analytics world and to the cloud.
Now, before we start, I’d like to discuss a little bit the sources of inefficiency in big data analytics. A lot of the effort in big data analytics is focused on the fancy, sexy part [00:04:00] of analytics: the algorithms themselves, the AI, the deep learning and machine learning. But much of the effort needed to allow these algorithms to do their work is actually related to the data engineering side rather than the data science side.
And that’s why we are all here at Subsurface, addressing many of these data engineering challenges. There is a lot that needs to be done before you can execute some of these algorithms in practice. You need to go through [00:04:30] data collection from various sources and sensors, ingest this data, store it, move it around, explore it, transform it, clean it, and pre-process it with labels and so forth before you can actually run any algorithms on it.
So we can divide these challenges into two categories. One is related to the data analysis, which is a data computation issue. The other [00:05:00] is related to the data collection, which increases the latency of the data coming into your system. Later in this presentation, we’ll discuss how we can use high-performance computing systems to address each of these two challenges, with a dedicated use case prepared for each.
Now, let’s look at the various systems being used [00:05:30] in high-performance computing to address computational challenges. We of course know the CPU, the processor that is the powerhouse of the computer, the server, and the cloud. We can compare the CPU to a very fast car. A car can take you anywhere very effectively and very efficiently. You can do all types of chores with it, and it’s very effective at what it does, but it’s not really specialized [00:06:00] for any particular type of workload.
The other type of system that you can use in big data analytics or high-performance computing is the graphics processing unit, or GPU, which is becoming more and more available. Many of you have probably heard of it or even used it. You can compare the GPU to a race car, which is very powerful and very effective when you want to go very fast on the highway or cover [00:06:30] long distances quickly, but it’s not very effective to drive within the city, for example, where it doesn’t give you any advantage over the CPU. And then we have the new kid on the block, the FPGA: the field-programmable gate array.
Many of you probably haven’t heard of it, but it is being deployed more and more in data centers. This one we can compare to a Formula One race car. It is very fast, custom-made to solve your problem, custom-made for the driver, [00:07:00] specifically designed to go very fast on a racetrack, but using it to go to the supermarket is probably overkill. So each of these systems is able to solve its own type of computational challenge, and which one fits depends on the application complexity.
So, the GPU is very powerful and effective when you have algorithms that are low in complexity and have a lot of parallelism. It has [00:07:30] hundreds, even thousands, of simple cores that allow you to exploit a lot of parallelism and solve these problems efficiently. FPGAs are very powerful and effective when you have applications of moderate complexity: there is a lot of parallelism, but a number of different specialized operations need to be performed, and you can deploy them on the FPGA in a specialized way. And the CPU is very effective when you have algorithms with high complexity, where you require sophisticated algorithms to do [00:08:00] the analytics, and the CPU is very powerful there.
Now, we at Teratide bring together all these technologies and provide a stack that allows you to use each of these systems together in the same infrastructure. The cornerstone of our stack is built around the Apache Arrow in-memory data format. Apache [00:08:30] Arrow is a very hardware-friendly in-memory data representation, which allows you to communicate with the CPU, with the GPU, and even with the FPGA without any overhead and without the need to copy data back and forth. To communicate between the Apache Arrow format in memory and the GPU, Nvidia, the biggest producer of GPUs in the world, developed the RAPIDS interface, which allows you to effectively and efficiently copy data back and forth [00:09:00] between memory and the GPU and process it there.
And we at Teratide developed the only publicly available interface today to communicate data between the FPGA and main memory. This interface is called Fletcher. It allows you to automatically and efficiently copy data back and forth and process it on the FPGA without any overhead.
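To make the idea of a hardware-friendly columnar format concrete, here is a minimal pure-Python sketch of the row-to-column pivot that Arrow standardizes. This is an illustration only, not Arrow’s actual implementation: real Arrow arrays are contiguous typed buffers with validity bitmaps, which is what lets CPUs, GPUs, and FPGAs consume the same memory without copies or conversion.

```python
# Conceptual sketch only: Arrow's real format uses contiguous typed
# buffers plus validity bitmaps; plain Python lists stand in here.

def rows_to_columns(rows, fields):
    """Pivot row-oriented records into column-oriented arrays, the
    layout Arrow standardizes so CPU, GPU, and FPGA code can all
    read the same memory without copying or re-serializing."""
    return {f: [row[f] for row in rows] for f in fields}

rows = [
    {"name": "alice", "score": 10},
    {"name": "bob",   "score": 20},
    {"name": "carol", "score": 30},
]
columns = rows_to_columns(rows, ["name", "score"])

# A filter over one column touches only that column's buffer, which is
# part of what makes columnar layouts friendly to streaming hardware.
high_scores = [s for s in columns["score"] if s >= 20]
```

The point of the sketch is the access pattern: a query that filters one column never has to walk past the other columns’ data, which suits both vectorized CPU code and streaming accelerators.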
Each of these platforms, as we said, has its own unique advantage. For GPUs, the unique [00:09:30] advantage is in processing massively parallel applications that are very simple. FPGAs also bring a number of special advantages, like the ability to connect them directly to storage or to the network. This allows your data to flow through the FPGA and be gradually pre-processed and prepared before it even hits memory or reaches the CPU, leaving the CPU [00:10:00] free to do the complex analytics you would like to perform on your data while the FPGA pre-processes all this data in flight. This gives you significantly lower latencies, delivering the data the way you want it, where you want it, very fast.
Now, because FPGAs are new and they sound like magic, I’d like to describe a little bit how [00:10:30] they achieve the work they do. Basically, you can think of the FPGA not as a processor but as a set of Lego blocks. It has a number of different Lego pieces that you can combine to build a compute system: IO blocks, small logic devices, simple digital signal processing blocks, and memory units that you can combine to create the solution you [00:11:00] want, solving your algorithm not in software but directly in hardware.
Creating your algorithm in hardware has significant advantages compared to solving it on a processor. For example, there is no operating system. You run straight on the hardware without any overhead from instruction translation or data copying. All the data and instructions are present in the hardware and are executed instantaneously on your data. [00:11:30] Also, you can use it in flight, where data flows through the FPGA and is processed gradually before it hits your memory, which allows very low-latency, very high-performance computation. And FPGAs are being deployed more and more in the data center because CPUs and GPUs are facing a number of performance and cost limits in [00:12:00] their deployment.
Now, we are able to bring the FPGA to the data center and to big data analytics workloads with our Fletcher framework. Fletcher is basically an open source framework that reads an Apache Arrow schema and automatically creates hardware interfaces that bring [00:12:30] data in from the host processor to the accelerator, the FPGA, process the data, and take it back to main memory for the processor to work on further.
So, it has two components. It has a design-time component that you use to generate the interfaces in hardware, and it has a runtime component that manages the streaming of the data from the CPU to the FPGA and back to the CPU.
It gives you quite high performance, essentially matching [00:13:00] the performance of the memory interface. We’re talking about 16 GB/s for a PCIe Gen3 interface, and more than 20 GB/s for an OpenCAPI interface to the accelerator.
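As a conceptual illustration of this design-time/runtime split, here is a small software simulation in Python. Every name and data structure below is invented for the sketch: Fletcher’s real design-time output is hardware description code generated from an Arrow schema, and its runtime is a host library, not these functions.

```python
# Hypothetical simulation of Fletcher's two components; these names
# are invented for illustration and are not Fletcher's actual API.

def generate_interface(schema):
    """Design time: derive a hardware interface description from an
    Arrow-style schema (the real tool generates HDL streams here)."""
    return [{"field": name, "dtype": dtype, "stream": f"{name}_stream"}
            for name, dtype in schema]

def run_kernel(interface, batches, kernel):
    """Runtime: stream record batches through the generated interface
    and apply the accelerator kernel to each batch in turn."""
    results = []
    for batch in batches:
        for port in interface:
            # The batch must match the schema the interface was built for.
            assert port["field"] in batch
        results.append(kernel(batch))
    return results

schema = [("text", "utf8")]
iface = generate_interface(schema)
batches = [{"text": ["foo", "bar"]}, {"text": ["baz"]}]
counts = run_kernel(iface, batches, lambda b: len(b["text"]))
```

The split mirrors the talk’s description: the interface is fixed once at design time from the schema, and at runtime only data flows, which is why the streaming can run at memory-interface speeds.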
Now, that being said, I’d like to discuss two use cases that demonstrate the effectiveness of using FPGAs in the data center and the high-performance [00:13:30] computational capabilities of these hardware accelerators.
The first use case is related to the data analytics challenge that we discussed earlier, and the second is related to data collection, reducing the latency of data collection, as we discussed earlier as well.
So the data analytics use case is the acceleration of a regular expression matching engine. [00:14:00] This use case is integrated with Dremio. We have an FPGA engine that executes regular expression matches very efficiently and very effectively, and it integrates transparently into a Dremio execution plan. Dremio has its own query engine, of course, and you can execute any kind of query on it. The specific [00:14:30] query that we would like to show is in the blue box at the bottom. Basically, you get a huge number of strings in, a large database of strings. You compare these strings with a specific query string, filter them, and then count the number of indices where the string is matched in the database.
Now, as Dremio executes this query, it takes the query in and builds a plan. The plan goes through an execution planner with [00:15:00] a number of different stages until the output is calculated. What we do is automatically introduce an extra FPGA acceleration stage into the execution plan. This stage replaces a number of compute steps with a single FPGA compute step, called here the Fletcher Filter Project. The Fletcher Filter Project automatically [00:15:30] takes the data, streams it to the FPGA, executes the regular expression matcher there, and then brings the data back all the way to the output, without Dremio even having to notice that a different engine is being executed under the hood.
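The semantics of that accelerated step can be sketched in plain Python with the standard `re` module: filter a column of strings against a regular expression and count the matches. The data and pattern below are hypothetical stand-ins, since the actual query string from the demo is on the slide rather than in this text; the FPGA engine computes the same result in hardware at multi-GB/s rates.

```python
import re

def regex_match_count(strings, pattern):
    """Filter a column of strings with a regular expression and count
    the matching rows: the filter-then-count semantics that the
    Fletcher Filter Project step replaces in the Dremio plan."""
    compiled = re.compile(pattern)
    return sum(1 for s in strings if compiled.search(s))

# Hypothetical column and pattern, for illustration only.
column = ["error: disk full", "ok", "error: timeout", "warning"]
matches = regex_match_count(column, r"error: \w+")  # 2 of the 4 rows match
```

A single-threaded loop like this is exactly the kind of moderately complex, highly parallel work the talk argues is a good fit for an FPGA: every row can be matched independently as it streams past.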
We took this whole system and executed it on Amazon AWS, on their FPGA F1 instances, in three different [00:16:00] flavors. One flavor is the unoptimized, out-of-the-box, vanilla Dremio regular expression matcher. In another version, we replaced the regular expression matcher of the vanilla version with the fastest available CPU-based regular expression matcher in the industry, RE2. Both of these are executed on the CPU.
And then we replaced [00:16:30] it again with the FPGA-accelerated version developed by Teratide. Now, the performance difference is staggering. You will see that the FPGA implementation is able to achieve 3.3 GB/s in terms of throughput, compared to about 240 MB/s for the fastest possible CPU implementation in the field. That’s [00:17:00] more than 10 times the throughput when you use the FPGA solution.
If you compare that with the vanilla Dremio version, without any optimizations, we’re talking about a more than hundredfold performance improvement in execution. Now, this implementation is publicly available, so we invite everybody to go to our GitHub repository, teratide/tidre-demo, download this implementation, and execute it yourselves. [00:17:30] You can run it on AWS and just experience how much performance benefit you can achieve when executing these queries on FPGAs.
Now the second use case, as I mentioned, is related to data collection, to reducing the latency of data ingestion at ingest. This use case is about accelerating a JSON parser. [00:18:00] JSON is one of the most widely used formats in the industry, and basically what we wanted to do is reduce the latency of getting the data in and handing it to the processor.
We ran this application with one of our partners, Sigma X, on the Sigma X stack, using their infrastructure and systems. [00:18:30] What we did here is the following: we had a number of different data sources sending JSON-formatted data. This data goes straight into the FPGA, attached directly to the network. The FPGA ingests this data, pre-processes it, and reformats it into Arrow-formatted IPC messages that are then passed on to an Apache Pulsar broker.
Now, Apache Pulsar is a big [00:19:00] data messaging engine similar to Kafka, but with much lower latency and much higher performance. Pulsar then sends the data further downstream to consumers.
Now, the FPGA engine here has multiple stages of processing. It takes the data in through a network physical interface, stores it into a buffer, then splits it [00:19:30] into JSON records, and then converts them in a number of stages, parsing and then passing through Fletcher into an Arrow IPC message that is finally taken up by Pulsar. If we compare the performance and the latency of this solution to existing CPU solutions, we see that our FPGA solution achieves microsecond ingestion times.
Now, we’re talking about microseconds compared to milliseconds. That is many orders of magnitude lower [00:20:00] latency compared to the state of the art at the moment, which basically eliminates latency as a bottleneck for the application at ingest.
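The pipeline stages described above can be sketched in software. This stdlib-only Python is a stand-in for what the FPGA does in flight: it assumes newline-delimited JSON and emits plain Python dicts of columns, where the real hardware detects record boundaries on the wire and emits Arrow IPC messages.

```python
import json

def split_records(buffer):
    """Stages 1-2: split a newline-delimited byte buffer into
    individual JSON records (the FPGA does this boundary detection
    in hardware as data streams off the network)."""
    return [line for line in buffer.decode().splitlines() if line]

def parse_to_columns(records, fields):
    """Stages 3-4: parse each record and pivot into the columnar
    layout that becomes an Arrow IPC message in the real pipeline."""
    parsed = [json.loads(r) for r in records]
    return {f: [p[f] for p in parsed] for f in fields}

buffer = b'{"ts": 1, "value": 3.5}\n{"ts": 2, "value": 4.0}\n'
records = split_records(buffer)
batch = parse_to_columns(records, ["ts", "value"])
```

In the software version every byte is touched twice, once to find record boundaries and once to parse; doing both in a single hardware pass before the data ever reaches memory is where the microsecond latencies come from.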
We’re also able to get much higher throughput compared to the CPU. We’re seeing here a performance of about 4.5 GB/s; compared to using all the CPUs, all the processors in your server, just for parsing JSON, we’re talking about six to seven times more performance. And we see that the performance [00:20:30] is actually not limited by the FPGA; it’s limited by the memory interface. There is a lot more capacity on the FPGA to do more processing, more pre-processing, or data transformation, if you would like to do that. If you want more performance, that’s also possible by introducing a higher-bandwidth interface, and then you can achieve up to 25 times more throughput for your ingest.
All in all, at Teratide we can achieve an [00:21:00] advantage by combining high-performance computing systems like GPUs, FPGAs, and specialized CPU solutions to get more than 10 times higher throughput for long-running algorithms in the cloud. We’re able to achieve much higher cost effectiveness, reducing the price of running these algorithms by more than 50%. We achieve ultra-low latency, allowing you to do wire-speed, real-time analytics. And bringing all these [00:21:30] advantages together, we’re able to achieve much higher energy efficiency, enabling the green data center, with energy efficiencies ten to a hundred times higher than what you have in the industry at the moment.
Our systems are publicly available. You can go and check out our GitHub repositories and play around with our solutions. They are integrated with Dremio, Spark, Dask, and other big data analytics frameworks. We invite everybody to go [00:22:00] and check them out. And if you have any questions or would like to collaborate with us, please send us an email at email@example.com. Thank you very much for your attention.
Thanks so much, Zaid. So, we are in the Q&A portion of our session. If you would like to ask a question live, go ahead and hit that button up on the right hand side that says, ask to share audio and video. And in the meantime, what I’m going to do is, I’m going to go ahead and take some questions from the chat.
[00:22:30] Any tips on topics to explore and practical research to do in the space of big data hardware accelerators for engineering students?
Definitely. So, it’s a hot topic at the moment and a very fast-growing field. I would recommend that people start by investigating existing solutions on GPUs or FPGAs [00:23:00] that are already out there. There are lots of resources if you look at the intersection between big data analytics and accelerators: the repositories that, for example, Nvidia is providing with their RAPIDS project, or the big data push in the cloud from Xilinx. And of course you can also visit our website and our repositories. Teratide has a number of different solutions that you can experiment with.
[00:23:30] Great, thanks. So, Harsh Desai would like to know: what is the configuration setup overhead for FPGA-enabled machines with an on-prem big data cluster like CDP?
Yeah. So, it depends on what you mean exactly by configuration setup; it could mean a couple of things. If you just want to spin up an FPGA cluster, then there isn’t any specific extra overhead in getting it up and running in the cloud. [00:24:00] That doesn’t introduce any extra overhead. But if you would like to change your FPGA implementation as you execute your instructions in real time, then there is a little bit of overhead.
It depends on how big your application is. So, you have what we call a bitstream, right? The application footprint, the [00:24:30] binary. If you have a small footprint for this bitstream, then it does not require a lot of time, but the bigger your application, the longer it will take to reconfigure it and run it on the accelerator.
Great. So, let’s see. John Clem wants to know, “I see a lot of comparisons from Fletcher, FPGA to CPU. But can you also provide some comparisons against the state [00:25:00] of the art Nvidia GPUs? I suspect the acceleration would be much less dramatic, but obviously FPGA should have the edge for some use cases?”
Sure. So, that’s a great question, actually. I’ve shown FPGA solutions specifically, and as I said at the beginning, I’m showing these because they are the new kid on the block and they have a lot of advantages compared to other systems. But of course, as I also said at [00:25:30] the beginning, FPGAs are not a silver bullet; they are not good everywhere. Different platforms are good for different use cases. FPGAs are good for the applications I’ve shown, which, as I mentioned, are moderately complex algorithms with a lot of parallelism. But if you talk about GPUs, Nvidia GPUs definitely have the upper hand for very simple algorithms or very simple programs [00:26:00] with a significant amount of parallelism.
These are what we call embarrassingly parallel algorithms, basically the things graphics processing units were made for, like image processing, where you have lots of pixels that have to be processed in exactly the same way with very simple manipulation. Nothing can beat GPUs at processing that kind of application.
In the same way, if you have a very complex algorithm that requires a lot of different [00:26:30] aspects, lots of branching, many different conditions to be executed, then nothing beats the CPU there, not the GPU and not the FPGA. So you have to remember that every platform is suitable for its own specialized application domain, and you can achieve the best price point by bringing all these solutions together and combining them in a stack, like the one we provide.
Great. So, I’m seeing a question here, “ [00:27:00] Any thoughts on H.265/264 codec to Arrow via FPGA?”
Nice question. This is a question we’ve actually thought about ourselves as well. Yes, it is possible to do codecs on FPGAs through Arrow. The question is, of course, how much advantage you will get by creating an [00:27:30] Arrow-based interface for your codec, for image processing on FPGAs.
As you know, image processing is a very widely used, very standardized application, so there are lots of manually optimized interfaces for it. There could be a case for using Apache Arrow to standardize that interface in a more generic way. It will not give the same performance, of course, [00:28:00] as a manually optimized interface, but it will be more generic. So, if you want a low development time but not the most extreme performance, you can use an Arrow interface; but if you want the most optimized interface with the highest performance and you are willing to take longer to optimize it, then you should use a manually optimized one and not an Arrow interface.
Okay, great. So, the next question [00:28:30] I have is, “On the ingest of CSV data, can you please elaborate on the Teratide/FPGA architecture?”
CSV ingest, Parquet ingest, or JSON ingest are all different types of ingest, and they are all very well suited to using FPGAs and Fletcher interfaces [00:29:00] with the Arrow data format at ingest. We did not show a CSV ingest here, but we already have multiple implementations for JSON ingest as well as Parquet ingest, and we can add a CSV ingest just as well. It would also be a very well-suited application, and we expect similar performance to what we have shown for the JSON ingest. We don’t expect the FPGA to be the bottleneck; we [00:29:30] will be able to eliminate ingest as a bottleneck for these applications, and there will be plenty of room in the FPGA to do even more pre-processing afterwards.
Great. I think that’s probably all the questions we have time for today, because we’re just about at time. What I’d like to do is remind people that Zaid will be in his Slack channel for the next 30 minutes answering questions that you might’ve put in [00:30:00] the chat that we couldn’t get to. You can see the URL pinned at the top of the session chat. And then I’m just going to remind everyone to take a minute, go to that Slido tab on the right, and answer those quick three questions about the session so we can continue to improve our content here at Subsurface. And then I want to invite everyone to go to your next session: “Arrow Flight and Flight SQL: Accelerating Data Movement,” presented by [00:30:30] Kyle Porter, the CEO of Bit Quill Technologies.
Zaid, thank you so much for presenting at Subsurface. We hope you had a great time and hopefully we’ll see you at the next one.
Thank you very much, appreciated it.
You too. Bye bye.