Subsurface LIVE Winter 2021
Enabling Real-Time Analytics for Data Lakes with Apache Ignite
For data scientists and engineers, there’s no finer place to spend time than a data lake. But are you really getting what you want out of it when +80% of your time is spent sidelined with data preparation?
Every data scientist and engineer who wants to stay relevant over the next decade needs to start asking themselves the tough questions today: Do you want to be someone who drives real business outcomes or just keeps the lights on? How much time do you actually spend innovating with data? How much simply goes into keeping your pristine lake stocked with fresh data instead?
In this presentation, Matthew Halliday will show the audience how no-code ETL gives data teams the freedom to move fast and innovate with data. He will bring it to life by highlighting use cases from one of the world’s largest quick-service restaurants, a Fortune-10 consumer electronics manufacturer, and a major US federal credit union. Specifically, he will demonstrate how to connect any business application – from Google Sheets to Salesforce and SAP – to a data lake in 10 minutes or less, generating transactional, in-sync Parquet files that anyone can seamlessly leverage for data science.
An award-winning keynote speaker, Matthew guarantees that his presentation will be the most exciting demo of Subsurface LIVE Winter 2021.
Denis Magda, Apache Ignite PMC Member and Committer, Head of DevRel, GridGain
Denis Magda is an open source enthusiast who started his journey at Sun Microsystems as a developer advocate and presently works at Apache Software Foundation in the roles of Apache Ignite committer and PMC member. He is an expert in distributed systems and platforms who actively contributes to Apache Ignite and helps companies to build successful open source projects. You can be sure to come across Denis at conferences, workshops and other events sharing his knowledge about open source, community building and distributed systems.
All right. Hello, everybody and thanks for joining us for this session. I wanted to go through a little housekeeping before we start. First, we will have a live Q&A after the presentation. We do recommend you activate your microphone and camera and talk to our speaker here live, but if you’re a bit more shy, feel free to ask your question in the chat as well, and I will go ahead and ask our speaker, Denis, that question. [00:00:30] Also, just a reminder to complete the Slido survey. That’s S-L-I-D-O tab on the top right side of your screen. It’s a great way for us to get feedback on the conference and on the session in particular, it takes 20 seconds, so that’s super duper easy. With that, I am delighted to welcome our next speaker, Denis Magda, who’s the Apache Ignite PMC member and Committer, as well as the head of developer relations at GridGain. Denis, [00:01:00] over to you.
Thanks, everybody. Thanks for the introduction and joining this session. Let’s get straight to the point. Let me share my screen with you, and we’ll move from there. That’s not the slide I want to start with. That’s the slide. We’ve got 30 minutes. [00:01:30] Today I’m just going to scratch the surface. The primary topic for conversation is analytics, for sure, and how we can improve our analytical operations and queries with in-memory computing, with in-memory systems such as Apache Ignite. Before we move on, a few more details about myself. For the last six years, I have been dealing with in-memory [00:02:00] systems and distributed databases. I’m an Apache Ignite Committer, and Apache Ignite, as you will learn throughout this presentation, is a distributed database that is used as a cache or as an in-memory database.
Also I have the luxury of running the developer relations group at GridGain, one of the major companies behind Apache Ignite. That’s the company that donated Apache Ignite to the foundation and remains one of the major contributors. Before distributed [00:02:30] systems, before Ignite and GridGain, I spent a lot of time working for Sun and Oracle. In those days, I belonged to the Java group. I was developing and supporting the Java virtual machine and JDK for microcontrollers and embedded devices. My professional career started at Sun. In those days, I was championing many technologies for that company, not only Java, but [inaudible 00:02:57], Solaris, and a myriad of others.
[inaudible 00:03:00] [00:03:00] to talk about our primary topic. Analytics, real time analytics, and how is it related to in-memory computing? Before we dive into the in-memory computing landscape, let’s review the challenges related to real time analytical workloads. That’s a marketing slide. Usually I don’t include those in my presentations, but this is a special one I find useful. Let me explain what’s conveyed to us here. [00:03:30] For the last years, if not decades, we have seen a significant increase in the data volume that is generated by applications and, in turn, in the number of operations that are executed at every given second. For instance, to throw in some ballpark figures, presently our systems, stores and databases experience 10, if not 100, times more operations [00:04:00] per second. All those operations are executed over 50 times as much data as we used to have a decade ago.
That there is much more data generated is obvious, right? We all know that, and many more queries are executed over the data, but there is also one requirement that we are seeing in place right now. That requirement is related to analytics. Most of the data ends up in data lakes, be it Hadoop [00:04:30] or any other solution. But right now businesses and companies want to use and get [inaudible 00:04:39] from the data much faster; they want their overnight analytics to turn into real time analytics. Instead of waiting for a response of that BI report that will be returned in a minute, or in an hour, I want to get it done right now. In [00:05:00] a few seconds or dozens of seconds.
How is it related to in-memory computing? Those challenges? There are multiple ways you can improve and modernize your existing system. It’s not a big deal, but I am representing a project that is developing a distributed database that belongs to the in-memory computing landscape. In-memory computing is a seamless and straightforward [00:05:30] way that can help you to optimize what you have right now, with little stress. For instance, you can keep your existing data lakes or any other databases that already store massive amounts of data. If you’re introducing an in-memory computing layer, such as Apache Ignite, you don’t need to rip and replace. You don’t need to break things. You don’t need to discontinue using, let’s say, Hadoop and switch to Ignite instead. You can merge [00:06:00] and use them together.
That’s the beauty of a product such as Apache Ignite. You basically can start with Ignite as a cache, as an in-memory layer; you slide it in between your applications and your data layer, and you go from there. After that, as long as data is in memory, you can execute your queries much faster and the system is scalable. Those are the business issues that are requiring us to run real time analytics. That’s, from the high level [00:06:30] perspective, how these can be solved or optimized using in-memory computing.
Right now, let’s go into details and let’s have a technical conversation. Why in-memory computing? That’s the first question we should ask before we decide to do even any POC. The answer is pretty simple. It’s all about speed and scale.
The scale of in-memory computing systems comes from their well-known horizontal scalability. [00:07:00] The same way we use many disk-based data lakes or databases, we use Apache Ignite, for instance. You can start with a three node cluster. That cluster might be more than enough for your current data capacity and for the current volume of your requests. But then over time, you may need to feed in more data sets or you need to get more CPUs for your real time analytical queries. What do you do? You just scale out. You add five nodes. [00:07:30] You add seven nodes. You add as many as you need depending on your current situation.
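The scale-out step described here can be sketched in Python. This is a generic illustration of one common way to spread data partitions over cluster nodes (rendezvous hashing), not a description of how Ignite itself does it, which the talk does not cover. The partition count and the node names are assumptions made up for the example.

```python
import hashlib

def owner(partition: int, nodes: list[str]) -> str:
    """Pick the node with the highest hash score for this partition."""
    def score(node: str) -> int:
        digest = hashlib.sha256(f"{node}:{partition}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(nodes, key=score)

def assign(partitions: int, nodes: list[str]) -> dict[int, str]:
    """Map every partition number to the node that owns it."""
    return {p: owner(p, nodes) for p in range(partitions)}

PARTITIONS = 1024                            # assumed partition count
three = ["node-0", "node-1", "node-2"]       # hypothetical starting cluster
five = three + ["node-3", "node-4"]          # the same cluster after scaling out

before = assign(PARTITIONS, three)
after = assign(PARTITIONS, five)

# Partitions whose owner changed when the two nodes joined.
moved = [p for p in range(PARTITIONS) if before[p] != after[p]]
print(f"{len(moved)} of {PARTITIONS} partitions moved after scaling 3 -> 5 nodes")
```

With this scheme a partition changes owner only when one of the new nodes scores higher for it, so the existing nodes hand over part of their data and never shuffle data among themselves.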
What about performance? What about the speed? Here, it’s all about the laws of physics. Let’s check this table. That’s not my invention. I found this table on the internet. I bet that some of you might have come across it as well. What does this table show? In the first column, you have some system events [00:08:00] that are usually executed by CPUs. In the second column we have some ballpark numbers: how much time it takes for the CPU to execute any operation from the first column. The beauty is in the third column, because the second column shows us real latencies that are extremely small: nanoseconds are small, microseconds and milliseconds are also small units, but we live in a universe where time is measured in seconds, minutes or days. [00:08:30] That’s why the scaled latency is the latency for our human universe. What did the author do? He assumed that four nanoseconds are equal to one second in our universe, and then the author translated the rest of the latencies.
Once you do this translation and check these three rows, you will see that the main-memory access latency is so much lower compared to disk I/O that it [00:09:00] will be perceived by us as the difference between minutes and days. When you’re trying to read data from memory or load it from disk, it’s like the difference of doing something in minutes versus days. That’s actually the beauty of in-memory computing systems: they’re scalable and they’re fast, just because those are the laws of physics, and we can use them to make our analytical workloads really fast, with little stress and without breaking our existing systems.
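The translation into "human" time described above is simple arithmetic, and a short Python sketch makes it concrete. The baseline of four nanoseconds per scaled second is the figure stated in the talk. The two latency values are assumed ballparks chosen for illustration, not numbers read from the slide, so the absolute scaled values should be taken loosely.

```python
def scale(latency_ns: float, base_ns: float) -> float:
    """Convert a real latency to scaled seconds, where base_ns maps to one second."""
    return latency_ns / base_ns

def humanize(seconds: float) -> str:
    """Render a duration in the largest unit that fits."""
    for unit, size in (("days", 86_400), ("hours", 3_600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.1f} seconds"

BASE_NS = 4.0  # from the talk: 4 ns of machine time = 1 s of human time

# Assumed ballpark latencies, for illustration only.
LATENCIES_NS = {
    "main memory access": 100.0,
    "disk I/O": 100_000.0,
}

for event, ns in LATENCIES_NS.items():
    print(f"{event}: {ns:,.0f} ns -> {humanize(scale(ns, BASE_NS))}")

# The gap between the two rows does not depend on the chosen baseline.
ratio = LATENCIES_NS["disk I/O"] / LATENCIES_NS["main memory access"]
print(f"disk I/O is about {ratio:,.0f}x slower than main memory")
```

Whatever baseline is used, the ratio between the rows stays the same. Under these assumed values it is about a thousandfold, which is the kind of gap that separates a few minutes from a few days.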
I will be talking about Apache [00:09:30] Ignite today. Certainly Apache Ignite is not the only distributed database that can be used as an in-memory layer. I’ve taken it into account because I’m representing this project. Ignite is one of the top five level projects of the Apache Software Foundation. In the Software Foundation we have more than 350 projects, and it’s used by many well-known companies. Some of those brands are on this slide. When it comes to the [00:10:00] feature set of Apache Ignite, without going into details: as a distributed database, it’s comprised of two main components. The first is multi-tiered distributed storage, and the second one is various APIs that are well integrated into that database engine. When we’re talking about the storage, Ignite by definition can keep data in memory in a scalable way.
When [00:10:30] it comes to disk, you have two options. Ignite can work or integrate with your existing system, such as Hadoop or any other data lake. It’s up to you. Or it can persist all the records in its own disk tier known as native persistence. In that way it turns into a scalable distributed database that grows beyond available memory capacity. On top of that storage, we have various APIs. For us and those who are interested in real time [00:11:00] analytics, we will be looking into SQL. SQL is the king of all the APIs, the most universal language, probably. Also we will review compute APIs for high-performance computing, because you can run your analytical code right on your cluster. We will briefly touch on machine learning. The rest is up to you. If you’re interested, check it later.
Let’s right now talk about one feasible architecture. The architecture that I’m going to walk you [00:11:30] through is not the only one, but it’s typical. Before developer relations, my story with Apache Ignite and GridGain was quite joyful. I joined Apache Ignite as an engineer, and I was contributing to the networking and storage components of that system. Then I switched my interest to the collaboration with application developers and architects. I used to work with the support organization, [00:12:00] sales engineering, and I was wearing many hats. This architecture is basically what we used to deploy with many architects working for well-known companies. You can always use something else, but that’s probably one of the gold standards, if you want to use Ignite or any other in-memory system.
What’s shown here? Let’s say we have some data lake system. It can be horizontally scalable. It can be vertically scalable. [00:12:30] It’s up to you. We are not putting any specific names here. The data lake already stores massive amounts of data, historical data that your applications generate, and your applications can be mobile [inaudible 00:12:43], it doesn’t matter really. Here we’re also introducing your in-memory layer. The in-memory layer you can deploy on top of your data lake, if you introduce any native integration for Ignite, but [00:13:00] the easiest way is just to deploy Ignite separately, decide which data you would like to move or copy from your data lake to your Apache Ignite cluster, and then just update your application layer so that the applications or your tools, BI tools, go to Ignite whenever they need to execute any queries or operations or complete any reports much faster.
The change data capture layer is optional for sure. It depends. [00:13:30] For instance, some of the applications might go and change some data sets in the data lake. If you want to transfer those changes to Ignite [inaudible 00:13:38], you have some solutions out of the box. It can be [inaudible 00:13:42], it can be Apache Sqoop and many other capabilities. It’s up to your software stack. The same is true for Ignite. If Ignite keeps some golden copy of your data set that needs to be transferred from time to time to Hadoop, [00:14:00] you can capture those changes and move them to Hadoop.
From the API standpoint, your applications that used the data lake before bringing in Ignite certainly can continue using the native APIs of your data lake environment. With Ignite, I will show you some of the APIs that are the most useful and widely used for this type of architecture. And also, what’s interesting, why do we have Spark here?
I’m showing Spark as an option whenever you need to do federated [00:14:30] queries. Let’s say you store month-old data in your data lake, and you want to merge this data with the latest weekly data that is located in memory plus on disk in the Ignite cluster. You want to merge, you want to run some federated query. Ignite doesn’t support these federated queries natively, at least not yet. Also not every data lake provides such a capability out of the box, but Spark does. Many data lakes are natively supported by Spark and [00:15:00] Ignite is supported by Spark. If you want to join two data sets, two tables, located in Ignite and the data lake, go ahead and use Spark. To give you some blueprint and to be more specific on the way of thinking… When I used to talk to those architects who were deploying this architecture, how would we decide what data set should go into Apache Ignite and what definitely must remain in Hadoop or any other data lake?
The way [00:15:30] of thinking was as follows. You can continue using your data lake, such as Hadoop or anything else, as your primary storage for all historical data, especially if we’re talking about weeks, months and years. If you have any operations that it’s okay to complete in minutes or hours, keep using the APIs of your data lake, do the batch processing. Don’t waste the time, resources and money [00:16:00] that you allocate for your in-memory cluster. As for Apache Ignite, that’s just storage for warm historical data. Warm means that you can keep the data for the latest days or for the latest week. It’s usually use case specific. I used to work with one well-known bank, and they usually store in Ignite only the data sets for the latest three days, and then, let’s say, every third day they reload the data from the data lake.
[00:16:30] Some other customers just store the latest week in Apache Ignite. That’s what happens. Also, you can always store the operational data for the current day, for the last, let’s say, minutes to hours, and you can merge all this data together. And when we’re talking about workloads, use your Apache Ignite cluster if you need to get any operations or reports to complete in seconds, milliseconds, or at most a minute or so. Usually, [00:17:00] those estimates, those requirements define real time analytics. That’s how you can take your data lake: you’re not breaking it, you’re just improving your current application stack, your business solution, by introducing an in-memory layer.
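The rule of thumb for splitting work between the data lake and the in-memory cluster can be written down as a small routing function. This is a hypothetical sketch of the reasoning in the talk, not an Ignite feature: the function name, the one-week warm window and the engine labels are all assumptions, and it also assumes the data lake holds the complete history.

```python
from datetime import date, timedelta

WARM_WINDOW = timedelta(days=7)  # assumed: the latest week is kept in memory

def choose_engine(start: date, end: date, today: date, tolerates_batch: bool) -> str:
    """Decide where a query over the date range [start, end] should run."""
    warm_start = today - WARM_WINDOW
    if tolerates_batch or end < warm_start:
        # Minutes or hours are acceptable, or only cold data is needed:
        # keep using the data lake and its batch APIs.
        return "data-lake"
    if start >= warm_start:
        # Everything needed is warm or operational data: query the in-memory cluster.
        return "in-memory"
    # The range spans both tiers: a federated query (Spark, in the talk).
    return "federated"

today = date(2021, 1, 27)
print(choose_engine(date(2021, 1, 25), today, today, tolerates_batch=False))
print(choose_engine(date(2020, 11, 1), date(2020, 12, 1), today, tolerates_batch=False))
print(choose_engine(date(2020, 12, 27), today, today, tolerates_batch=False))
```

The warm window is the use-case-specific part: the bank mentioned above would set it to three days, other customers to a week.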
That would be the least stressful solution. It exists. Just move the data and decide which operations are to be fulfilled by Ignite and which are to be fulfilled by your data [00:17:30] lake. And again, use Spark for the federated queries. You know the APIs you need to use with your data lake. Most of us have experience with Spark, so let’s briefly check the Ignite APIs that are relevant for real time analytics. I might be talking about some reports, you might be using [inaudible 00:17:51]. You might be using other BI tools, or you might have your Java or Python application connecting to Ignite [00:18:00] to query and do some calculations. What are the APIs? The first API is SQL. Ignite natively supports SQL.
It’s a standard implementation. What I want to convey with this slide is that when it comes to SQL, we are talking about SQL with joins, with grouping and data aggregation. Mostly no limitations. It’s like: allocate your data, run your query. For instance, in what this slide shows, I’m [00:18:30] trying to find the most populated cities in countries such as Canada and France. All the data is properly stored across a two node cluster. When you run this query, the query will be forwarded to those two cluster nodes. They will execute it over the local data set, and then the results would be returned to your application, and the application would do the final merge, and your application code, your clients would receive the result. This way, we are scaling your queries over the data that is [00:19:00] already properly distributed. By default, SQL APIs exist for Java, C++, .NET, Python, etc. But Ignite also comes with JDBC and ODBC drivers that are required for tools such as Tableau.
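The execution flow described for the two-node query (each node runs the query over its local data, the application merges the partial results) can be imitated with two in-memory SQLite databases standing in for the cluster nodes. This sketch does not use Ignite at all. The table layout, the query text and the population figures are illustrative assumptions, not the content of the slide and not real census data.

```python
import sqlite3

def make_node(rows):
    """Create a stand-in 'node': an in-memory database holding a slice of the data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE city (name TEXT, country TEXT, population INTEGER)")
    conn.executemany("INSERT INTO city VALUES (?, ?, ?)", rows)
    return conn

# Illustrative rows only; the populations are rounded placeholders.
node_a = make_node([
    ("Toronto", "Canada", 2_900_000),
    ("Ottawa", "Canada", 1_000_000),
    ("Lyon", "France", 500_000),
    ("Berlin", "Germany", 3_600_000),
])
node_b = make_node([
    ("Paris", "France", 2_100_000),
    ("Montreal", "Canada", 1_700_000),
    ("Marseille", "France", 870_000),
])

QUERY = """
SELECT name, country, population
FROM city
WHERE country IN (?, ?)
ORDER BY population DESC
LIMIT ?
"""

def top_cities(nodes, countries, limit):
    partial = []
    for node in nodes:
        # Each node scans only its local rows and returns its own top `limit`.
        partial.extend(node.execute(QUERY, (*countries, limit)).fetchall())
    # Final merge on the application side: re-sort the partial results and cut.
    partial.sort(key=lambda row: row[2], reverse=True)
    return partial[:limit]

result = top_cities([node_a, node_b], ("Canada", "France"), 3)
for name, country, population in result:
    print(name, country, population)
```

Taking the local top N on every node and then the top N of the merged list gives the same answer as running the query over all rows in one place, while only a handful of rows per node travel to the application.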
But there is one problem with SQL, and sometimes this problem arises when you are running some analytical operations. I’ll give you an example. Let’s say that I’m a bank, and I’m storing millions of savings [00:19:30] accounts. Every month I need to do some standard operation. I need to traverse all those bank accounts and do some calculations and apply an interest rate. If you were moving all the data with SQL, etc., from your database to that application to do some calculations and write everything back, you would be exhausting your network, and the network is slow. It’s really easy to kill all the benefits of your in-memory computing [00:20:00] database, cache, whatever, if you’re moving a lot of data. That can happen if you use unoptimized SQL that transfers a lot of data. If you pay attention to this slide, how big is the difference between disk I/O and network operations?
It’s like the difference between days and years. Yes, memory is much faster than disk, but disk is faster than network. The devil [00:20:30] is that all those in-memory computing systems or distributed databases, they are interconnected over the network. That’s why you need to know how to run and optimize your workloads properly. But that’s a topic for a different conversation. It’s so easy to do. One of the easiest ways, how you can address it… Let’s say you are doing that example, when you need to calculate, apply some interest rates to your millions of savings accounts, [00:21:00] you don’t actually need to pull or move any data from your in-memory database, to your application layer. Instead, you can create your custom task. The same way we deal with Hadoop MapReduce APIs, like Spark data frames, whatever. You write your custom logic, and you get this logic executed on your cluster node of Apache Ignite.
This bank account example. That’s a true story. I was involved [00:21:30] in that project. When the bank was moving data over the network, it was taking the bank two hours to complete that operation, just because the bank keeps dozens of millions of accounts. But when they switched to Apache Ignite, they loaded all the data in memory, and instead of using SQL, they created those computations that they were executing on the cluster. The calculation completed in 11 minutes. 11 minutes vs. two hours. The key to success [00:22:00] here was not just that the data was in memory, but that the data was in memory plus they eliminated excessive usage of the network by using these MapReduce APIs. That’s actually what you usually need for analytics. If you see that your SQL does not perform well just because a lot of the data needs to be moved around, consider [inaudible 00:22:18] computations as an option.
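The difference between pulling the accounts to the application and sending the calculation to the data can be shown with a toy model in Python. This is not the bank's code and it does not use Ignite's compute API. The `Node` class, the interest rate and the account balances are assumptions invented for the illustration, and "network traffic" is simply counted as records or tasks crossing between client and node.

```python
from dataclasses import dataclass

RATE = 0.01  # assumed monthly interest rate, purely illustrative

@dataclass
class Node:
    """A stand-in for one cluster node that owns a slice of the accounts."""
    accounts: dict  # account id -> balance

def make_cluster():
    """Two nodes, 500 accounts each, every account starting at the same balance."""
    return [
        Node({i: 100.0 for i in range(0, 500)}),
        Node({i: 100.0 for i in range(500, 1000)}),
    ]

def apply_interest(accounts):
    """The business logic: update balances in place, return how many were touched."""
    for account_id in accounts:
        accounts[account_id] *= 1 + RATE
    return len(accounts)

def pull_compute_push(cluster):
    """Client-side processing: every record crosses the network twice."""
    records_moved = 0
    for node in cluster:
        local_copy = dict(node.accounts)   # read: node -> client
        records_moved += len(local_copy)
        apply_interest(local_copy)         # compute on the client
        node.accounts = local_copy         # write back: client -> node
        records_moved += len(local_copy)
    return records_moved

def colocated(cluster):
    """Collocated processing: only the task travels, the data stays on its node."""
    tasks_sent = 0
    for node in cluster:
        apply_interest(node.accounts)      # runs where the data lives
        tasks_sent += 1
    return tasks_sent

cluster_a, cluster_b = make_cluster(), make_cluster()
moved = pull_compute_push(cluster_a)
sent = colocated(cluster_b)
print(f"pull/compute/push moved {moved} records; collocated sent {sent} tasks")
```

Both approaches leave the accounts in the same final state. What changes is how much has to cross the network, which is the effect credited in the talk for the drop from two hours to 11 minutes.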
Machine learning is the same. We use machine learning frameworks heavily. With Ignite, if you need to train on your Ignite data [00:22:30] or you need to deploy any models, Ignite natively supports machine learning APIs, and it’s also integrated with some of the tools. Check it out, and you will be able not only to run SQL or compute over your in-memory data in Ignite, but also to leverage machine learning models.
Closing words. We had only 30 minutes and I want to keep some time left for the Q&A. [00:23:00] It [inaudible 00:23:02] analytics. What are the primary challenges? We all know those challenges, and I don’t want to be Captain Obvious here. Massive volumes of data are stored, there are many queries, and for a lot of the operations that are executed these days, businesses require them to execute in real time.
Real-time means, usually, in my case, when I used to work with our architects and developers, we are talking [00:23:30] about seconds, minutes and milliseconds. How can we improve our data lakes, that are already an excellent place where we can keep all the data? As an option, you can use in-memory computing. You don’t need to break things. You can just seamlessly and smoothly integrate your in-memory computing layer. For instance, you can consider the architecture that I showed you before, and once you do this, decide what data you need to move to your in-memory cluster. And that usually depends [00:24:00] on the operations you need to accelerate. After that, take advantage. As for the APIs, I shared with you some of the most prominent APIs that are used for real-time analytics, at least with Apache Ignite. If you need to know more, you can explore. How can you explore Apache Ignite? That’s the website located on the Apache infrastructure. Also there is some from GridGain, the company I’m working for, the company that donated Ignite. We are providing [00:24:30] some special integration layer for data lakes.
As for the community: if you are curious and you’d like to contribute, and by contributing you want to learn how to develop and build distributed databases that work across memory and disk, how to develop distributed SQL engines and machine learning engines, come to our community. We are looking for contributors who want to gain more knowledge on that and become more [00:25:00] professional in that direction. Having said that, thanks for your time, and we are switching to the Q&A.
All right. Great. Thanks Denis. Fantastic presentation. If you’d like to ask Denis a question, you can either click on the button to activate your audio and video, or you can go ahead and ask it in the chat, and I will ask it to Denis myself. All right. First question. Does Ignite [00:25:30] support delete/update operations or append only?
Yes, it does. It does support update, delete, insert, and also, in addition to SQL APIs, you can use [inaudible 00:25:45] APIs. It’s like get put requests.
Second question. Does the SQL API overlap in functionality with something like Dremio in terms of accelerating querying data [00:26:00] lakes?
I think that those are just two different options, and probably they do not even compete. They just have something in common, at least, I guess, from the SQL syntax perspective, but from the usage perspective, there might be different scenarios. I was involved in many POCs and we have never come across Dremio. They’re different for different use cases.
Yep. [00:26:30] From the Dremio side, I can say the same thing. Definitely applied to different use cases. All right. Next question. What are the benefits of Apache Ignite in comparison to SAP HANA, besides being free?
The benefits of that? Yes, that’s an excellent question. Ignite is horizontally scalable. Correct me if I’m wrong, but with HANA there are some scalability [00:27:00] specificities. With Ignite, for instance, if you talk about Apache Ignite, you can run it on any hardware you like, in any fashion. Let’s say, one of the customers, if I have to throw in some names: Microsoft is using Ignite for Azure infrastructure, and they’re running four big clusters. Each cluster is comprised of 400 nodes. At the same time we have other customers who are running just 20 or 30 node clusters. The beauty here is that the customers are not required [00:27:30] to migrate from one big machine appliance to another one. With Ignite you just scale out. Also, how does it compare… Storage engine: I think they’re quite comparable. I was talking about in-memory here, but Ignite can scale beyond memory capacity. Check the transactional capabilities of Ignite and check how SQL compares, et cetera. Honestly, I was [00:28:00] checking feature by feature. I was doing a feature by feature comparison with SAP HANA, probably like two years ago. Check and tell me what you think. You can always reach out to me later.
Yeah. I’d be curious about the answer there too. All right. Any more questions guys? We’re just about at time anyways. One more. Could you share some best practices to tune high running queries in Ignite?
[00:28:30] High running queries and Ignite? Probably you’re talking about SQL. To do that, let me just give you that pointer really quick. Start with the technical documentation that we have for Apache Ignite and SQL in particular, and then join our meetup and watch our YouTube channel, because we have already recorded a lot of the sessions related to not only SQL optimization, but other APIs. Also [00:29:00] there are usually many more on schedule. If you just sign up for the meetup, we have virtual Ignite meetup that is run by community members. That’s probably what you need to start with. If you are asking about SQL that’s the link.
Thanks Denis. That’s all the questions we have time for today. If we didn’t get to your question, you will have the opportunity to ask it in [00:29:30] Denis’ speaker channel in the Subsurface [inaudible 00:29:33]. Before you go, we’d appreciate it if you would please fill out the super short Slido session survey. I can get that in the right order. In the top right there. That’s S-L-I-D-O, just click on that tab. It’s super quick, and it provides us very valuable feedback on the type of content you’re interested in for the next conference. The next session is coming up in just five minutes. The expo hall is also open. I do encourage you to check out the booths to [00:30:00] get some demos on the latest technology and win some awesome giveaways. Thanks everyone, and enjoy the rest of the conference.
Thanks a lot. Bye-bye
And thank you to Denis. Great session.