March 2, 2023

11:45 am - 12:15 pm PST

From Business Intelligence to Streaming Decisions

Twenty years ago, we had a firm desire to support customers with business intelligence and open source solutions. At that time, there was opposition to the very existence of such solutions, as well as proprietary software lobbyists focused on curbing free software at all costs.

Today, the open architecture of the data lakehouse allows our customers to keep real control of their data and empowers users to take ownership of that data for their daily decisions.

Modern business processes are increasingly designed for analytics: operational databases are replicated to the data lake via Debezium, through Apache Kafka, into Apache Iceberg tables, where they are consolidated and queried with Dremio. Such an infrastructure opens new perspectives for taking full advantage of open architectures.

Session ID: BO109

Topics Covered

Lakehouse Architecture

Sign up to watch all Subsurface 2023 sessions

Transcript

Note: This transcript was created using speech recognition software. It may contain errors.

Charly Clairmont:

Okay. I’m Charly. I co-founded an IT services company in 2004, Synaltic. Our mission is to help our customers create data-intensive applications. I will introduce myself a bit more in a moment. Next slide. I want to thank Dremio and the Subsurface team for accepting my talk proposal. I’m very happy to be here.

Today I want to share some views and opinions about data platforms and data projects. I will explain quickly how we started. I will focus on open source, because for me it is where innovation is born. I will then focus on architecture and show how a data architecture can be simple. And finally, I will show you an example of an architecture with Dremio to deliver your insights with a streaming approach. That, by the way, is the theme of my talk: from business intelligence to streaming decisions.

To introduce myself: I started in a sports media startup. Early in that career I quickly faced heavy load on the company’s website and on the applications we put into production. After that, I developed Java applications and worked with enterprise solutions. And finally, after three years of that short career, we started our company, Synaltic, focused mainly on business intelligence and data integration. Today we are working with the French Ministry of Youth and Sports to create the first global directory of associations and foundations data.

Okay, how we started. Since the beginning of Synaltic, we wanted to deliver performant BI architectures with respect for best practices. I spent some weeks understanding all the concepts that Ralph Kimball introduced with building the data warehouse. Today, the data warehouse is not based on the same architecture as before; Kimball’s modeling is not so suited to the cloud data warehouse. In fact, a benchmark about the star schema was shared showing that it is not always appropriate in a cloud data warehouse. But we have to keep in mind that Kimball presented concepts you still have to know to be relevant in your data platform and data organization. In the beginning, we met some consultants from the big four who explained to us that open source business intelligence would not exist.

They were very clear about that. Yet around 2004 to 2007, many open source business intelligence components were available to deliver data projects. At Synaltic, we wanted our customers and their users to be educated in the data culture to get more and more value. So open source gave us this ability to focus on the technology, but also on the culture of data. In fact, platforms like SpagoBI, JasperServer and Pentaho packaged open source components to offer this global answer to our customers’ needs. A big thanks to Julian Hyde, who built Mondrian; he is also the creator of Apache Calcite, which is one of the most embedded libraries for building SQL engines today. Today we talk a lot about data mesh, but we came from software development and integration, so organizing and planning a project by domain, with dependencies and contracts, has been our way of running data projects since 2012. We use continuous integration to build our data projects; for us, testable software is very helpful for achieving such an approach.

Hmm, okay. Open source is one of the game changers, but we can also observe that the separation of compute and storage has finally been introduced, and the cloud brings us the flexibility we need. You go to the cloud, you take your services, and voilà, cloud storage. With the new data formats it is also cheaper, and it has become a universal storage today. We could also say that the performance of cloud data storage is impressive. Streaming also gives us a way to rethink how we transport data. Self-service is a key factor, and Tableau opened the door to self-service. Now the motto is to give business users control of the data.

Okay, things have changed today, and not all of it is simple: there are many solutions, tools and methods. We are very resilient in our activity, but overall it is quite complicated. Data federation, as Dremio introduced it, is a game changer too; with its acceleration approach it reinvents how we can manage our data projects, and it is a foundation for the data mesh. All members of the data community of an organization can fully achieve their objectives with domain spaces and virtual datasets, which play the role of the data contract. Dremio proposes a cool refresh of data management. If we face the objective of a data-driven organization, giving business users control of the data, then data is ultimately placed at the heart of the information system. In fact, data is the new oil, and that kind of architecture is a simple architecture.

Now, with the lakehouse, Dremio opens a new perspective. Since last week, on the homepage of Dremio, we have this kind of schema; I received it like a present. [inaudible] It gives us a perspective on how Dremio becomes versatile: it does not replace the components of your data management stack. Instead, it lets them accelerate, share and secure data access. And you can also now, with Apache Iceberg, build some ETL pipelines and create a 360-degree customer data platform. You can provide verified datasets for your machine learning. By the way, this allowed us to have a very simple architecture to build our data platform.

Okay, okay. That was the theory; now let’s talk about practice. I will give here some feedback on how we use these tools. I will start with a simple BI architecture as a first stack. I will continue with a use case where we need to manage historical data. And I will end with how you can employ Dremio in a streaming approach. Here, our customer developed an in-house application based on four main domains, but they could not get a global vision of their activity. They have four PostgreSQL databases; with a simple setup, Dremio in the middle plus the object storage, and voilà, in one month our customer got the insights they wanted to follow. Our customer also opened the lakehouse to its project owners, to build a data-driven approach in their activity.
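For illustration, here is a minimal sketch of how one of the customer’s PostgreSQL domains could be registered as a Dremio source over Dremio’s REST API; the endpoint paths, config keys, hostnames and credentials are assumptions and may differ by Dremio version.

```python
# Sketch: federating one of the customer's PostgreSQL databases through Dremio.
# Assumes Dremio's REST API (v3 catalog endpoint); paths, field names and the
# hostnames below are illustrative and may differ by Dremio version.
import requests

DREMIO = "http://dremio.example.local:9047"  # hypothetical coordinator URL

# 1) Authenticate and build the Authorization header Dremio expects.
login = requests.post(f"{DREMIO}/apiv2/login",
                      json={"userName": "admin", "password": "secret"})
headers = {"Authorization": f"_dremio{login.json()['token']}"}

# 2) Register one PostgreSQL domain database as a Dremio source.
source = {
    "entityType": "source",
    "name": "orders_domain",                 # hypothetical domain name
    "type": "POSTGRES",
    "config": {
        "hostname": "pg-orders.example.local",
        "port": "5432",
        "databaseName": "orders",
        "username": "dremio_reader",
        "password": "secret",
        "authenticationType": "MASTER",
    },
}
resp = requests.post(f"{DREMIO}/api/v3/catalog", json=source, headers=headers)
resp.raise_for_status()
print("Source created:", resp.json().get("name"))
```

Repeating the same call for each of the four domain databases would give Dremio the federated view described above.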

Okay, another stack. We started with external queries, close to Drill, [inaudible] to continue providing the same queries and the same results the customer already had with their last analytical pipeline. To manage state, we need to handle historical data. We first relied on the CREATE TABLE AS (CTAS) command, and because we need to schedule those pipelines, we created an Airflow operator for Dremio. And now we have started developing Talend components on top of the JDBC bridge; the tests are conclusive. Okay, here are some screenshots of the operator. Now I will show you a demo of that stack: imagine you have a retail store and you need a view of your latest inventory. You could in fact use that stack, with Debezium, Kafka and Kafka Connect, to deliver your insights. Okay, quick demo.
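As a sketch of the scheduling idea only (the custom operator shown in the talk is not reproduced here), the DAG below submits a CTAS statement to Dremio’s SQL REST endpoint from a plain Airflow PythonOperator; the dataset paths, URL and credentials are illustrative assumptions.

```python
# Sketch of scheduling a Dremio CTAS pipeline with Airflow (Airflow 2.4+ style).
# A plain PythonOperator calls Dremio's SQL REST endpoint; names are illustrative.
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

DREMIO = "http://dremio.example.local:9047"  # hypothetical coordinator URL

def run_ctas():
    """Materialize an inventory snapshot as a new physical dataset."""
    login = requests.post(f"{DREMIO}/apiv2/login",
                          json={"userName": "airflow", "password": "secret"})
    headers = {"Authorization": f"_dremio{login.json()['token']}"}
    sql = """
        CREATE TABLE lake.history.inventory_snapshot AS
        SELECT * FROM lake.raw.inventory
    """  # hypothetical dataset paths
    resp = requests.post(f"{DREMIO}/api/v3/sql", json={"sql": sql}, headers=headers)
    resp.raise_for_status()
    print("Submitted Dremio job:", resp.json()["id"])

with DAG(
    dag_id="dremio_inventory_snapshot",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="ctas_inventory_snapshot", python_callable=run_ctas)
```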

Okay, here I start. I show you the store; it’s Magento 2, as you understand, the demo store of Magento 2. We will connect on the back end and go to the product page to check the quantity we already have. In the scenario, we imagine a warehouse manager shares with us the latest inventory file, so we have to upload it into our Magento 2. But before that, I will explain and show you the pipeline. Okay.

Okay. So first, Debezium will listen to all the changes in the inventory table; after that, Debezium will deliver the data to Kafka. Finally, with a special Kafka Connect connector whose name is Kafka Connect Iceberg, we will send the data to the cloud storage here. We use MinIO for the example. Okay, now we try to show how we do that. So we import the inventory file into Magento 2 and choose the add/update feature. Now we check whether the file is of good quality. Okay, it’s okay. So now we can start the import. We fill the table with the new values, and we check that we have the right quantity, the new quantity, in Magento 2. Okay, we have it now, so three, okay? It’s just an example. And we can continue now and check whether we have the data in Kafka and validate that the components work fine. Okay, we go to the messages once again.
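For illustration, a minimal sketch of how the Debezium source watching the Magento inventory table could be registered through the Kafka Connect REST API; the hostnames, credentials, table name and exact property keys (which vary across Debezium versions) are assumptions.

```python
# Sketch: registering the Debezium source that watches the Magento inventory table.
# Posted to the Kafka Connect REST API; hostnames, credentials and exact property
# names (which differ across Debezium versions) are illustrative.
import requests

CONNECT_URL = "http://kafka-connect.example.local:8083/connectors"  # hypothetical

debezium_config = {
    "name": "magento-inventory-cdc",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "magento-mysql.example.local",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "secret",
        "database.server.id": "5400",
        # Only the inventory stock table is captured for this demo (assumed name).
        "table.include.list": "magento.cataloginventory_stock_item",
        # Topic naming / schema history settings (Debezium 2.x style keys).
        "topic.prefix": "magento",
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": "schema-history.magento",
    },
}

resp = requests.post(CONNECT_URL, json=debezium_config)
resp.raise_for_status()
print("Connector registered:", resp.json()["name"])
```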

Okay, we get it. So in the message we can see that we have the value before, here, and in the payload of the Debezium message we have the new quantity. Okay. So we continue the pipeline now with Kafka Connect Iceberg, and we go to the object storage. So we are in the MinIO console, and we can browse the content. Here we are: we have our data here, and we recognize the Iceberg organization, so we have the data, the Parquet files and the metadata. So the connector works fine. And now we go to Dremio, and with Dremio we will format the table. We go to the data lake source, we browse our content, we format the table to promote it as a dataset, and voilà, we can get our data now inside Dremio. What is interesting here is that we set up a quick environment, a quick stack, where we can get insight from the data. Oops,
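As a sketch of what the demo inspects in Kafka, the consumer below reads the Debezium change events and prints the before/after quantities; the topic name, brokers and field names follow the connector sketch above and are illustrative assumptions.

```python
# Sketch: inspecting the Debezium change events for the inventory table, to see
# the "before" and "after" quantities shown in the demo. Names are illustrative.
import json

from kafka import KafkaConsumer  # kafka-python

consumer = KafkaConsumer(
    "magento.magento.cataloginventory_stock_item",  # <topic.prefix>.<db>.<table>
    bootstrap_servers="kafka:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for record in consumer:
    event = record.value.get("payload", record.value)  # envelope may be flattened
    before = event.get("before") or {}
    after = event.get("after") or {}
    print(f"op={event.get('op')} "
          f"qty before={before.get('qty')} -> after={after.get('qty')}")
```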

Sorry, okay, sorry. What you could keep in mind is that you can simply build an architecture to stream your decisions; it is very simple now to assemble that kind of stack. And then the JDBC bridge can also be used to feed the data directly, and Dremio will also be a kind of proxy to your Iceberg tables, something we always test with a Talend component. So with all we presented today, we will keep focusing on helping Dremio grow its ecosystem, already with connectors like the Airflow operator and the Talend components, and also the Kafka connector that is ingesting the data; it will be interesting to have that to stream your data. Okay. Thanks a lot, everyone, for your attention. And thanks again to Dremio and the Subsurface team for their support and this nice organization.
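For illustration, a minimal sketch of querying Dremio over JDBC from Python with jaydebeapi, treating Dremio as a proxy to the Iceberg table promoted in the demo; the driver jar path, host, credentials and dataset path are assumptions, and the Talend components mentioned in the talk play the same role over this bridge.

```python
# Sketch: using Dremio's JDBC bridge from Python (via jaydebeapi) to query the
# Iceberg table that Dremio proxies. Jar path, host, credentials and the dataset
# path are illustrative.
import jaydebeapi

conn = jaydebeapi.connect(
    "com.dremio.jdbc.Driver",
    "jdbc:dremio:direct=dremio.example.local:31010",  # hypothetical host
    {"user": "admin", "password": "secret"},
    "/opt/drivers/dremio-jdbc-driver.jar",            # path to the Dremio JDBC jar
)

curs = conn.cursor()
try:
    # Hypothetical path of the Iceberg table promoted in the demo.
    curs.execute("SELECT product_id, qty FROM lake.inventory LIMIT 10")
    for row in curs.fetchall():
        print(row)
finally:
    curs.close()
    conn.close()
```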
