Subsurface Summer 2020
Reducing Time to Market with S7 Airlines Self Service Data Platform
S7 Airlines is the largest private airline in Russia, with the most modern fleet in the Russian air transit market. This session highlights how S7 reduced its product time to market and advanced data democratization by using cloud-based solutions and data governance in its own private cloud. Specifically, the presenter will detail how S7 reduced the amount of time consumed by discovery and preparation operations by consolidating data sources and enabling self-service data virtualization while increasing control over data access. Attendees will also learn how S7 enables data science teams to leverage the raw data stored in the data lake, eliminating the need to wait for data marts to be delivered in order to gain access to data.
Areg Azaryan, Enterprise Data Platform Product Owner, S7 Airlines
Areg Azaryan is a product owner for S7 Airlines, and is responsible for the enterprise data platform, which includes the data lake, data catalog, data bus and data virtualization models. He has over 8 years’ experience in analytics and data management in retail, banking, and aviation. He began his career as a business analyst/BI developer and became head of sales for a Top 2 FMCG retail company in Russia, in charge of forecasting development. He also worked for a bank as head of enterprise data warehouse, and was responsible for building EDW, data catalog and data governance policies within the organization.
Hi, everyone. Thanks for joining us for this session. Just a quick reminder that we will have live Q&A after the presentation. We do recommend activating your microphone and camera for the Q&A portion of the session. With that, please join me in welcoming our next speaker, Areg Azaryan, Enterprise Data Platform product owner from S7 Airlines. Let's get started. Areg, over to you.
Thank you, Louis. Hello, everyone. I would like to share a short presentation about what we did here at S7 to reduce time to market for our products, using our Self-Service Data Platform on a private cloud. Let's get started with an introduction of our company. You can see some numbers here. S7 is the second largest airline in Russia. We had 80 million passengers and 175 per day in 2019. You can guess why I didn't include numbers for 2020. We had over 1,000 routes and 160 countries, and a fleet of over 100 aircraft. We have our own maintenance, training, and cargo companies and have been operating since 1992. As you can see, it's a pretty big business and it's a complex one. We have a lot of external business-critical systems and must exchange a lot of data with them. We also have to do a lot of analysis in order to keep track of our business, make personalized offers, improve passenger experience, plan new routes, do predictive maintenance, use dynamic pricing, et cetera.
Around 400 people in S7 work with data: business analysts, data scientists, data engineers, and managers. Of course, there were some inefficiencies in the process of working with data. We started to investigate that. We found out there are some main difficulties in those processes in our organization. You can see them on the screen. There was no place to easily find information about available data. There were some bottlenecks in providing data for users, like streaming data, pipeline creation, data mart creation, or BI reporting. Teams providing those kinds of services were often overwhelmed with tasks from our data users and did not always manage to provide results on time, so we wanted to help them. We also have a lot of sources and data warehouses. I'm going to talk about them further. There was a long data access approval process. We also didn't have a centralized, governed storage for all the data we receive from external systems. Also, there was no centralized storage for streaming data archives. Those archives are kept in the products' local storage, which is not shared between teams.
This was leading to duplicated data in various local storages, FTP servers, and data warehouses. Next, we also tried to build a kind of data user journey and get some information about timelines for each activity. Our basic process starts with searching for the data within our organization. You can refer to documentation, or you have to reach out to people you know in order to find some info. There is no catalog of data assets and knowledge, and no list of people responsible for each data domain or dataset. This activity can take up to three days before you find the data you need.
Once you know what dataset you need, you go and ask for access to that data. The access approval process can sometimes take up to two weeks if you need some protected data or are trying to access some critical system. After you have access to the data, you may want to discover its structure and check whether that data fits your needs. In some cases you have to return to previous steps if that's not the data you need. That can be really painful. In some cases, you have to do some preparation for your data. This can also be combined with transporting the data to your system. These two things combined can take two and a half weeks.
Often you need to control the quality of the data you receive. This might take up to weeks to develop all the rules and monitoring you need, including the requirements gathering. After that, you use your data and maybe want to store it within your system, but you do not always want to store the whole archive. Maybe you need only a portion of the data to do your daily calculations but don't want to lose history. Not every team has large storage for that. Looking at this scheme, we started to think about some solutions within the Enterprise Data Platform we planned to build, to help optimize those activities. Let's go through them one by one.
If we think about the data catalog, it can help us minimize the time needed to find data within our IT landscape. The data lake can help us store archived raw data from our external sources, as well as some historical data from our product storages and the streaming data archive. If we store the data in the format needed and make it easy to access, our data scientists can go straight ahead and use the raw data they need without waiting for the data to be prepared. We also have a data as a service model. It's needed to provide easy access to the data in our data lake and other sources for data discovery, preparation, and ad hoc usage. If users want to deliver data to a system, they can use our automated data bus with a set of predefined data quality checks. Of course, we need a data governance process to set out the rules of working with data within our platform.
Let's take a look at the architecture we have now. We can start from the sources we have in S7. We have a lot of heterogeneous sources. We have internal business-critical systems, and we often use change data capture, streaming, and snapshots from their databases. There are also a lot of business-critical systems, like the booking system, departure control system, passenger service system, ticket sale aggregators, et cetera, which are located externally, and we exchange data with them either through services or in various files sent to FTP servers. It can be batch or streaming data.
We also have our good old enterprise data warehouse with operational reporting and some domain data warehouses, because we're in the process of switching to the data mesh concept. For example, we have already built a customer domain data warehouse and a flight domain data warehouse. We mainly receive metadata from them and use them as live-connect sources.
If you move forward, all of the data then enters our platform, but before it goes to any system, there's a set of data governance rules that must be applied to it. For example, all of the data which goes to the data lake or data bus must be covered with some business metadata, like the data owner, the data steward, whether it's sensitive or personal data, and which data domain and sub-domain it refers to. We also cover it with some technical metadata, like the format of the data, parent system, load time, et cetera. All of that metadata must be documented and then passed to the data catalog. There's also an important policy, the access control policy, which helps us simplify granting access to our users. As I said, any data which comes into the system must be related to some business domain.
When a user asks for access to our platform, they specify some domains of interest. If they want to see sensitive data, they also provide approval from the data owner and the information security team. When they get the access, they see all of the data in the platform which we have about the domain. No matter the source system, once data is added to the domain, it is already available to them. This simplifies the process of gaining access for our users dramatically. We also have some naming conventions, other convention rules, and data quality processes, which help us keep our platform governed.
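The domain-based access rule described here can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Dataset` and `User` classes, the field names, and the sample catalog are hypothetical and are not S7's actual implementation. The sketch shows that the access decision depends on the dataset's business domain and sensitivity flag, never on the source system.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dataset:
    name: str
    domain: str            # business domain the dataset is registered under
    source_system: str     # where the data lives; never used for access decisions
    sensitive: bool = False


@dataclass
class User:
    name: str
    # Domains of interest that were granted on request (hypothetical field).
    domains: set = field(default_factory=set)
    # Domains for which data owner and security approval exists (hypothetical field).
    sensitive_domains: set = field(default_factory=set)


def can_read(user: User, dataset: Dataset) -> bool:
    """Access follows the business domain, not the source system."""
    if dataset.domain not in user.domains:
        return False
    if dataset.sensitive and dataset.domain not in user.sensitive_domains:
        return False
    return True


def visible_datasets(user: User, catalog: list) -> list:
    """Names of all datasets the user can read, in catalog order."""
    return [d.name for d in catalog if can_read(user, d)]


catalog = [
    Dataset("bookings_raw", "customer", "data_lake"),
    Dataset("loyalty_profiles", "customer", "customer_dwh", sensitive=True),
    Dataset("flight_legs", "flight", "flight_dwh"),
]
analyst = User("analyst", domains={"customer"})
print(visible_datasets(analyst, catalog))  # ['bookings_raw']
```

With this shape, a dataset added later to the customer domain from any source system becomes visible to the analyst without a new access request, which is the simplification described above.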
Let's move forward. Once all policies are met, data from the live-connect sources goes to our data as a service model. Our data as a service model, as you can see, is implemented with Dremio. Dremio provides us with data virtualization features, which can connect to our various sources without downloading data to some storage.
This helps us to easily access, discover, and ad hoc join data in one place. All a user has to do is get access to the data as a service model, and they have all of the sources in the palm of their hand. This data as a service model is also used as the distribution point for our data lake, domain data warehouses, and enterprise data warehouse. They distribute data to users through it because not all of them have their own analytical tooling, and you can connect to Dremio using various interfaces like the web UI, the Tableau connector, and other connectors like ODBC or JDBC. You can even work with Dremio through its API. All infrastructure in our data as a service model is aligned to our business data domain model.
Next [inaudible 00:12:39] raw data. This goes to our data lake. As you can see, our data lake is based on MinIO. MinIO is an open source, S3-compatible object storage, and we also use OpenShift as a container platform. In S7, we planned to switch from our old Hadoop clusters to S3. But in Russia, we have a regulation according to which we cannot store personal data of our citizens in foreign countries. We also have a lot of strict policies from our information security team. That pretty much summarizes why we chose to use private cloud-based solutions like MinIO and OpenShift. In the data lake, we also have a strict structure of folders which corresponds to our naming policies and the domain data model. We store our data with minimal-loss transformations and tend not to create complex data marts. Instead, we make sure we store data in an easy-to-access format. Mostly, we use Parquet files. For every dataset we have in the data lake, we also store a file with metadata which describes it.
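As a rough sketch of what a strict folder structure plus a metadata file per dataset could look like, the Python below builds object keys from the domain model and serializes a metadata sidecar. The key layout, the `_metadata.json` file name, and the field names are assumptions made for illustration. The talk only states that folders follow the naming policies and domain model, and that each dataset has a metadata file carrying business and technical metadata such as owner, steward, sensitivity, domain, format, and parent system.

```python
import json
from pathlib import PurePosixPath

# Fields named in the talk; the exact spelling of the keys is an assumption.
REQUIRED_FIELDS = (
    "data_owner", "data_steward", "sensitive",
    "domain", "subdomain", "format", "parent_system",
)


def object_key(domain: str, subdomain: str, dataset: str,
               load_date: str, part: int = 0) -> str:
    """Build an S3-style object key that mirrors the domain model (hypothetical layout)."""
    return str(PurePosixPath(
        domain, subdomain, dataset,
        f"load_date={load_date}", f"part-{part:05d}.parquet",
    ))


def sidecar_key(domain: str, subdomain: str, dataset: str) -> str:
    """Key of the metadata file stored next to the dataset (hypothetical name)."""
    return str(PurePosixPath(domain, subdomain, dataset, "_metadata.json"))


def build_sidecar(metadata: dict) -> str:
    """Serialize dataset metadata, refusing records with missing required fields."""
    missing = [name for name in REQUIRED_FIELDS if name not in metadata]
    if missing:
        raise ValueError(f"missing metadata fields: {missing}")
    return json.dumps(metadata, sort_keys=True, indent=2)


meta = {
    "data_owner": "flight-ops", "data_steward": "j.doe", "sensitive": False,
    "domain": "flight", "subdomain": "schedule",
    "format": "parquet", "parent_system": "departure_control",
}
print(object_key("flight", "schedule", "flight_legs", "2020-07-01"))
print(sidecar_key("flight", "schedule", "flight_legs"))
print(build_sidecar(meta))
```

Rejecting incomplete metadata at write time is one simple way to enforce the rule that nothing lands in the lake undocumented.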
Then there is streaming data. The streaming data goes to our data bus. We use Kafka with an important framework we've built. The main purpose of this framework was to automate every task related to Kafka and integrate it with our service desk. You can go to the service desk and just create tickets for topic creation, access granting, and monitoring, and the ticket is resolved and then automatically applied to the cluster. We had some experience with Kafka in S7, and there were a number of problems mentioned by our users, like slow access granting, uncategorized topics, and no catalog or additional information about existing topics. We used that knowledge and experience to build a more robust solution.
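To illustrate the idea of turning a service desk ticket into an automatically applied change, here is a minimal Python sketch. Everything in it is hypothetical: the `TopicRequest` fields, the `domain.subdomain.name` naming pattern, and the shape of the resulting plan are assumptions, and no real Kafka or service desk API is called. It only shows the validation step that would let a ticket be applied without manual review, while also keeping topics categorized by domain.

```python
import re
from dataclasses import dataclass

# Hypothetical naming convention: domain.subdomain.name, lowercase with underscores.
TOPIC_PATTERN = re.compile(r"^[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*$")


@dataclass
class TopicRequest:
    ticket_id: str
    topic: str
    partitions: int
    owner: str


def validate(request: TopicRequest) -> list:
    """Return a list of human-readable problems; an empty list means the request is valid."""
    errors = []
    if not TOPIC_PATTERN.match(request.topic):
        errors.append("topic must look like domain.subdomain.name")
    if request.partitions < 1:
        errors.append("partitions must be at least 1")
    if not request.owner.strip():
        errors.append("owner is required")
    return errors


def to_plan(request: TopicRequest) -> dict:
    """Turn a valid ticket into a declarative plan that an automation job could apply."""
    errors = validate(request)
    if errors:
        raise ValueError("; ".join(errors))
    domain, subdomain, _name = request.topic.split(".")
    return {
        "action": "create_topic",
        "topic": request.topic,
        "partitions": request.partitions,
        # Tags keep topics categorized and traceable back to the ticket.
        "tags": {"domain": domain, "subdomain": subdomain,
                 "owner": request.owner, "ticket": request.ticket_id},
    }


request = TopicRequest("SD-1024", "flight.schedule.leg_updates", 3, "flight-ops")
print(to_plan(request))
```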
All the messages from the data bus are archived to our data lake in order not to lose them and to be able to keep history in case we need it. The last, but not least important, one is the metadata. It goes to our data catalog. Actually, the data catalog is the starting point where all users should go. It's a compass to find your data. It's a robust tool for the enterprise, and we're in the process of choosing the solution that will fit us right. That's why you see a question mark here. Although there is no actual system right now, we started to collect metadata for our catalog from the beginning of our platform.
Once we have a system in house, there will already be some info to work with. We're also making an effort to collect all the metadata needed to build technical lineage, because current solutions can't automatically cope with some of the complex pipelines we have. We want to enrich the data lineage created automatically by the data catalog. On the right-hand side, you can see the kinds of consumers we have. We, of course, have our beloved users who use various analytical tools. We also use Tableau heavily in our organization. You can also see that there is a set of data domains in the enterprise data warehouse, which are also our consumers. They can store all of their data in our data lake, or even use the data lake plus our data bus to organize a Lambda architecture using our services.
We've been working with our Enterprise Data Platform since April. We already have some interesting business cases we've solved. Our marketing team, alongside our customer-centric platform team, uses the data lake and data as a service to prepare personalized offers for our passengers.
Our marketing team also uses the data lake to store survey data and data as a service to prepare customer satisfaction monitoring reports. All of the data from our oil management system is stored in the data lake, which helps us build analytics and save money on storage. In an interesting case during the COVID-19 pandemic, data as a service helped us a lot. We had to find new domestic routes to develop because all international flights were closed. It could easily have taken about a month to create pipelines, bring all the data together, and create some data marts, because the data was stored in three or four systems. But our users managed to do all the analytics in a day because of the data virtualization.
We also centralized our binary aircraft data, site and application data in the data lake, and this helped us reduce our storage costs. I, of course, want to mention that data virtualization itself reduced the time taken to do ad hoc analysis for the various business analysis teams in our organization.
On average, we saw that work related to data search, data discovery, and ad hoc analytics was reduced in duration by 15% to 20%, and this leads to reduced time to market for our products, because most products, if not all of them, need data to function properly.
In conclusion, let me give you some suggestions based on the experience we had when building our platform. First of all, I would suggest involving your users early. Of course, you need to talk a lot with your users when collecting requirements, but you must not stop at that. You need to try to help your users solve as many business cases as you can in each sprint or release. Your users must get hands-on experience with your solution as soon as possible. This will give you lots of insights, about features for your platform, fixes you need to make, or even whether you are going in the right direction with the solutions you provide. This will also help you spread awareness about your product in your organization.
Second, I would suggest that you govern even self-service. If you don't want your platform or data lake to turn into a data swamp, you need to add some governance. This is obvious, but we also knew that if we added more governance, we would make our platform less flexible. We didn't over-govern. We just gave our users some simple one-page manuals and rules about how they must work within our platform. We also included that in our training program.
One more important thing: we decided not to create a lot of policies and rules up front and only afterwards check how everything works. Instead, we decided that first we see how our users work, and then we add some rules to guide them in the right direction.
The third one, keep track of the [inaudible 00:21:39]. Measure how many users you have, how often they use your platform, how much data they store or download from your system, how many jobs they run, or how much CPU and RAM they use. Run user surveys, collect feedback, and ask your users how you've helped them solve business cases. Don't forget to calculate your costs in a timely manner. If you do that, you will clearly understand the business value you provide and you will know how your product must evolve.
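The measurements listed here can be rolled up with very little code. The sketch below is an assumption-laden illustration: the event records and their field names (`user`, `jobs`, `bytes_read`) are hypothetical, and a real platform would pull them from job logs or audit tables. It aggregates per-user activity into a few platform-level numbers.

```python
from collections import defaultdict


def summarize(events):
    """Aggregate usage events (dicts with 'user', 'jobs', 'bytes_read') into platform metrics."""
    per_user = defaultdict(lambda: {"jobs": 0, "bytes_read": 0})
    for event in events:
        stats = per_user[event["user"]]
        stats["jobs"] += event["jobs"]
        stats["bytes_read"] += event["bytes_read"]
    return {
        "active_users": len(per_user),
        "total_jobs": sum(s["jobs"] for s in per_user.values()),
        "total_bytes_read": sum(s["bytes_read"] for s in per_user.values()),
        # User with the most jobs; None when there were no events at all.
        "top_user": max(per_user, key=lambda u: per_user[u]["jobs"]) if per_user else None,
    }


events = [
    {"user": "anna", "jobs": 4, "bytes_read": 1_000},
    {"user": "boris", "jobs": 1, "bytes_read": 5_000},
    {"user": "anna", "jobs": 2, "bytes_read": 500},
]
print(summarize(events))
```

Tracking the same summary per week or per release makes it easier to see whether adoption is growing and to compare it against platform costs.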
That's all for my short presentation. You can see my contacts if you need to discuss something later, and I'm ready to answer your questions. Thank you.
Okay. Thanks, Areg. Now's your chance to ask your questions live. If you do have a question, please use the button in the upper right to share your audio and video. You'll automatically be put in a queue. Let's see who we have here. It looks like our first question, unfortunately... Jacon, if you want to try again, feel free to go ahead and add yourself to the queue again. Just again, there's a button in the top right there. That'll enable you to share your video and your sound. There we go. Hi, Krish. You want to go ahead and ask your question?
Hi. I was a little bit impatient. I was just wondering [inaudible 00:23:28].
Yeah. In our organization, MinIO is provided as a part of our platform, which provides infrastructure as a service. Based on that, the overall storage right now in user space is 40 terabytes, and the data lake uses about 20 terabytes of that. That's the capacity.
Thank you everyone. Bye.