9 minute read · August 15, 2019
Five Innovative Approaches To a Modern Data Platform
Data Lake Platform Owner, Raiffeisenbank
This is the third article in the series on building the Modern Data Platform. The first article was about understanding the modern data platform and its objectives. The second article focused on the challenges of the lift-and-shift approach and the benefits of cloud elasticity. In this article, I cover many other aspects of the modern data platform in the cloud.
As a quick refresher on why lift-and-shift is often not a good approach and why we want to use cloud elasticity, consider this example. An on-prem Data Lake cluster runs on 100 nodes comparable to AWS r5d.4xlarge. With a basic lift-and-shift, 100 nodes of this instance type running around the clock will cost you roughly $1M per year. If the cluster is only needed during business hours (8 hours a day on weekdays), using cloud elasticity properly saves about $750,000 per year.
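To make the arithmetic behind that claim explicit, here is a rough back-of-envelope sketch in Python. The hourly rate is an assumption (approximately the on-demand price of r5d.4xlarge at the time of writing); plug in your own numbers.

```python
# Back-of-envelope estimate: always-on lift-and-shift vs. elastic usage.
NODES = 100
HOURLY_RATE = 1.152           # USD per node-hour (assumed on-demand price)
HOURS_PER_YEAR = 24 * 365     # always-on, lift-and-shift style

BUSINESS_HOURS_PER_DAY = 8
BUSINESS_DAYS_PER_YEAR = 260  # ~52 weeks x 5 days

always_on_cost = NODES * HOURLY_RATE * HOURS_PER_YEAR
elastic_cost = NODES * HOURLY_RATE * BUSINESS_HOURS_PER_DAY * BUSINESS_DAYS_PER_YEAR

print(f"Always-on: ${always_on_cost:,.0f}/year")                  # ~ $1.0M
print(f"Elastic:   ${elastic_cost:,.0f}/year")                    # ~ $0.24M
print(f"Savings:   ${always_on_cost - elastic_cost:,.0f}/year")   # ~ $0.77M
```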
Governance is Important
In addition to lift-and-shift, there are many other pitfalls in using the cloud. Organizations usually, and rightly so, give their employees significantly more freedom on cloud infrastructure than on on-prem infrastructure. However, this sometimes leads to inefficient cloud usage, such as leaving compute instances running when they are not needed, making countless copies of data, and many others. A proven way to approach this challenge is to implement cloud governance: ensure that expenses can be attributed to specific teams or projects and that those teams or projects are held accountable through their reporting metrics. A governance lifecycle framework is another great way to review projects and provide feedback to the teams. Utilizing data federation and data virtualization tools combined with data discovery tools will further reduce the need for copying data and running ETL processes.
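As one illustration of expense attribution, here is a minimal sketch using the AWS Cost Explorer API via boto3. It assumes resources carry a cost-allocation tag named team, which is my own convention for illustration; use whatever tag your organization standardizes on.

```python
# Sketch: attribute monthly cloud spend to teams via a cost-allocation tag.
# Assumes a "team" tag has been activated as a cost-allocation tag in billing.
import boto3

ce = boto3.client("ce")  # AWS Cost Explorer

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2019-07-01", "End": "2019-08-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "team"}],
)

for group in response["ResultsByTime"][0]["Groups"]:
    team = group["Keys"][0]                                # e.g. "team$analytics"
    cost = group["Metrics"]["UnblendedCost"]["Amount"]
    print(f"{team}: ${float(cost):,.2f}")
```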
Security is another aspect of Governance that covers the cloud as well as the data platform. Many studies have found the cloud to be more secure than corporate data centers. However, security requirements on-prem and in the cloud can be different. I have seen security folks transfer technical security requirements that were developed for on-prem solutions directly to the cloud. These requirements are not always transferable and sometimes pose a significant limitation on cloud adoption and on choosing vendors. Make sure that your security experts understand why certain technical security requirements are in place and whether they are applicable in the cloud environment. Often, your vendor can help with the security review.
One of the common security attacks is to acquire AWS keys via GitHub or other sources and launch hundreds or thousands of nodes on your account for bitcoin mining or similar purposes. It’s amazing how many companies have fallen into that trap. An easy countermeasure is to use cloud limits and keep the IP space of your subnets reasonably small to minimize the damage in case of a breach.
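As a minimal illustration of limiting the blast radius, the sketch below creates a deliberately small subnet with boto3; the VPC ID and CIDR block are placeholders. A /28 leaves only 11 usable addresses (AWS reserves 5 per subnet), so even with valid keys an attacker cannot launch hundreds of instances into it.

```python
# Sketch: keep subnets deliberately small so leaked keys cannot be used
# to launch large fleets of instances. IDs and CIDR are placeholders.
import boto3

ec2 = boto3.client("ec2")

subnet = ec2.create_subnet(
    VpcId="vpc-0123456789abcdef0",   # placeholder
    CidrBlock="10.0.42.0/28",        # 16 addresses, 11 usable
    AvailabilityZone="us-east-1a",
)
print(subnet["Subnet"]["SubnetId"])
```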
Data security is another aspect of Governance on the modern data platform. While it can stop some organizations from moving to the cloud due to regulations, internal beliefs, or superstition, it usually can be achieved with proper use of cloud security capabilities, a good governance process, and tools like Apache Ranger. However, keep in mind that data governance management should be centralized and should still cover all of the tools that comprise your Data Platform.
Design Then Build
As with any other project, it’s important to design before building, and it’s just as important not to overengineer. Clouds provide many tools that can do the same job in different ways. Don’t get stuck in analysis paralysis, and do not create countless levels of indirection to isolate yourself from vendors and tools. It is important not only to choose the right tool but also to use it in the simplest way possible. There are some very complex solutions out in the wild for simple requirements, and they are difficult to understand and to manage. Code and design reviews will help address this challenge. Having expertise on your team for all of your key tools is absolutely critical. Don’t be shy about hiring consultants to fill the gaps in expertise.
Cluster elasticity will be limited by the characteristics of the actual jobs. It’s best practice to design, develop, and deploy your code and data in a way that enables linearly scalable processing. For example, well-developed code that takes one hour to run on a 100-node cluster should take 30 minutes on a 200-node cluster. This lets you accommodate almost any SLA and helps with implementing the critical path in a data processing pipeline, a challenge you will face sooner or later. While creating a well-distributed job, you might have to overcome significant challenges, such as data skew.
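One common way to deal with skew is to "salt" the skewed key so that the hot key's rows spread across many partitions. The following PySpark sketch shows the idea; the paths, table names, and columns are hypothetical.

```python
# Sketch: salting a skewed join key in PySpark so work spreads evenly.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("salted-join").getOrCreate()
SALT_BUCKETS = 16

events = spark.read.parquet("s3a://my-bucket/events/")        # large, skewed on customer_id
customers = spark.read.parquet("s3a://my-bucket/customers/")  # smaller dimension table

# Add a random salt to the large, skewed side.
salted_events = events.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Replicate the other side once per salt value so every salt has a match.
salts = spark.range(SALT_BUCKETS).select(F.col("id").cast("int").alias("salt"))
salted_customers = customers.crossJoin(salts)

# Join on the original key plus the salt.
joined = salted_events.join(salted_customers, ["customer_id", "salt"])
```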
Review, Optimize, Deploy
It’s very important to put a process in place for productizing code. As a general practice, most organizations implement CI/CD, and code sometimes goes into production straight from the desk of a developer, whether that developer is a data engineer or a data scientist. Too often, code review and code optimization are not given sufficient consideration. My team was assisting one of my customers, and as part of a Data Platform Health Check I reviewed a Spark-based ETL process that was executed hourly. After spending two hours optimizing the Spark job configuration, I was able to reduce the cost of this job by $1M a year! I have had similar experiences with other customers. I hope that is convincing enough to establish a proper productizing process.
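To give a flavor of what such a review looks at, here is a sketch of typical Spark configuration knobs. The values are purely illustrative, not recommendations; the right numbers depend entirely on the job and the cluster.

```python
# Sketch: common Spark tuning knobs reviewed during job optimization.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hourly-etl")
    .config("spark.executor.memory", "16g")              # size executors to the instance type
    .config("spark.executor.cores", "4")
    .config("spark.sql.shuffle.partitions", "400")       # match data volume, not the default 200
    .config("spark.dynamicAllocation.enabled", "true")   # release executors when idle
    .config("spark.dynamicAllocation.maxExecutors", "50")
    .getOrCreate()
)
```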
Automation is King on the Cloud
The cloud infrastructure was designed from the ground up to be programmatically managed and automated. Automate code deployment, job schedules, auto-scaling, enforcement of governance rules, and so on. Kubernetes is a great new tool available on any cloud. Without going into too much detail, it allows you to deploy your stack on any cloud, which reduces switching costs and your dependency on a specific cloud vendor. To learn more about how you can deploy Dremio on the cloud using Kubernetes services, check out our tutorials for Azure AKS as well as Amazon EKS.
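Coming back to governance, here is one small example of automating such a rule: stopping development instances left running outside business hours. It assumes instances are tagged with schedule=business-hours, which is my own convention for illustration, and it would typically run from a nightly scheduled job.

```python
# Sketch: stop running instances tagged for business-hours-only usage.
import boto3

ec2 = boto3.client("ec2")

pages = ec2.get_paginator("describe_instances").paginate(
    Filters=[
        {"Name": "tag:schedule", "Values": ["business-hours"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)

instance_ids = [
    instance["InstanceId"]
    for page in pages
    for reservation in page["Reservations"]
    for instance in reservation["Instances"]
]

if instance_ids:
    ec2.stop_instances(InstanceIds=instance_ids)  # invoked by a nightly scheduler
```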
Plan Ahead and Optimize Costs
While it might be common to believe that the cloud has unlimited resources, I have seen many customers hit actual cloud capacity limits in various regions. The best way to address this challenge is to make your clusters able to utilize heterogeneous resources. For example, you can let clusters use compute instances from various families and acquire whichever resources are available.
Spot instances are one of the most misunderstood cloud capabilities. Used wisely, they can reduce your cost by 80% in some cases. Make sure to mix them with standard on-demand instances, allow heterogeneous clusters, and use spot instances for processes that are not bound by extremely strict SLAs.
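One way to combine both ideas, heterogeneous instance families and a spot/on-demand mix, is an EC2 Fleet request. The sketch below is illustrative only; the launch template and subnet IDs are placeholders, and managed platforms such as EMR expose similar settings through their own configuration.

```python
# Sketch: request a fleet mixing on-demand and spot capacity across families.
import boto3

ec2 = boto3.client("ec2")

response = ec2.create_fleet(
    LaunchTemplateConfigs=[{
        "LaunchTemplateSpecification": {
            "LaunchTemplateId": "lt-0123456789abcdef0",  # placeholder
            "Version": "$Latest",
        },
        "Overrides": [
            {"InstanceType": "r5d.4xlarge", "SubnetId": "subnet-0123456789abcdef0"},
            {"InstanceType": "r5.4xlarge",  "SubnetId": "subnet-0123456789abcdef0"},
            {"InstanceType": "i3.4xlarge",  "SubnetId": "subnet-0123456789abcdef0"},
        ],
    }],
    TargetCapacitySpecification={
        "TotalTargetCapacity": 100,
        "OnDemandTargetCapacity": 30,   # baseline for SLA-bound work
        "SpotTargetCapacity": 70,       # cheaper, interruptible capacity
        "DefaultTargetCapacityType": "spot",
    },
    Type="maintain",
)
print(response["FleetId"])
```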
Another pitfall that I have often seen is transferring the traditional mindset to the cloud; it’s a variation of lift-and-shift. For example, setting up cluster-local HDFS and using it for multi-stage processes that depend on the data state in that HDFS makes the cluster a non-ephemeral resource that cannot be brought down. Cost is one outcome. Another is the inability to update the software on that cluster, as your vendor most likely did not design for non-ephemeral usage. It’s not that rare to see a cluster that has been up for over a year.
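The alternative is to keep state off the cluster entirely. Here is a minimal PySpark sketch, with placeholder bucket names and paths, that writes an intermediate stage to object storage so the cluster itself can stay ephemeral.

```python
# Sketch: persist intermediate results to object storage, not cluster-local HDFS.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stage-1").getOrCreate()

raw = spark.read.parquet("s3a://my-data-lake/raw/orders/")
cleaned = raw.dropDuplicates(["order_id"]).filter("status IS NOT NULL")

# The next stage (possibly on a brand new cluster) reads from this path,
# not from hdfs:// on this cluster, so this cluster can be torn down.
cleaned.write.mode("overwrite").parquet("s3a://my-data-lake/staging/orders_cleaned/")
```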
Establishing a Center of Excellence with information on stakeholders, best practices, information sources, a communication plan, the latest news, projects, schedules, and much more is a great way to enable your team on the cloud quickly and efficiently. A CoE is not only a web portal; it’s a team of people and a set of processes. A CoE is a simple, inexpensive, and effective tool that is very often overlooked.
Old Doesn’t Mean Useless
Do not discard your existing on-prem data platform just yet. Keep using it for the appropriate use cases and consider it part of the modern data platform that you are building. However, do not let the cost of past investments prevent your organization from moving to the cloud and jeopardize the future of your business.
How Dremio Can Help
Dremio provides an integrated, self-service interface for data lakes. Designed for BI users and data scientists, Dremio incorporates capabilities for data acceleration, data curation, data catalog, and data lineage, all on any source, and delivered as a self-service platform.
Run SQL on any data source, including optimized pushdowns and parallel connectivity to non-relational systems like Elasticsearch, S3, and HDFS.
Accelerate data. Using Data Reflections, a highly optimized representation of source data that is managed as columnar, compressed Apache Arrow for efficient in-memory analytical processing, and Apache Parquet for persistence.
Integrated data curation. Easy for business users, yet sufficiently powerful for data engineers, and fully integrated into Dremio.
Cross-Data Source Joins. Execute high-performance joins across multiple disparate systems and technologies, between relational and NoSQL, S3, HDFS, and more.
Data Lineage. Full visibility into data lineage, from data sources, through transformations, joining with other data sources, and sharing with other users.
Visit our tutorials and resources to learn more about how you can gain insights from your data stored in ADLS, AWS, and more, using Dremio.