The Modern Data Platform Toolbox
The selection of tools today is the largest it has ever been, and the market is very dynamic, with new tools popping up almost every day. It’s impossible to consider all of them in any depth. One needs a framework to decide which tools to evaluate. I suggest starting from the functional view of your data platform. It may look similar to the diagram below. With the Storage and the Workloads separated, this diagram is based on the principle of decoupling data and compute. As we discussed in the previous article, this decoupling is the key to success with your cloud-based modern data platform.
The Right Tool for the Right Job
Data sources may include IoT devices that produce streaming data, and OLTP databases whose operational data can feed fraud detection workloads or provide feedback to the machine learning models that detect behavioral patterns on your website or mobile app. Your vendors may supply daily batches of events from your marketing campaigns, or reference data. The list of potential sources can go on and on. It’s important to understand what your data sources are, along with their nature, technology, integration mechanism, and priority for your business.
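One lightweight way to capture this understanding is a small source inventory. The sketch below is only an illustration: the source names, integration mechanisms, and priority values are hypothetical and should be replaced with your own.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataSource:
    name: str
    nature: str       # "streaming" or "batch"
    technology: str   # what produces or stores the data
    integration: str  # how the platform would ingest it
    priority: int     # 1 = most important to the business


# Hypothetical inventory loosely based on the examples in the text.
SOURCES = [
    DataSource("IoT devices", "streaming", "device telemetry", "message queue", 2),
    DataSource("Orders database", "batch", "OLTP database", "JDBC extract", 1),
    DataSource("Marketing vendor events", "batch", "CSV files", "daily file drop", 3),
]


def by_priority(sources: List[DataSource]) -> List[DataSource]:
    """Return the sources ordered from most to least important."""
    return sorted(sources, key=lambda s: s.priority)


for source in by_priority(SOURCES):
    print(f"{source.priority}. {source.name}: {source.nature}, "
          f"{source.technology}, via {source.integration}")
```

Even a list this small makes it easier to see which integration mechanisms the platform has to support first.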
Data Consumers can also vary significantly. Business people and Data Analysts would like to consume data with tools like Tableau, Power BI, or Excel. Data Scientists are likely to use tools like R, Python, or others capable of running their Machine Learning (ML) models using various frameworks, such as TensorFlow and XGBoost. Your websites might need to consume rankings produced by serving pre-trained ML models. Fraud Detection consumers need timely information to be able to raise a useful alert. Marketing people are probably the ones that don’t let you sleep at night, as they need near-real-time data from all data sources to drive consumer behavior.
Key Elements of a Modern Data Platform
The Data Platform is the environment between data sources and data consumers. It must be able to ingest all this variety of data at scale and produce the desired output in a shape and within a time frame that are acceptable to consumers. The white blocks on the diagram are a set of capabilities that will enable the data platform to do this job. The tools you select should map onto these functional blocks.
Now that we know what we need to put in place, we can start searching for the tools that can do the job most efficiently. I would like to list a few tools, not only as examples, but also because I have had exceptionally successful experience with them on many projects.
Let’s start with Storage. Remember, the data must be decoupled from compute. With a cloud-based data platform, the choice of storage is fairly simple. It’s S3 on AWS, ADLS on Azure, and GCS on Google Cloud. For high-performance use cases, there are other, more expensive options.
Apache Spark is often the tool of choice for ETL and Data Science. It’s an extremely flexible and tunable engine when used properly. It allows you to use SQL, Python, R, and Scala as programming languages. It integrates easily with Jupyter Notebook, which started out as a data science tool and quickly became popular among data engineers as a development tool. Apache Spark has a modern, highly scalable architecture. It can consume a variety of data formats, and it’s highly extensible. With its powerful architectural capabilities, Spark is quickly replacing Apache Hive.
Let’s review a practical scenario, depicted in the diagram, that involves ML model training and model serving to produce marketing campaigns.
The input data comes from a vendor that tracks the clicks and views generated by marketing emails and delivers them as CSV files stored in S3 buckets. Information on website page views and on products added to the shopping cart comes in ORC files via legacy on-prem HDFS. Finally, data on actual orders and customer profiles comes from an Oracle database. All this information allows data scientists to build and train various ML models with the TensorFlow and XGBoost libraries running on an Apache Spark cluster. Data scientists save trained models to S3 buckets for CI/CD. These models are consumed by data engineers and ETL processes. Eventually, the ETL processes generate new marketing campaigns based on the ranked profiles for consumption by the marketing email vendor. Both data scientists and data engineers use Jupyter Notebooks to develop their code. However, while data scientists prefer Python (PySpark) and R, data engineers are likely to use Scala. Note that the entire process is supported by the Apache Spark cluster.
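The ingestion side of this scenario could look roughly like the PySpark sketch below. It is an illustration under several assumptions, not the actual pipeline: every path, hostname, and table name is a hypothetical placeholder, the CSV files are assumed to have a header row, and the cluster is assumed to have the S3 connector and an Oracle JDBC driver available. Only the orders table is shown; customer profiles would be read from Oracle the same way.

```python
from typing import Any, Dict

# Hypothetical locations; replace with your own.
CLICKS_CSV_PATH = "s3a://example-vendor-bucket/email-events/"
WEB_ORC_PATH = "hdfs://example-namenode:8020/web/page_events/"
ORACLE_JDBC_URL = "jdbc:oracle:thin:@//example-db-host:1521/EXAMPLEPDB"
ORDERS_TABLE = "SALES.ORDERS"


def load_sources(spark: Any, db_user: str, db_password: str) -> Dict[str, Any]:
    """Read the three inputs of the scenario into Spark DataFrames.

    `spark` is expected to be a pyspark.sql.SparkSession.
    """
    # Vendor click and view events: CSV files in an S3 bucket.
    clicks = spark.read.option("header", "true").csv(CLICKS_CSV_PATH)

    # Page views and shopping cart events: ORC files on legacy HDFS.
    web_events = spark.read.orc(WEB_ORC_PATH)

    # Orders: a table in the Oracle database, read over JDBC.
    orders = (
        spark.read.format("jdbc")
        .option("url", ORACLE_JDBC_URL)
        .option("dbtable", ORDERS_TABLE)
        .option("user", db_user)
        .option("password", db_password)
        .option("driver", "oracle.jdbc.OracleDriver")
        .load()
    )
    return {"clicks": clicks, "web_events": web_events, "orders": orders}
```

With PySpark installed, a session obtained from SparkSession.builder.getOrCreate() can be passed in as `spark`. The resulting DataFrames can then be joined and handed to the model training code, all on the same cluster.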
While Spark can also be used for ingesting streaming data or as a SQL Query engine for data analytics, there are many other excellent options on the market.
The data platform will likely need a Metastore to keep information about your data, which is also known as metadata. Without a metastore, each tool has to be told separately where every dataset lives, what format it is in, and what its schema is. For example, in the scenario above, Apache Spark could still read the ORC files by path, but it could not expose them as named tables with a schema shared by other tools. While Apache Hive seems to be slowly displaced from modern data platforms, the Hive Metastore is still widely used as the metastore of choice. Interoperability is one of the core capabilities of a metastore. When choosing a tool for the metastore, make sure that it does not lock you in to a vendor and that it can be integrated with all the tools you choose for your data platform.
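To make the idea of metadata concrete, here is a toy model of the kind of information a metastore records for each table: where the data lives, what format it is in, and what its columns are. This is a simplified illustration, not the Hive Metastore API, and the table name, location, and columns are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class TableEntry:
    location: str                         # where the files live
    file_format: str                      # e.g. "orc", "csv", "parquet"
    columns: Tuple[Tuple[str, str], ...]  # (column name, column type) pairs


class ToyMetastore:
    """A dictionary from table names to their metadata."""

    def __init__(self) -> None:
        self._tables: Dict[str, TableEntry] = {}

    def register(self, name: str, entry: TableEntry) -> None:
        self._tables[name] = entry

    def lookup(self, name: str) -> TableEntry:
        try:
            return self._tables[name]
        except KeyError:
            raise LookupError(f"unknown table: {name}") from None


metastore = ToyMetastore()
metastore.register(
    "web.page_events",
    TableEntry(
        location="hdfs://example-namenode:8020/web/page_events/",
        file_format="orc",
        columns=(("customer_id", "bigint"), ("page", "string")),
    ),
)

# Any engine that shares the metastore can resolve the name the same way.
entry = metastore.lookup("web.page_events")
print(entry.file_format, entry.location)
```

Because the lookup is by name, several engines can agree on what a table is without each of them hard-coding paths and schemas. That shared lookup is what makes interoperability possible.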
How Can Dremio Help?
Dremio is a unique tool that covers many functional areas of the data platform. Dremio is a SQL Query engine for the modern data platform that targets sub-second response times while working with datasets at scale. It covers Data Federation and Data Virtualization scenarios. It provides Data Discovery, Data Lineage, and Data Governance capabilities. However, Dremio is not an ETL tool; rather, it complements Apache Spark.
Let’s review a typical scenario of a business analytics platform in motion, in which some data has to be migrated from on-prem HDFS to a cloud-based data lake. In a traditional environment, where Tableau or any other analytical tool is connected to the data source directly, without relying on data virtualization tools, moving a data source is a highly disruptive event. As depicted in the diagram below, the data consumers will need to update their reports and dashboards. You might also need a new data processing engine to serve SQL queries against data on the cloud data lake.
With Dremio, the impact of moving data sources can be reduced to a minimum. As depicted in the diagram below, the only impacted layer will be the virtual tables defined in Dremio. Note that moving a data source has no impact on reports, dashboards, or data consumers when using Dremio.
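The effect can be sketched with a plain SQL view. This is only an analogy: SQLite from the Python standard library stands in for the virtualization layer, it is not the Dremio API, and the table and view names are hypothetical. The point is that the report query is written against the view and never changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The "on-prem" source and a view that consumers query instead of the table.
conn.executescript("""
    CREATE TABLE onprem_orders (order_id INTEGER, amount REAL);
    INSERT INTO onprem_orders VALUES (1, 10.0), (2, 25.5);
    CREATE VIEW orders_v AS SELECT order_id, amount FROM onprem_orders;
""")

# What a report or dashboard runs; it only knows about the view.
REPORT_QUERY = "SELECT COUNT(*), SUM(amount) FROM orders_v"
before = conn.execute(REPORT_QUERY).fetchone()

# Migrate the data to the "cloud" table and repoint the view.
conn.executescript("""
    CREATE TABLE cloud_orders AS SELECT * FROM onprem_orders;
    DROP VIEW orders_v;
    CREATE VIEW orders_v AS SELECT order_id, amount FROM cloud_orders;
    DROP TABLE onprem_orders;
""")

# The same report query still works and returns the same answer.
after = conn.execute(REPORT_QUERY).fetchone()
print(before == after)
```

Only the view definition was touched during the migration, which mirrors the claim that the virtual tables are the only impacted layer.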
Together, Apache Spark, Hive Metastore, and Dremio may cover much of the modern data platform core functionality.
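A quick way to check such a claim against your own diagram is to list the functional blocks you need and subtract what each tool covers. The sketch below uses block names taken from the areas mentioned in this article; they are an assumption, since your diagram may name its blocks differently.

```python
from typing import Dict, Set

# Functional blocks the platform needs (assumed names).
REQUIRED_BLOCKS: Set[str] = {
    "storage",
    "etl",
    "data science",
    "streaming ingestion",
    "sql query engine",
    "metastore",
    "data virtualization",
    "data governance",
}

# Capabilities attributed to each tool in the article.
TOOL_CAPABILITIES: Dict[str, Set[str]] = {
    "Apache Spark": {"etl", "data science", "streaming ingestion", "sql query engine"},
    "Hive Metastore": {"metastore"},
    "Dremio": {
        "sql query engine",
        "data federation",
        "data virtualization",
        "data discovery",
        "data lineage",
        "data governance",
    },
}


def uncovered_blocks(required: Set[str], tools: Dict[str, Set[str]]) -> Set[str]:
    """Return the required blocks that no selected tool covers."""
    covered = set().union(*tools.values())
    return required - covered


print(sorted(uncovered_blocks(REQUIRED_BLOCKS, TOOL_CAPABILITIES)))
```

With these assumed blocks, only storage is left uncovered, and that is the part served by the cloud object store rather than by any of the three tools.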
Most of the modern tools, including Apache Spark, Hive Metastore, and Dremio, are open source. If creating data platforms is not part of your core business, implementing open source tools from scratch will distract you from that business and consume valuable resources. Instead, it would be wise to develop weighted criteria for the areas and functions that are important to you, and to make a short list of vendors whose tools cover the functional blocks in your diagram. That would help you choose the tools and vendors that fit your needs most efficiently.
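Weighted criteria can be as simple as a small scoring table. In the sketch below, the criteria, the weights, the vendor names, and the ratings are all hypothetical; the only point is the mechanics of turning ratings into a comparable score.

```python
from typing import Dict

# Relative importance of each criterion (hypothetical weights).
WEIGHTS: Dict[str, int] = {
    "functional coverage": 40,
    "interoperability": 30,
    "support": 20,
    "cost": 10,
}

# Ratings on a 1 to 5 scale for two made-up vendors.
RATINGS: Dict[str, Dict[str, int]] = {
    "Vendor A": {"functional coverage": 5, "interoperability": 3, "support": 4, "cost": 2},
    "Vendor B": {"functional coverage": 3, "interoperability": 5, "support": 3, "cost": 4},
}


def weighted_score(ratings: Dict[str, int], weights: Dict[str, int]) -> float:
    """Weighted average of the ratings, on the same scale as the ratings."""
    total_weight = sum(weights.values())
    return sum(weights[name] * ratings[name] for name in weights) / total_weight


scores = {vendor: weighted_score(r, WEIGHTS) for vendor, r in RATINGS.items()}
shortlist = sorted(scores, key=scores.get, reverse=True)

for vendor in shortlist:
    print(f"{vendor}: {scores[vendor]:.2f}")
```

Agreeing on the weights before looking at any vendor keeps the comparison honest, because the scores cannot be tuned afterwards to favor a preferred tool.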