Data Architect’s Modern Cloud Technology Stack
In today’s digital landscape, every company faces challenges including the storage, organization, processing, interpretation, transfer and preservation of data. Due to the constant growth in the volume of information and its diversity, it is very important to keep up to date and make use of cloud data infrastructure that meets your organization’s needs. The data architect is essential to this mission, as he or she can effectively solve issues related to:
- cloud data sources
- cloud storage
- data structure
- data processing
- scalability and availability
- speed of working with data
- and much more
Data architects must be able to seamlessly work with cloud data, software and databases, as well as build data infrastructure systems in the cloud that meet the needs of the business. They are also responsible for the design of the data model used for reporting and business solutions.
This article describes the technology stack that will help data architects cope with the demanding tasks of organizing an efficient and modern cloud data infrastructure.
Python is increasingly used for data analysis, both in science and in the commercial field. It is one of the fastest-growing programming languages due to its ease of use, large community and wide variety of open libraries for big data, data analysis, pipelines, database compatibility and much more.
A variety of tools for big data and data processing are written in Java. Almost all of them are open source projects, which means they are available to everyone. For this reason, they are actively used by IT companies around the world. For example, most Apache data engineering projects are written in Java and Scala.
Scala is an open source, multi-paradigm, high-level programming language with a robust static type system. Scala can interact with code written in Java but is more concise: constructs that take several lines in Java can often be expressed in a single line of Scala. Apache Spark, the framework used for implementing distributed processing of unstructured and semi-structured data, is written in Scala.
R is a multi-paradigm interpreted programming language for statistical data processing. R supports a wide range of statistical and numerical methods and is constantly extended by packages, which are libraries for specific functions or specialized applications. R is widely used for statistical analysis and machine learning.
Although SQL is not a general-purpose programming language, it is used by nearly everyone who works with data. SQL is, first of all, a domain-specific language intended for the description, modification and retrieval of data stored in relational databases. Some cloud data sources do not support SQL queries, but there are solutions that allow SQL to be used as a single query language across many different data sources in the cloud data lake. One of these solutions is Dremio. With Dremio, you can connect to several cloud data lake storage sources and run standard SQL queries against them. Dremio can also be deployed directly to AWS (Dremio AWS Edition), which makes working with Amazon S3 even easier.
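As a minimal sketch of the three roles SQL plays (description, modification and retrieval), the snippet below uses Python’s standard-library `sqlite3` module with an in-memory database. The `orders` table and its columns are hypothetical, and SQLite only stands in for a relational store here; this is not how Dremio itself is queried.

```python
import sqlite3

# In-memory SQLite database; a stand-in for any relational store (assumption).
conn = sqlite3.connect(":memory:")

# Description: define the structure of the data.
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)

# Modification: insert rows, then update some of them.
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("EU", 120.0), ("EU", 80.0), ("US", 200.0)],
)
conn.execute("UPDATE orders SET amount = amount + 10 WHERE region = 'US'")

# Retrieval: total amount per region.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

The same `SELECT ... GROUP BY` statement would look the same against any engine that speaks standard SQL; only the connection code changes.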
Cloud computing means running workloads on ready-made infrastructure that is accessed over the internet. The infrastructure can consist of hundreds or thousands of computing nodes. All these nodes are connected to a single network, effectively functioning as one large computer. Cloud computing allows businesses to develop virtualized IT infrastructure and deliver software through the cloud, independent of a user’s operating system.
There are different ways to store data in the cloud. Storage can follow either a relational or a non-relational model.
Cloud Data Lakes
Often, especially in big data, data comes from different sources and varies in structure and format. For efficiency, it is ideal to have a single store for all of the raw data in the cloud so anyone in an organization can have access to it. This method of storing data is referred to as a “cloud data lake.” Cloud data lakes are convenient for purposes like analytics since they allow you to store large amounts of data in their original form, giving you easy access to all the data you need in one place. Cloud data lakes can be used to store data generated from internal actions of the organization and data collected from external sources.
There are different models of cloud data lake infrastructures. Organizations must decide which type of environment works best for the business.
Cloud Data Lake (public cloud)
A public cloud is infrastructure designed to provide services to third-party customers. Examples include Amazon Web Services (AWS), Google Cloud and Microsoft Azure. The advantages of this model include reducing the need for in-house maintenance staff, accessing rapidly expanding resources and paying only for what you consume, which lets you focus on the business rather than on maintaining the infrastructure.
Nowadays, more and more companies are choosing public cloud data lakes. This technology lets organizations store any amount of data, choose between different pricing options, give every unit of the organization access to data, maintain data security and take advantage of nearly unlimited scalability.
Hybrid Data Lake
Hybrid cloud environments are models where a private cloud solution is combined with public cloud services. This architecture is often used when an organization needs to store sensitive data in the private cloud but wants employees to access applications and resources in the public cloud. Hybrid data lakes combine all the advantages and disadvantages of private cloud and public cloud.
Nowadays, cloud computing is not only about data storage. Cloud computing also includes cloud services provided by service providers. Cloud services are available to customers from the provider’s servers, so there’s no need for a company to host the applications on its own on-prem servers. These fully managed services are designed to provide easy access to applications and resources, without the need for internal infrastructure or hardware.
There are several models of cloud services:
- IaaS (Infrastructure-as-a-Service) – The provider delivers computing resources such as storage servers and networking hardware, and may also offer load balancing, application firewalls and more. Network connectivity between virtual machines and additional disk resources are also provided. A separate part of IaaS is MaaS (Metal as a Service)—providing access to bare hardware without OS (bare-metal servers).
- PaaS (Platform-as-a-Service) – This cloud service model provides a managed platform, such as an operating system, a database and supporting software components, on which customers build and run their own cloud-based services.
- SaaS (Software-as-a-Service) – This is one of the most commonly used cloud services. Providers supply their customers with web-based software. Examples include file storage, backup, web-based email, BI tools and many others.
Stream Processing Platforms
One of the best-known characteristics of information in our time is its continuity. Most data is created in the form of continuous flows. Data is generated from sensors, mobile devices, user activity, machines, networks, applications, IoT and more. To maximize the benefits of constantly generated data, organizations must have real-time data analysis tools in place. There are various types of stream processing platforms available, all with different features and advantages. Businesses can choose the stream processing platform that best meets their unique needs.
Modern stream processing platforms typically have the following characteristics:
- Responds to events instantly via applications and analytics
- Maintains own data and state, removing the need for large and expensive databases
- Handles big volumes of data
- Ensures the continuous and timely nature of data—data is processed continuously, and not on a periodic basis
- Delivers anytime, anywhere access
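The difference between continuous and periodic processing, and the idea of a processor that keeps its own state, can be sketched in plain Python. This is a toy model under stated assumptions: `running_average` and the sensor readings are hypothetical names and values, not the API of any stream processing platform.

```python
from typing import Iterable, Iterator, Tuple


def running_average(events: Iterable[float]) -> Iterator[Tuple[int, float]]:
    """Emit an updated average after every event instead of once per batch."""
    count = 0    # state kept by the processor itself, no external database
    total = 0.0
    for value in events:
        count += 1
        total += value
        yield count, total / count


# Hypothetical sensor readings arriving one at a time.
readings = [20.0, 22.0, 21.0, 25.0]
results = list(running_average(readings))
for seen, average in results:
    print(f"after {seen} events: average = {average}")
```

Because the function is a generator, a result is available as soon as each event arrives; a batch job would instead wait for all readings before computing a single average.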
Below are a few examples of existing stream processing platforms.
Apache Spark is an open source framework for implementing distributed processing of unstructured and semi-structured data. Unlike Hadoop’s classic MapReduce engine, which implements the two-stage MapReduce model by writing intermediate data to disk, Spark can process data read from cloud data lake storage in memory, which yields significant speedups for some use cases. In addition to its in-memory processing, graph processing and machine learning features, Spark can also handle streaming data.
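The MapReduce model mentioned above can be sketched in a few lines of plain Python. This is an illustration only, with hypothetical function names (`map_stage`, `reduce_stage`); it uses neither the Hadoop nor the Spark API. The intermediate pairs are held in an ordinary in-memory list here, whereas classic MapReduce writes them to disk between the two stages.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_stage(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in the input."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def reduce_stage(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce: group the pairs by word and sum the counts."""
    counts: Dict[str, int] = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


lines = ["spark keeps data in memory", "hadoop writes data to disk"]
intermediate = map_stage(lines)   # in memory here; on disk in classic MapReduce
word_counts = reduce_stage(intermediate)
print(word_counts)
```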
Apache Flink is an open source stream processing framework developed by the Apache Software Foundation. Flink executes dataflow programs in both a data-parallel and a pipelined manner. It offers high throughput and low latency. Flink applications can be event-driven and can maintain state. Tasks in Flink are fault-tolerant.
Apache Storm is a distributed real-time computation system. It can be used for real-time analytics, machine learning, continuous computation and more. Apache Storm was designed to be used with any programming language.
Amazon Kinesis is a cloud service used for processing a large number of distributed data streams in real time. The service is part of the AWS infrastructure. It allows developers to ingest data from many sources, increasing or decreasing the number of sources as needed. In terms of functionality, it has some similarities to Apache Kafka. You can read more about Amazon Kinesis in this Dremio article, Data Preprocessing in Amazon Kinesis.
Apache Kafka is a distributed software message broker. It allows systems that generate data to save their data in real time in a Kafka topic. Any topic can then be read by any number of systems that need that data in real time. Systems that generate data are called producers and systems that request information are called consumers. A main reason for using Kafka is reliable delivery: messages are persisted, so consumers can receive what producers wrote even if they read it later. Kafka is often used in real-time data pipelines.
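The producer, topic and consumer roles described above can be illustrated with a small in-memory model. `ToyBroker` is a hypothetical class written for this sketch; it is not the Kafka client API and leaves out partitions, replication and persistence. It only shows that a topic behaves like an append-only log that several consumers read independently.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class ToyBroker:
    """In-memory stand-in for a message broker (illustration only)."""

    def __init__(self) -> None:
        # topic name -> append-only list of messages
        self.topics: Dict[str, List[str]] = defaultdict(list)
        # (consumer, topic) -> position of the next unread message
        self.offsets: Dict[Tuple[str, str], int] = defaultdict(int)

    def produce(self, topic: str, message: str) -> None:
        self.topics[topic].append(message)

    def consume(self, consumer: str, topic: str) -> List[str]:
        """Return messages this consumer has not read yet and advance its offset."""
        log = self.topics[topic]
        start = self.offsets[(consumer, topic)]
        self.offsets[(consumer, topic)] = len(log)
        return log[start:]


broker = ToyBroker()
broker.produce("clicks", "user1:home")
broker.produce("clicks", "user2:cart")

first_read = broker.consume("analytics", "clicks")
broker.produce("clicks", "user1:checkout")
second_read = broker.consume("analytics", "clicks")
billing_read = broker.consume("billing", "clicks")   # a second, independent consumer
print(first_read, second_read, billing_read)
```

Each consumer tracks its own position, so the late-arriving `billing` consumer still sees every message, while `analytics` only receives what is new since its last read.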
Data Lake Engine
After the implementation of a data storage model suitable for your organization, for example, a cloud data lake, you also need to ensure that your organization has quick and efficient access to data. The layer between the cloud data storage (the cloud data lake) and the BI or data science system is called the data lake engine. This is another technology that a data architect should be familiar with.
Data is often stored on multiple systems or in non-relational cloud stores, such as Microsoft ADLS, Amazon S3 and NoSQL databases, and consuming data from the cloud data lake can be difficult and time consuming. As a result, companies often copy the data to data warehouses, create cubes and build separate extracts for each use case and each analysis tool. Cloud data lake engines simplify this process by allowing companies to leave data where it is already stored while still providing quick access for data consumers.
Dremio is a powerful cloud data lake engine that makes data accessible to analysis tools at interactive speed without copying or moving data. It was designed for data consumers and data architects. Dremio provides a user-generated semantic layer with an integrated, searchable catalog that indexes all metadata so business users can easily make sense of all their data. It can connect to any BI or data science tool. The engine offers ANSI SQL capabilities, including complex joins across data sets in the data lake, large aggregations, common table expressions, sub-selects, window functions and statistical functions. It also makes data interaction up to 100X faster than traditional SQL engine approaches thanks to data reflections, which maintain one or more physically optimized representations of a dataset in memory. Data reflections are transparent to end users, so they can be added and revised without changing the SQL of client applications.
Pandas is a library for data processing and analysis for the Python programming language. Pandas provides special data structures and operations for manipulating numerical tables and time series. Its main purpose is to make it possible to do not only data collection and cleaning but also data analysis and modeling within the Python environment. The library is primarily intended for the cleaning and initial assessment of data by general indicators, for example, average value, quantiles and so on. It is not a complete statistical package; however, its DataFrame and Series types are used as inputs in most data analysis and machine learning modules.
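A short sketch of the cleaning and initial assessment described above, assuming pandas is installed. The column names and values are hypothetical; the calls used (`drop_duplicates`, `dropna`, `mean`, `quantile`) are standard DataFrame and Series methods.

```python
import pandas as pd

# Hypothetical raw measurements with one duplicate row and one missing value.
raw = pd.DataFrame(
    {
        "sensor": ["a", "a", "b", "b", "b"],
        "value": [1.0, 1.0, 4.0, None, 6.0],
    }
)

# Cleaning: drop exact duplicate rows, then rows with missing values.
clean = raw.drop_duplicates().dropna()

# Initial assessment: mean and quartiles of the cleaned column.
mean_value = clean["value"].mean()
quartiles = clean["value"].quantile([0.25, 0.5, 0.75])
print(mean_value)
print(quartiles)
```

The resulting `clean` DataFrame is the kind of object that plotting, statistics and machine learning libraries typically accept as input.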
A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another. Docker is software used for automating the deployment and management of applications in containerized environments. It allows you to “pack” the application with all its surroundings and dependencies into a container so that it can be ported to any system, and it also provides a container management environment.
The role of a data architect in different organizations can vary, but their common goal is to provide an architecture that facilitates the most efficient transfer, processing and storage of data. An architect must work with business stakeholders and development teams to guide development efforts.
It is important to keep up to date and implement modern technologies that will give your organization a competitive edge. More and more companies today are using cloud computing. No matter what data you need to store, cloud data lakes will do it in the right way. A cloud data lake will always fit the size of your company and scale with it. And the use of a cloud data lake engine will provide a completely different and efficient approach to data analysis. You no longer need to think about data formatting and compatibility: cloud data lake engines provide fast data transfer for analysis, regardless of the data storage model and BI tools you’re using.