What Is Data Engineering?
Data is at the center of every business today. Companies use data to answer questions about the business, such as:
- What’s a new customer worth?
- How can I improve our website?
- What are the fastest-growing product lines?
In most organizations, many different systems generate data, and each system typically uses different technology and has a distinct owner within the organization. For example, consider data about customers:
- One system contains information about billing and shipping
- Another system maintains order history
- And other systems store customer support, behavioral information and third-party data
Together, this data provides a comprehensive view of the customer. However, these different datasets are independent of one another, which makes answering certain questions — like what types of orders result in the highest customer support costs — very difficult. Data analysis is challenging because the data is managed by different technologies and stored in various structures. Yet, the tools used for analysis assume the data is managed by the same technology, and stored in the same structure.
Companies of all sizes have huge amounts of disparate data to comb through to answer critical business questions. Data engineering supports this process, making it possible for consumers of data, such as analysts, data scientists and executives, to inspect all of the available data reliably, quickly and securely.
Data engineering helps make data more useful and accessible for consumers of data. To do so, data engineering must source, transform and analyze data from each system. For example, data stored in a relational database is managed as tables, like a Microsoft Excel spreadsheet. Each table contains many rows, and all rows have the same columns. A given piece of information, such as a customer order, may be stored across dozens of tables.
In contrast, data stored in a NoSQL database such as MongoDB is managed as documents, which are more like Word documents. Each document is flexible and may contain a different set of attributes. When querying the relational database, a data engineer uses SQL, whereas MongoDB has its own query language that is very different from SQL. Data engineering works with both types of systems, as well as many others, to make it easier for consumers of the data to use all the data together, without having to master all the intricacies of each technology.
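To make the contrast concrete, here is a minimal sketch in Python: the same customer order stored relationally across normalized tables and reassembled with a SQL join, versus stored as a single flexible document. The table layout, names and values are invented for illustration.

```python
# A minimal sketch contrasting the two storage models described above.
# Table names, columns and sample values are illustrative, not from a real system.
import sqlite3

# Relational model: a customer order is split across normalized tables,
# and a SQL join reassembles it into one view.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders    (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada Lovelace');
    INSERT INTO orders    VALUES (100, 1, 42.50);
""")
rows = conn.execute("""
    SELECT c.name, o.id, o.total
    FROM customers c JOIN orders o ON o.customer_id = c.id
""").fetchall()
print(rows)  # [('Ada Lovelace', 100, 42.5)]

# Document model: the same order as one self-contained document.
# Two documents need not share the same set of attributes.
order_doc = {
    "customer": {"name": "Ada Lovelace"},
    "order_id": 100,
    "total": 42.50,
    "gift_wrap": True,  # present on some documents, absent on others
}
```

Analysis tools generally assume one of these models, which is why combining data from both kinds of systems takes engineering work.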
For these reasons, even simple business questions can require complex solutions. Working with each system requires understanding the technology, as well as the data. Once data engineering has sourced and curated the data for a given job, it is much easier to use for consumers of the data.
Trends & History
As companies become more reliant on data, the importance of data engineering continues to grow. Since 2012, Google searches for the phrase “data engineering” have tripled:
[Figure: Google searches for “data engineering”. Source: Google Trends.]
And in that time, job postings for this role have also increased more than 50%. Just in the past year, they’ve almost doubled.
[Figure: Data engineering job listings. Source: Dice Tech Job Report 2020.]
Why?
- There’s more data than ever before, and data is growing faster than ever before. 90% of the data that exists today has been created in the last two years.
- Data is more valuable to companies, and across more business functions—sales, marketing, finance and other areas of the business are using data to be more innovative and more effective.
- The technologies used for data are more complex. Most companies today create data in many systems and use a range of different technologies for their data, including relational databases, Hadoop and NoSQL.
- Companies are finding more ways to benefit from data. They use data to understand the current state of the business, predict the future, model their customers, prevent threats and create new kinds of products. Data engineering is the linchpin in all these activities.
- As data becomes more complex and the demands for data increase, data engineering will only grow in importance.
Data Engineering Responsibilities
Data engineers organize data to make it easy for other systems and people to use. They work with many different consumers of data, such as:
- Data analysts who answer specific questions about data, or build reports and visualizations so that other people can understand the data more easily.
- Data scientists who answer more complex questions than data analysts do. For example, a data scientist might build a model that predicts which customers are likely to purchase a specific item.
- Systems architects who are responsible for pulling data into the applications they build. For example, an e-commerce store might offer discounts depending on a user’s purchase history, and the infrastructure for calculating that discount is built by a systems architect.
- Business leaders who need to understand what the data means and how others will use it.
Data engineering works with each of these groups and must understand their specific needs. Responsibilities include:
- Gathering data requirements, such as how long the data needs to be stored, how it will be used and what people and systems need access to the data.
- Maintaining metadata about the data, such as what technology manages the data, the schema, the size, how the data is secured, the source of the data and the ultimate owner of the data (a sketch of one such record follows this list).
- Ensuring security and governance for the data, using centralized security controls like LDAP, encrypting the data, and auditing access to the data.
- Storing the data, using specialized technologies that are optimized for the particular use of the data, such as a relational database, a NoSQL database, Hadoop, Amazon S3 or Azure Blob Storage.
- Processing data for specific needs, using tools that access data from different sources, transform and enrich the data, summarize the data and store the data in the storage system.
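As an illustration of the metadata responsibility above, here is a minimal sketch of what one dataset’s metadata record might look like. The field names and values are assumptions for illustration, not a standard schema; note that it also captures the retention, security and ownership details mentioned in the surrounding bullets.

```python
# A minimal sketch of a dataset metadata record; fields are illustrative.
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    name: str                 # logical dataset name
    technology: str           # system that manages the data
    schema: dict              # column name -> type
    size_bytes: int           # approximate size on disk
    source: str               # where the data originates
    owner: str                # team ultimately responsible for the data
    retention_days: int       # how long the data must be stored
    allowed_groups: list = field(default_factory=list)  # e.g., LDAP groups with access
    encrypted_at_rest: bool = True

customers = DatasetMetadata(
    name="customers",
    technology="PostgreSQL",
    schema={"id": "INTEGER", "name": "TEXT", "email": "TEXT"},
    size_bytes=2_500_000_000,
    source="CRM nightly export",
    owner="customer-data-team",
    retention_days=365 * 7,
    allowed_groups=["analysts", "data-science"],
)
```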
To address these responsibilities, data engineers perform many different tasks. Some examples include:
- Acquisition: Sourcing the data from different systems
- Cleansing: Detecting and correcting errors
- Conversion: Converting data from one format to another
- Disambiguation: Interpreting data that has multiple meanings
- De-duplication: Removing duplicate copies of data
It is common to use most or all of these tasks in a single data processing job; the sketch below strings several of them together.
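Here is a minimal sketch that applies cleansing, conversion and de-duplication to one small batch of records. The input records and field names are invented for illustration.

```python
# A minimal sketch of several of the tasks listed above, applied to one batch.
raw = [
    {"email": "ADA@EXAMPLE.COM ", "order_total": "42.50"},
    {"email": "ada@example.com",  "order_total": "42.50"},   # duplicate
    {"email": "grace@example.com", "order_total": "n/a"},    # bad value
]

def cleanse(rec):
    """Cleansing: trim whitespace and normalize case; reject bad totals."""
    email = rec["email"].strip().lower()
    try:
        total = float(rec["order_total"])   # Conversion: string -> float
    except ValueError:
        return None                          # detected error, record rejected
    return {"email": email, "order_total": total}

cleaned = [r for r in (cleanse(rec) for rec in raw) if r is not None]

# De-duplication: keep one record per email address.
deduped = {r["email"]: r for r in cleaned}.values()
print(list(deduped))  # [{'email': 'ada@example.com', 'order_total': 42.5}]
```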
Types of Data Sources
Companies create data using many different types of technologies. Each technology is specialized for a different purpose — speed, security and cost are some of the trade-offs. Application teams choose the technology that is best suited to the system they are building. Data engineering must be capable of working with these technologies and the data they produce.
| Data Source | Applications | Data Structures | Interface | Vendors |
| --- | --- | --- | --- | --- |
| Relational databases (operational) | HR, CRM, financial planning | Tables | SQL | Oracle, Microsoft SQL Server, IBM DB2 |
| Relational databases (analytical) | Data warehouses, data marts | Tables | SQL | Teradata, Vertica, Amazon Redshift, Sybase IQ |
| JSON databases | Web, mobile, social | JSON documents | Proprietary language | MongoDB |
| Key-value systems | Web, mobile, social | Objects | Proprietary language | Memcached, Redis |
| Columnar databases | IoT, machine data | Column families | Proprietary language | Apache Cassandra, Apache HBase |
| File systems | Data storage | Files | API | Hadoop Distributed File System (HDFS) |
| Object stores | Data storage | Objects | API | Amazon S3, Azure Blob Store |
| Spreadsheets | Desktop data analysis | Worksheets | API | Microsoft Excel |
Companies also use vendor applications, such as SAP or Microsoft Exchange. These may be applications companies run themselves, or services they use in the cloud, such as Salesforce.com or Google G Suite. Vendor applications manage data in a “black box”: they provide application programming interfaces (APIs) to the data instead of direct access to the underlying database. Each API is specific to a given application and presents a unique set of capabilities and interfaces that require specific knowledge and best practices. Furthermore, these APIs evolve over time as new features are added to the applications. For example, if your CRM application adds the ability to store the Twitter handle of your customer, the API would change to allow you to access this data. Data engineers must be able to work with these APIs.
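As a concrete illustration, here is a minimal sketch of pulling customer records through such an API. The endpoint, parameters and response shape are hypothetical; any real vendor API will differ and, as noted above, will change over time.

```python
# A minimal sketch of reading data through a vendor API rather than the
# underlying database. The endpoint and response shape are hypothetical.
import requests

def fetch_all_customers(base_url, api_key):
    """Page through a hypothetical REST endpoint until no results remain."""
    page = 1
    while True:
        resp = requests.get(
            f"{base_url}/customers",             # hypothetical endpoint
            params={"page": page},
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:                            # empty page: no more data
            break
        yield from batch
        page += 1
```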
Key Data Engineering Skills and Tools
Data engineers use specialized tools to work with data. Each system presents its own challenges: engineers must consider how the data is modeled, stored, secured and encoded, and they must understand the most efficient ways to access and manipulate that data.
Data engineering thinks about the end-to-end process as “data pipelines.” Each pipeline has one or more sources and one or more destinations. Within the pipeline, data may undergo transformation, validation, enrichment, summarization and other steps. Data engineers create these pipelines with a variety of technologies, such as:
- ETL Tools. Extract, Transform, Load (ETL) is a category of technologies that move data between systems. These tools access data from many different technologies, then apply rules to “transform” and cleanse the data so that it is ready for analysis. For example, an ETL process might extract the postal code from an address field and store this value in a new field so that analysis can easily be performed at the postal code level. Then the data is loaded into a destination system for analysis. Examples of ETL products include Informatica and SAP Data Services. (A plain-Python sketch of this postal-code example appears after this list.)
- SQL. Structured Query Language (SQL) is the standard language for querying relational databases. Data engineers use SQL to perform ETL tasks within a relational database. SQL is especially useful when the data source and destination are the same type of database. SQL is very popular and well-understood by many people and supported by many tools.
- Python. Python is a general-purpose programming language. It has become a popular tool for performing ETL tasks due to its ease of use and extensive libraries for accessing databases and storage technologies. Many data engineers use Python instead of an ETL tool because it is more flexible and more powerful for these tasks.
- Spark and Hadoop. Spark and Hadoop work with large datasets on clusters of computers. They make it easier to apply the power of many computers working together to perform a job on the data. This capability is especially important when the data is too large to be stored on a single computer. Today, Spark and Hadoop are not as easy to use as Python, and there are far more people who know and use Python.
- HDFS and Amazon S3. Data engineering uses HDFS or Amazon S3 to store data during processing. HDFS and Amazon S3 are specialized storage systems that can hold an essentially unlimited amount of data, making them useful for data engineering tasks. They are also inexpensive, which is important because processing generates large volumes of data. Finally, these storage systems are integrated into the environments where the data is processed, which makes managing data systems much easier.
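To ground the ETL pattern described above, here is a minimal sketch in plain Python: extract customer rows from a CSV file, transform them by pulling the postal code out of the address field, and load the result into SQLite for analysis. The file name, column layout and US-style postal code pattern are assumptions for illustration.

```python
# A minimal ETL sketch: Extract from CSV, Transform by deriving a postal
# code column, Load into SQLite. File name and columns are illustrative.
import csv
import re
import sqlite3

POSTAL_CODE = re.compile(r"\b(\d{5})(?:-\d{4})?\b")  # assumes US-style codes

def transform(row):
    """Add a postal_code column derived from the free-text address field."""
    match = POSTAL_CODE.search(row["address"])
    row["postal_code"] = match.group(1) if match else None
    return row

conn = sqlite3.connect("analytics.db")
conn.execute("""CREATE TABLE IF NOT EXISTS customers
                (name TEXT, address TEXT, postal_code TEXT)""")

with open("customers.csv", newline="") as f:               # Extract
    rows = (transform(row) for row in csv.DictReader(f))   # Transform
    conn.executemany(                                       # Load
        "INSERT INTO customers VALUES (:name, :address, :postal_code)", rows
    )
conn.commit()
```

With the postal code in its own column, an analyst can group and filter at the postal code level without parsing addresses themselves.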
New data technologies emerge frequently, often delivering significant performance, security or other improvements that let data engineers do their jobs better. Many of these tools are licensed as open source software. Open source projects allow teams across companies to easily collaborate on software projects, and to use these projects with no commercial obligations. Since the early 2000s, many of the largest companies who specialize in data, such as Google and Facebook, have created critical data technologies that they have released to the public as open source projects.
Data Engineering vs. Data Science
Data engineering and data science are complementary. Essentially, data engineering ensures that data scientists can look at data reliably and consistently.
- Data scientists use technologies such as machine learning and data mining. They also use tools like R, Python and SAS to analyze data in powerful ways. These technologies assume the data is ready for analysis and gathered together in one place. They communicate their insights using charts, graphs and visualization tools. Data scientists must be able to explain their results to technical and non-technical audiences.
- Data engineering uses tools like SQL and Python to make data ready for data scientists. Data engineers work with data scientists to understand their specific needs for a job, then build data pipelines that source and transform the data into the structures needed for analysis. These pipelines must be well-engineered for performance and reliability, which requires a strong understanding of software engineering best practices. Data engineers also rely on monitoring and logging to help ensure reliability, and must design for performance and scalability to handle large datasets and demanding SLAs (a small sketch of this practice follows).
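As a small illustration of these engineering practices, here is a minimal sketch of a pipeline step wrapped with logging and simple retries, so failures are visible and transient errors do not kill the whole job. The function and step names are placeholders, not part of any real framework.

```python
# A minimal sketch of logging and retries around a pipeline step.
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def run_step(name, fn, retries=3, backoff_seconds=5):
    """Run one pipeline step, logging outcomes and retrying on failure."""
    for attempt in range(1, retries + 1):
        try:
            log.info("step=%s attempt=%d starting", name, attempt)
            result = fn()
            log.info("step=%s succeeded", name)
            return result
        except Exception:
            log.exception("step=%s attempt=%d failed", name, attempt)
            if attempt == retries:
                raise                       # exhausted retries: surface the error
            time.sleep(backoff_seconds * attempt)

# Usage: run_step("load_orders", load_orders), where load_orders is your own step.
```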
Data engineering makes data scientists more productive, allowing them to focus on what they do best: performing analysis. Without data engineering, data scientists would spend the majority of their time preparing data for analysis.
Getting Started
With the right tools, data engineers can be significantly more productive. Dremio helps companies get more value from their data, faster. Dremio makes data engineers more productive, and data consumers more self-sufficient.
Additional Resources
Data Engineering Modern Cloud Technology Stack
In today’s digital landscape, every company faces challenges including the storage, organization, processing, interpretation, transfer and preservation of data. With the constant growth in the volume and diversity of information, it is important to stay up to date and adopt cloud data infrastructure that meets your organization’s needs.
Learn More
Functional Data Engineering - A Set of Best Practices
In this talk, we’ll discuss the functional programming paradigm and explore how applying it to data engineering can bring a lot of clarity to the process. It helps solve some of the inherent problems of ETL, leads to more manageable and maintainable workloads and helps to implement reproducible and scalable practices. It empowers data teams to tackle larger problems and push the boundaries of what’s possible.
Watch the Video
Modern Architecture for Comprehensive BI and Analytics
How can modern enterprises scale to handle all of their data? Tomer Shiran, cofounder and CEO of Dremio, told Upside why he thinks it’s all about the data lake.
Read the Article