
Data Engineering Explained by Dremio

Data is at the center of every business today. Companies use data to answer questions about their businesses, such as:

  • What’s a new customer worth?
  • How can I improve our website?
  • What are the fastest-growing product lines?

In most organizations, many different systems create data. Each system may use a different technology, and each has a distinct owner within the organization. For example, consider data about customers:

  • One system contains information about billing and shipping.
  • Another system maintains the history of orders.
  • And other systems store customer support, behavioral information, and third-party data.

Together this data provides a full understanding of the customer. However, these different data sets are independent of one another. This makes answering certain questions – like what types of orders result in the highest customer support costs – very difficult. This kind of analysis is challenging because the data is managed by different technologies and stored in various structures. Yet, the tools used for analysis assume the data is managed by the same technology, and stored in the same structure.

Even relatively small companies might have millions of customers, and huge amounts of data to comb through to answer these questions. Data engineering is about supporting that process – making it possible for consumers of data, such as analysts, data scientists, and executives, to reliably, quickly, and securely inspect all of the data available.

[Diagram: data engineers move data between the systems where it is created and the data analysts and scientists who use it.]

Data Engineering helps make data more useful and accessible for consumers of data.

Data engineering must source, transform, and analyze data from each system. For example, data stored in a relational database is managed as tables, like an Excel spreadsheet. Each table contains many rows, and all rows have the same columns. A given piece of information, such as a customer order, may be stored across dozens of tables. In contrast, data stored in a NoSQL database such as MongoDB is managed as documents, which are more like Word documents. Each document is flexible and may contain a different set of attributes. When querying the relational database a data engineer would use SQL, whereas MongoDB has a proprietary language that is very different from SQL. Data engineering works with both types of systems, as well as many others, to make it easier for consumers of the data to use all the data together, without mastering all the intricacies of each technology.
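The contrast above can be sketched in a few lines of Python. This is a toy illustration with an invented schema: the relational side uses SQLite (from the standard library) with a SQL join across normalized tables, while the document side is simulated with plain dicts to mimic how a store like MongoDB holds each record as one flexible document queried with JSON-like filters rather than SQL.

```python
import sqlite3

# Relational side: one customer order is spread across normalized tables
# and retrieved with a SQL join. (Hypothetical schema for illustration.)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada');
    INSERT INTO orders VALUES (100, 1, 42.50);
""")
rows = conn.execute("""
    SELECT c.name, o.total
    FROM customers c JOIN orders o ON o.customer_id = c.id
""").fetchall()
print(rows)  # [('Ada', 42.5)]

# Document side: each order is a single flexible document, and documents
# may carry different attributes. Simulated here with plain dicts -- a real
# MongoDB driver call would look more like db.orders.find({...}).
documents = [
    {"order_id": 100, "customer": {"name": "Ada"}, "total": 42.50},
    {"order_id": 101, "customer": {"name": "Bob"}, "total": 10.00, "coupon": "SAVE5"},
]
matches = [d for d in documents if d["customer"]["name"] == "Ada"]
print(matches[0]["total"])  # 42.5
```

Note how the second document carries an extra `coupon` attribute with no schema change – exactly the flexibility (and the analysis headache) the paragraph above describes.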

For these reasons even simple questions can require complex solutions. Working with each system requires understanding the technology, as well as the data. Once data engineering has sourced and curated the data for a given job, it is much easier to use for consumers of the data.

As companies become more reliant on data, the importance of data engineering continues to grow. Since 2012, Google searches for "data engineering" have tripled:

[Chart: Google searches for "data engineering" over time. From Google Trends.]

And in that time, job postings for this role have also increased more than 400%. Just in the past year, they’ve almost doubled.

[Chart: Data Engineering job listings over time. From indeed.com.]

Why?

  • There’s more data than ever before, and data is growing faster than ever before. More data was created in the past two years than in all prior years combined.
  • Data is more valuable to companies, across more business functions. Sales, marketing, finance, and other areas of the business are using data to be more innovative and more effective.
  • The technologies used for data are more complex. Most companies today create data in many systems. They also use a range of different technologies for their data, including relational databases, Hadoop, and NoSQL.
  • Companies are finding more ways to benefit from data. They use data to understand the current state of the business. They are also using data to predict the future, model their customers, prevent threats, and create new kinds of products. Data engineering is the linchpin in all these activities.

As data becomes more complex, this role will continue to grow in importance. And as the demands for data increase, data engineering will become even more critical.

Data Engineering Responsibilities

Data engineers organize data to make it easy for other systems and people to use. They work with many different consumers of data, such as:

  • Data Analysts answer specific questions about data, or build reports and visualizations so that other people can understand the data more easily.
  • Data Scientists answer more complex questions than data analysts do – for example, a data scientist might build a model that predicts which customers are likely to purchase a specific item.
  • Systems Architects are responsible for pulling data into the applications they build. For example, an e-commerce store might offer discounts depending on a user’s purchase history, and the infrastructure for calculating that discount is built by a Systems Architect.
  • Business Leaders understand what the data means and how others will use it.

Data engineering works with each of these groups to understand their specific needs. Their responsibilities include:

  • Gathering data requirements, such as how long the data needs to be stored, how it will be used, and what people and systems need access to the data.
  • Maintaining metadata about the data, such as what technology manages the data, the schema, the size, how the data is secured, the source of the data, and the ultimate owner of the data.
  • Ensuring security and governance for the data, using centralized security controls like LDAP, encrypting the data, and auditing access to the data.
  • Storing the data, using specialized technologies that are optimized for the particular use of the data, such as a relational database, a NoSQL database, Hadoop, Amazon S3, or Azure Blob Storage.
  • Processing data for specific needs, using tools that access data from different sources, transform and enrich the data, summarize the data, and store the data in the storage system.
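The metadata responsibility above is easy to make concrete. Here is a hypothetical metadata record for a single data set (every field value is invented), mirroring the attributes listed: managing technology, schema, size, security, source, owner, and retention.

```python
# Hypothetical metadata record a data engineering team might keep for one
# data set; all names and values are invented for illustration.
dataset_metadata = {
    "name": "customer_orders",
    "technology": "PostgreSQL",        # what technology manages the data
    "schema": {"order_id": "int", "customer_id": "int", "total": "decimal"},
    "size_rows": 12_000_000,
    "security": "LDAP group: analytics-read",
    "source": "orders service",
    "owner": "sales-ops team",         # ultimate owner of the data
    "retention_days": 365,             # how long the data is stored
}

# Governance questions can then be answered directly from the record.
print(dataset_metadata["owner"])       # sales-ops team
```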

To address these responsibilities, they perform many different tasks. Some examples include:

  • Acquisition. Sourcing the data from different systems.
  • Cleansing. Detecting and correcting errors.
  • Conversion. Converting data from one format to another.
  • Disambiguation. Interpreting data that has multiple meanings.
  • De-duplication. Removing duplicate copies of data.

It is common to use most or all of these tasks for any data processing job.
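The tasks above can be sketched as one short Python job over hypothetical customer records (all field names and values are invented): cleansing strips stray whitespace, disambiguation maps several spellings of a country to one value, conversion casts string ids to integers, and de-duplication keeps one record per id.

```python
# Hypothetical raw records sourced (acquired) from two systems.
raw = [
    {"id": "1", "name": " Ada Lovelace ", "country": "UK"},
    {"id": "1", "name": "Ada Lovelace", "country": "UK"},      # duplicate
    {"id": "2", "name": "Bob Smith", "country": "United Kingdom"},
]

# Cleansing: detect and correct errors, here stray whitespace.
cleaned = [{**r, "name": r["name"].strip()} for r in raw]

# Disambiguation: several spellings of one country become one value.
country_map = {"UK": "United Kingdom", "United Kingdom": "United Kingdom"}
disambiguated = [{**r, "country": country_map[r["country"]]} for r in cleaned]

# Conversion: ids arrive as strings; downstream systems want integers.
converted = [{**r, "id": int(r["id"])} for r in disambiguated]

# De-duplication: keep one record per id (the last one seen wins).
deduped = list({r["id"]: r for r in converted}.values())

print(len(deduped))  # 2
```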

What Kinds of Data Sources Does Data Engineering Use?

Companies create data using many different types of technologies. Each technology is specialized for a different purpose – speed, security, and cost are some of the tradeoffs. Application teams choose the technology that is best suited to the system they are building. Data engineering must be capable of working with these technologies and the data they produce.

| Data Source | Applications | Data Structures | Interface | Vendors |
| --- | --- | --- | --- | --- |
| Relational databases (operational) | HR, CRM, financial planning | Tables | SQL | Oracle, Microsoft SQL Server, IBM DB2 |
| Relational databases (analytical) | Data warehouses, data marts | Tables | SQL | Teradata, Vertica, Amazon Redshift, Sybase IQ |
| JSON databases | Web, mobile, social | JSON documents | Proprietary language | MongoDB |
| Key-value systems | Web, mobile, social | Objects | Proprietary language | Memcached, Redis |
| Columnar databases | IoT, machine data | Column families | Proprietary language | Apache Cassandra, Apache HBase |
| File systems | Data storage | Files | API | Hadoop Distributed File System (HDFS) |
| Object stores | Data storage | Objects | API | Amazon S3, Azure Blob Store |
| Spreadsheets | Desktop data analysis | Worksheets | API | Microsoft Excel |

Companies also use vendor applications, such as SAP or Microsoft Exchange. These are applications companies run themselves, or services they use in the cloud, such as Salesforce.com or Google G Suite. Vendor applications manage data in a “black box.” They provide application programming interfaces (APIs) to the data, instead of direct access to the underlying database. APIs are specific to a given application, and each presents a unique set of capabilities and interfaces that require knowledge and following best practices. Furthermore, these APIs evolve over time as new features are added to applications. For example, if your CRM application adds the ability to store the Twitter handle of your customer, the API would change to allow you to access this data. Data engineers must be able to work with these APIs.
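The Twitter-handle example above comes down to writing ingestion code that tolerates API evolution. A minimal sketch, assuming a hypothetical vendor CRM whose JSON customer payload gained a `twitter_handle` field in a later version (all field names invented):

```python
import json

# Hypothetical JSON payloads from two versions of a vendor CRM API.
payload_v1 = '{"id": 7, "name": "Ada Lovelace"}'
payload_v2 = '{"id": 7, "name": "Ada Lovelace", "twitter_handle": "@ada"}'

def parse_customer(payload: str) -> dict:
    """Normalize a customer record, tolerating fields added over time."""
    data = json.loads(payload)
    return {
        "id": data["id"],
        "name": data["name"],
        # .get() keeps the pipeline working against both API versions.
        "twitter_handle": data.get("twitter_handle"),
    }

print(parse_customer(payload_v1)["twitter_handle"])  # None
print(parse_customer(payload_v2)["twitter_handle"])  # @ada
```

Defensive access like `dict.get` is one common way a data engineer keeps a pipeline running while the vendor's "black box" API changes underneath it.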

What Are The Key Data Engineering Skills and Tools?

Data engineers use specialized tools to work with data. Each system presents specific challenges. They must consider the way data is modeled, stored, secured, and encoded. These teams must also understand the most efficient ways to access and manipulate the data.

Data engineering thinks about the end-to-end process as “data pipelines.” Each pipeline has one or more sources, and one or more destinations. Within the pipeline, data may undergo several steps of transformation, validation, enrichment, summarization, or other processing. Data engineers create these pipelines with a variety of technologies such as:

  • ETL Tools. Extract Transform Load (ETL) is a category of technologies that move data between systems. These tools access data from many different technologies. They then apply rules to “transform” and cleanse the data so that it is ready for analysis. For example, an ETL process might extract the postal code from an address field and store this value in a new field so that analysis can easily be performed at the postal code level. Finally, they load the data into a destination system for analysis. Examples of ETL products include Informatica and SAP Data Services.
  • SQL. Structured Query Language (SQL) is the standard language for querying relational databases. Data engineers use SQL to perform ETL tasks within a relational database. SQL is especially useful when the data source and destination are the same type of database. SQL is very popular, well understood by many people, and supported by many tools.
  • Python. Python is a general purpose programming language. It has become a popular tool for performing ETL tasks due to its ease of use and extensive libraries for accessing databases and storage technologies. Many data engineers use Python instead of a dedicated ETL tool because it is more flexible and more powerful for these tasks.
  • Spark & Hadoop. Spark and Hadoop work with large data sets on clusters of computers. They make it easier to apply the power of many computers working together to perform a job on the data. This capability is especially important when the data is too large to be stored on a single computer. Today Spark and Hadoop are not as easy to use as Python, and there are far more people who know and use Python.
  • HDFS & Amazon S3. Data engineering uses HDFS or Amazon S3 to store data during processing. HDFS and Amazon S3 are specialized file systems that can store an essentially unlimited amount of data, making them useful for data science tasks. They are also inexpensive, which is important as processing generates large volumes of data. Finally, these data storage systems are integrated into environments where the data will be processed. This makes managing data systems much easier.
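The postal-code example above can be sketched as a toy extract-transform-load pipeline in Python, using only the standard library. The source data, field names, and destination schema are all invented; SQLite stands in for the destination analytical database.

```python
import csv
import io
import re
import sqlite3

# Extract: hypothetical source addresses, here a small in-memory CSV.
source_csv = io.StringIO(
    "customer,address\n"
    'Ada,"12 Engine St, Boston MA 02134"\n'
    'Bob,"99 Loop Rd, Austin TX 73301"\n'
)

def extract(fh):
    return list(csv.DictReader(fh))

def transform(rows):
    # Pull a 5-digit US ZIP code out of the free-text address field and
    # store it in a new field, so analysis can group by postal code.
    for row in rows:
        match = re.search(r"\b(\d{5})\b", row["address"])
        row["postal_code"] = match.group(1) if match else None
    return rows

def load(rows, conn):
    conn.execute("CREATE TABLE customers (customer TEXT, postal_code TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (:customer, :postal_code)", rows
    )

conn = sqlite3.connect(":memory:")
load(transform(extract(source_csv)), conn)
print(conn.execute("SELECT customer, postal_code FROM customers").fetchall())
# [('Ada', '02134'), ('Bob', '73301')]
```

A production pipeline would add the validation, monitoring, and scheduling the text describes, but the extract → transform → load shape is the same.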

New data technologies emerge frequently, often delivering significant performance, security or other improvements that let data engineers do their jobs better. Many of these tools are licensed as open source software. Open source projects allow teams across companies to easily collaborate on software projects, and to use these projects with no commercial obligations. Since the early 2000s, many of the largest companies who specialize in data, such as Google and Facebook, have created critical data technologies that they have released to the public as open source projects.

Data Engineering vs. Data Science

Data Engineering and Data Science are complementary. Essentially, data engineering ensures that data scientists can look at data reliably and consistently.

  • Data Scientists use technologies such as machine learning and data mining. They also use tools like R, Python, and SAS to analyze data in powerful ways. These technologies assume the data is ready for analysis and gathered together in one place. They communicate their insights using charts, graphs, and visualization tools. Data scientists must be able to explain their results to technical and non-technical audiences.
  • Data Engineering uses tools like SQL and Python to make data ready for data scientists. Data engineering works with data scientists to understand their specific needs for a job. They build data pipelines that source and transform the data into the structures needed for analysis. These data pipelines must be well engineered for performance and reliability. This requires a strong understanding of software engineering best practices. Data engineering also uses monitoring and logging to help ensure reliability. They must design for performance and scalability to work with large data sets and demanding SLAs.

Data engineering makes data scientists more productive. It allows data scientists to focus on what they do best: performing analysis. Without data engineering, data scientists spend the majority of their time preparing data for analysis.

Getting Started

With the right tools data engineers can be significantly more productive. Dremio is a new kind of technology designed specifically for this role. It simplifies and accelerates access to data. Dremio helps data engineers become more strategic and innovative. Learn more about Dremio.