Data is at the center of every business today. Companies use data to answer questions about their businesses, such as which types of orders result in the highest customer support costs.
In most organizations, many different systems create data. Each system may use a different technology, and each has a distinct owner within the organization. For example, data about customers might include orders in one system and support tickets in another.
Together this data provides a full understanding of the customer. However, these different data sets are independent of one another. This makes answering certain questions – like what types of orders result in the highest customer support costs – very difficult. This kind of analysis is challenging because the data is managed by different technologies and stored in various structures. Yet, the tools used for analysis assume the data is managed by the same technology, and stored in the same structure.
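To make the problem concrete, here is a minimal sketch, using hypothetical in-memory data, of the kind of cross-system join needed to answer "which order types drive the highest support costs?" In a real company, the orders might live in a relational database and the support tickets in a CRM, each behind a different technology.

```python
# Hypothetical data from two independent systems, already extracted
# into a common in-memory form (the hard part data engineering solves).
from collections import defaultdict

# Orders, as they might come from an e-commerce system.
orders = [
    {"order_id": 1, "customer_id": "c1", "order_type": "express"},
    {"order_id": 2, "customer_id": "c2", "order_type": "standard"},
    {"order_id": 3, "customer_id": "c1", "order_type": "express"},
]

# Support tickets, as they might come from a CRM, keyed by order.
tickets = [
    {"order_id": 1, "cost": 40.0},
    {"order_id": 3, "cost": 25.0},
]

# Join the two data sets on order_id, then total support cost per order type.
cost_by_order = {t["order_id"]: t["cost"] for t in tickets}
cost_by_type = defaultdict(float)
for order in orders:
    cost_by_type[order["order_type"]] += cost_by_order.get(order["order_id"], 0.0)

print(dict(cost_by_type))
```

The join itself is trivial; the real work is getting both data sets out of their source systems and into one consistent structure in the first place.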
Even relatively small companies might have millions of customers, and huge amounts of data to comb through to answer these questions. Data engineering is about supporting that process – making it possible for consumers of data, such as analysts, data scientists, and executives – to reliably, quickly, and securely inspect all of the data available.
Data Engineering helps make data more useful and accessible for consumers of data.
Data engineering must source, transform, and analyze data from each system. For example, data stored in a relational database is managed as tables, like an Excel spreadsheet. Each table contains many rows, and all rows have the same columns. A given piece of information, such as a customer order, may be stored across dozens of tables. In contrast, data stored in a NoSQL database such as MongoDB is managed as documents, which are more like Word documents. Each document is flexible and may contain a different set of attributes. When querying the relational database, a data engineer would use SQL, whereas MongoDB has a proprietary language that is very different from SQL. Data engineering works with both types of systems, as well as many others, to make it easier for consumers of the data to use all the data together, without mastering all the intricacies of each technology.
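The contrast between the two models can be sketched in a few lines of Python, using the built-in sqlite3 module for the relational side and a plain dict for the document side. The table and field names here are illustrative, not taken from any real system.

```python
import sqlite3

# Relational: a customer order split across normalized tables, queried with SQL.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
db.execute("INSERT INTO customers VALUES (1, 'Ada')")
db.execute("INSERT INTO orders VALUES (100, 1, 59.9)")
row = db.execute(
    "SELECT c.name, o.total FROM orders o JOIN customers c ON o.customer_id = c.id"
).fetchone()

# Document: the same order as one self-contained, flexible document.
# A second document could add or omit attributes without any schema change.
order_doc = {
    "_id": 100,
    "customer": {"name": "Ada"},
    "total": 59.9,
}

print(row)
print(order_doc["customer"]["name"])
```

In the relational model the order is reassembled with a join; in the document model it arrives whole, but two documents in the same collection need not share a structure. Data engineering has to read both shapes and reconcile them.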
For these reasons even simple questions can require complex solutions. Working with each system requires understanding the technology as well as the data. Once data engineering has sourced and curated the data for a given job, it is much easier for consumers of the data to use.
As companies become more reliant on data, the importance of data engineering continues to grow. Since 2012, Google searches for the phrase "data engineering" have tripled:
Google searches for Data Engineering. From Google Trends.
And in that time, job postings for this role have also increased more than 400%. Just in the past year, they’ve almost doubled.
Data Engineering job listings. From indeed.com.
As data grows more complex and the demand for it increases, data engineering will become even more critical.
Data engineers organize data to make it easy for other systems and people to use. They work with many different consumers of data, such as analysts, data scientists, and executives.
Data engineering works with each of these groups to understand their specific needs. Its responsibilities include sourcing, curating, and securing data for those consumers. To meet these responsibilities, data engineers perform many different tasks, such as transforming, validating, enriching, and summarizing data. Most data processing jobs involve most or all of these tasks.
Companies create data using many different types of technologies. Each technology is specialized for a different purpose – speed, security, and cost are some of the tradeoffs. Application teams choose the technology that is best suited to the system they are building. Data engineering must be capable of working with these technologies and the data they produce.
| Data Source | Applications | Data Structures | Interface | Vendors |
| --- | --- | --- | --- | --- |
| Relational databases (operational) | HR, CRM, financial planning | Tables | SQL | Oracle, Microsoft SQL Server, IBM DB2 |
| Relational databases (analytical) | Data warehouses, data marts | Tables | SQL | Teradata, Vertica, Amazon Redshift, Sybase IQ |
| JSON databases | Web, mobile, social | JSON documents | Proprietary language | MongoDB |
| Key-value systems | Web, mobile, social | Objects | Proprietary language | Memcached, Redis |
| Columnar databases | IoT, machine data | Column families | Proprietary language | Apache Cassandra, Apache HBase |
| File systems | Data storage | Files | API | Hadoop Distributed File System (HDFS) |
| Object stores | Data storage | Objects | API | Amazon S3, Azure Blob Store |
| Spreadsheets | Desktop data analysis | Worksheets | API | Microsoft Excel |
Companies also use vendor applications, such as SAP or Microsoft Exchange. These are applications companies run themselves, or services they use in the cloud, such as Salesforce.com or Google G Suite. Vendor applications manage data in a “black box.” They provide application programming interfaces (APIs) to the data, instead of direct access to the underlying database. APIs are specific to a given application, and each presents a unique set of capabilities and interfaces that require specific knowledge and adherence to best practices. Furthermore, these APIs evolve over time as new features are added to applications. For example, if your CRM application adds the ability to store the Twitter handle of your customer, the API would change to allow you to access this data. Data engineers must be able to work with these APIs.
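One way this plays out in code: a sketch, using an invented JSON payload, of parsing a hypothetical CRM API response defensively so that a newly added field (like the Twitter handle above) does not break records produced by the older API version. Real code would fetch the payload over HTTPS with authentication rather than from a string.

```python
import json

# Invented payload for illustration: version 2 of a hypothetical CRM API,
# which added the twitter_handle field that version 1 lacked.
payload_v2 = '{"id": 42, "name": "Ada", "twitter_handle": "@ada"}'

record = json.loads(payload_v2)
customer = {
    "id": record["id"],
    "name": record["name"],
    # .get() returns None for records from the older API version,
    # so the pipeline keeps working across both versions.
    "twitter_handle": record.get("twitter_handle"),
}
print(customer["twitter_handle"])
```

Handling this kind of schema evolution gracefully, for every API a company depends on, is a routine part of the data engineer's job.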
Data engineers use specialized tools to work with data. Each system presents specific challenges. They must consider the way data is modeled, stored, secured, and encoded. These teams must also understand the most efficient ways to access and manipulate the data.
Data engineering thinks about the end-to-end process as “data pipelines.” Each pipeline has one or more sources and one or more destinations. Within the pipeline, data may undergo transformation, validation, enrichment, summarization, or other steps. Data engineers build these pipelines with a variety of technologies.
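The pipeline idea can be sketched as a chain of small stages. The stage functions below are hypothetical; a real pipeline would read from databases, files, or message queues and write to a warehouse or object store.

```python
def source():
    # In practice: read from a database, an API, or a message queue.
    yield {"customer_id": "c1", "amount": "40.0"}
    yield {"customer_id": None, "amount": "25.0"}  # a bad record

def validate(records):
    # Drop records that are missing required fields.
    return (r for r in records if r["customer_id"] is not None)

def transform(records):
    # Normalize types so downstream consumers see consistent data.
    for r in records:
        yield {"customer_id": r["customer_id"], "amount": float(r["amount"])}

def sink(records):
    # In practice: write to a data warehouse or object store.
    return list(records)

result = sink(transform(validate(source())))
print(result)
```

Composing the stages as generators means records stream through one at a time, which is how production pipeline frameworks scale the same idea to millions of records.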
New data technologies emerge frequently, often delivering significant performance, security, or other improvements that let data engineers do their jobs better. Many of these tools are licensed as open source software. Open source projects allow teams across companies to easily collaborate on software projects, and to use these projects with no commercial obligations. Since the early 2000s, many of the largest companies that specialize in data, such as Google and Facebook, have created critical data technologies that they have released to the public as open source projects.
Data Engineering and Data Science are complementary. Essentially, data engineering ensures that data scientists can look at data reliably and consistently.
Data engineers make data scientists more productive, allowing them to focus on what they do best: performing analysis. Without data engineering, data scientists spend the majority of their time preparing data for analysis.
With the right tools, data engineers can be significantly more productive. Dremio is a new kind of technology designed specifically for this role. It simplifies and accelerates access to data. Dremio helps data engineers become more strategic and innovative. Learn more about Dremio.