Data Engineers, Here’s What You’ve Been Waiting For!
February 11, 2022
Data engineers and data teams have been expected to do the impossible for the past several years. Data and analytics architectures have changed continuously and drastically over the past few decades.
Although these architectures have modernized from traditional data warehouses to data lakes and data lakehouses, the remnants of past technologies, such as complex data pipelines and the multiple data copies they produce, still haunt the data engineering world.
We understand! We spoke to data engineers from companies across different verticals and compiled a list of the top 10 challenges on their minds.
Data analytics is a strategic initiative and data teams are key players!
Information is a competitive weapon for most, if not all, companies. Over the years, organizations have increasingly competed on how effectively their information systems support better business decisions.
Data analytics has become a strategic imperative in all organizations, and has proven to be a particular competitive advantage in data-driven ones. Analytic and predictive models lie at the heart of everything an organization does and help in:
- Boosting operational efficiency
- Increasing cross-sell/up-sell
- Exploring new market opportunities
- Reducing customer attrition
- Understanding the competitive landscape
- Optimizing products and services
- Gaining an in-depth understanding of customers
- Improving client experiences
- Proactively addressing problems and mitigating risks
- Reducing organization-wide risk and overall costs
…and much more!
Even marginal improvements in the effectiveness of analytical systems can have a noticeable impact on the organization’s bottom line. And data engineers play a key role in designing, operating, and supporting the increasingly complex environments that power modern data analytics.
Open data lakehouse architectures are changing the landscape forever!
We understand that the data engineer's world has grown more complex as demand for data has increased exponentially. Our vision at Dremio is to make corporate data as easy to access and use as personal data. Dremio has changed the way many think about mission-critical BI and self-service data access directly on the data lake, simplifying the lives of data engineers and data teams. With the steady increase in data and the current innovations in analytical architectures, it is time to say SO LONG to proprietary data warehouses: open lakehouse architectures are changing the landscape forever!
Dremio's founder and Chief Product Officer, Tomer Shiran, has identified the following three key trends shaping the future of data infrastructure:
- Shifting toward open data architectures
- Making infrastructure easier
- Making data engineering and data management easier
Why have traditional and cloud data warehouses become a nightmare in the modern world of analytics?
Companies adopted cloud data warehouses because of the promise of elasticity and cost-effectiveness that wasn't possible with on-premises systems. Over time, these companies quickly realized that once they loaded their data into the data warehouse, they were completely locked into the warehouse vendor's ecosystem. Even worse, they were locked out of any other technology they could have used to derive more value from their data.
Besides inherent data warehousing problems, such as complex data pipelines and the plethora of unnecessary, expensive data copies proliferating across their landscapes, organizations have been extremely frustrated with the inflexible vendor lock-in model of data warehouses.
What does it mean to be open and why should you care?
Organizations today realize that openness is a key advantage of the cloud data lake/lakehouse over the data warehouse. To make themselves more flexible and future-ready, and to reclaim ownership over their data, organizations are rethinking their approach to delivering analytics for their organization and adopting open architectures.
Companies adopt modern cloud data lake/lakehouse architectures because of their numerous benefits, such as: 1) cost-effectiveness, 2) scalability, 3) choice, 4) democratization, and 5) flexibility.
First, let’s clear up some misinformation about being “open” that certain cloud data warehouse vendors are incorrectly spreading!
There are still some cloud data warehouse vendors making desperate attempts to defend their positions and closed architectures by spreading misinformation about being open! It is important to debunk these self-serving claims:
- Providing the ability to export a table from a closed data warehouse to an open format is NOT sufficient for eliminating vendor lock-in!
Some cloud data warehouse vendors allow data from their closed warehouses to be exported into open-format tables. But simply making data available by exporting it to an open table format does not constitute an open architecture! Teradata and Oracle, for example, allow tables to be exported to files, yet no one considers them open architectures. Being able to export a table to a file doesn’t eliminate vendor lock-in because data is constantly changing, and it is not realistic for an organization to eliminate its dependency on a proprietary data warehouse by shutting it down and copying every table into a new system or open tables.
- Closed data warehouse vendors spread misinformation that open data formats cannot evolve over time!
History shows that this is not true. Several open formats have risen and evolved in recent years. File formats such as Apache Parquet and in-memory formats such as Apache Arrow have evolved while maintaining openness and compatibility with a wide variety of engines. Table formats and metastores such as Apache Iceberg, Delta Lake, and Nessie have likewise evolved over the years and are widely embraced by a variety of engines.
- Another piece of misinformation from closed data warehouse vendors is that security and governance cannot be delivered when files are accessible!
The world’s largest and most highly regulated organizations use cloud data lakes. It is easy to restrict access to files in S3 and ADLS, and in practice, access is typically limited to the engines and services that require access. This claim by data warehouse vendors is ironic because most organizations store the same data in their cloud data lake even when using a data warehouse. Also, open table formats such as Apache Iceberg and Delta Lake provide an open source table-level abstraction on top of data files, making it trivial to apply policies at the table, column, and row level.
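As a sketch of what restricting file access looks like in practice, an S3 bucket policy can deny access to every principal except the IAM roles used by approved engines. The bucket name, account ID, and role names below are hypothetical:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowOnlyApprovedEngines",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::example-lakehouse-bucket",
        "arn:aws:s3:::example-lakehouse-bucket/*"
      ],
      "Condition": {
        "StringNotLike": {
          "aws:PrincipalArn": [
            "arn:aws:iam::123456789012:role/dremio-engine",
            "arn:aws:iam::123456789012:role/spark-etl"
          ]
        }
      }
    }
  ]
}
```

An explicit deny like this overrides any broader allow, so only the listed engine roles can touch the lake's files; finer-grained table-, column-, and row-level policies are then layered on top by the table format and query engines.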
So, what does it mean to be open?
Being “open” can be boiled down to three main features:
- Flexibility to use multiple best-of-breed services and engines on your company’s data.
There is no single vendor who can provide all the processing capabilities a company needs. Organizations have several use cases and requirements, and an open architecture allows them the flexibility to adopt the “best tool in the market for the job,” rather than being restricted to one vendor. For example, companies can use Dremio (best-of-breed SQL), Databricks (best-of-breed batch/ML), EMR, Athena, Redshift Spectrum, Presto, Dask, Flink, or whatever suits their needs best to process their data. This leads to higher productivity, especially for the data team, and lower cloud costs, because each engine is optimized for different workloads.
- No vendor lock-in.
Companies are stuck with hundreds of thousands (or even millions!) of tables and hundreds of complicated ingestion pipelines in their data warehouses, a direct consequence of warehouse architecture and design. This makes it nearly impossible for them to change platforms, even when they are convinced a much better tool exists for their needs. In contrast, if you’re using Dremio on your cloud data lake today, and tomorrow someone invents a better SQL engine, you will not be required to migrate your data to the new engine. You can simply start querying all your existing data with it. That is what being open means: no vendor lock-in!
- Ability to take advantage of tomorrow’s innovations.
What companies have discovered over the years is that when they are locked into specific vendors (such as Oracle, Teradata, etc.), those vendors gain the power to extort them financially. Even more frustrating is their inability to adopt new technologies and services as they come to market, because they have become prisoners of their own data, locked in with a single vendor!
To summarize:
- The unprecedented creation of and demand for data, combined with the residue of older technologies (vendor lock-in, messy data pipelines, and multiple expensive data copies), creates countless challenges for the data engineering world.
- Misinformation and self-serving claims spread by certain cloud data warehouse vendors need to be rectified so that data engineers can understand the true meaning of “open” data architectures and make effective decisions for their analytic systems, which are key to the success of almost every organization today.
- A truly “open” architecture does not lock you into a single vendor by imprisoning your data in their platform, and it gives you the flexibility to adopt new best-of-breed technologies as they become available in the market. No single vendor controls or owns your data!