Cloud data warehouse vendors lured companies with the promise of elasticity and cost-effectiveness that wasn’t possible with on-premises systems. However, companies quickly realized that once they load their data into the warehouse, they’re completely locked into the warehouse vendor’s ecosystem. More disturbingly, they’re locked out of any other technology they could use to derive more value from their data.
Companies now understand that openness is a key advantage of the cloud data lake/lakehouse over the data warehouse. They’re rethinking their approach to delivering analytics for their organization, and are looking to build an open architecture that allows them to be flexible and reclaim ownership over their data.
Data warehouse vendors like Snowflake are seeing this happen in the field, and are (unsurprisingly) scrambling to defend their position. Snowflake went as far as publishing an article titled “Choosing Open Wisely” — meaning, “Openness Doesn’t Matter.”
But what does it mean to be open, and why should companies care? I think it means three things:
- Flexibility to use multiple best-of-breed services and engines on your company’s data. You can use Dremio (best-of-breed SQL), Databricks (best-of-breed batch/ML), EMR, Athena, Redshift Spectrum, Presto, Dask, Flink, or whatever else you want to process the data. Companies have many use cases and needs, so being able to use the best tool for the job translates into higher productivity for the data team and lower cloud costs, because each engine is optimized for different workloads. And no single vendor can provide all the processing capabilities a company needs.
- Not being locked into a vendor. How many times has a Teradata customer told you they want to move off Teradata (many!), and how many times has one told you they’ve actually moved to a new platform (few!)? When you have 100,000 or a million tables in your data warehouse, and hundreds of complicated ingestion pipelines, changing platforms is nearly impossible. By contrast, if you’re using Dremio on your cloud data lake today, and tomorrow someone invents a better SQL engine, there’s no need to migrate your data; you can simply start querying all your existing data with the new system.
- Ability to take advantage of tomorrow’s innovations. Not being locked in is important because it prevents the vendor from extorting your company financially, which is what lock-in vendors like Oracle and Teradata (and soon Snowflake) do. But equally important is the ability, even if you still like your current vendor, to pick up and enjoy new technology. When a new machine learning service or a better batch processing engine comes along (Spark came after MapReduce; what will come after Spark?), wouldn’t you want to be able to use it?
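The “no migration” point in the second bullet above can be made concrete with a sketch. Assuming your tables already live as Parquet files in cloud object storage, adopting a new engine typically amounts to registering the existing files as a table; the data itself never moves. The table name, bucket path, and exact DDL below are hypothetical, and the syntax varies by engine (Dremio, Athena, Spark SQL, and others each have their own variant):

```sql
-- Hypothetical sketch: pointing a new SQL engine at data that already
-- lives in the lake. Bucket name, schema, and syntax are illustrative.
CREATE EXTERNAL TABLE sales (
  order_id   BIGINT,
  amount     DECIMAL(10, 2),
  order_date DATE
)
STORED AS PARQUET
LOCATION 's3://my-company-lake/warehouse/sales/';

-- Only the metadata registration above is new; queries run immediately
-- against the existing files.
SELECT order_date, SUM(amount)
FROM sales
GROUP BY order_date;
```

Compare this with a warehouse migration, where every table must be exported, reloaded, and revalidated before the first query can run.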
These are the true benefits of having the data in a data lake. That’s in addition, of course, to the scalability and availability of data lake storage (S3, ADLS), which makes data ingestion and ELT easier (because there’s no need to do much work before you put the data in the lake).
In their article “Choosing Open Wisely,” Snowflake argues that companies shouldn’t choose an open data architecture. Let’s take a closer look at some of their key points:
- If Snowflake can export a table to an open format, that’s sufficient to eliminate vendor lock-in; there’s no need to store the data in an open format and make it accessible. This is incorrect. Teradata and Oracle can export tables to files, and nobody would consider those systems open. Being able to export a table to a file doesn’t help because data is constantly changing, and in no real-world scenario could an organization move off a production data warehouse by shutting down one evening and copying every table into a new system.
- There’s no way to evolve data formats if they are open. This is incorrect. We’ve seen the rise and evolution of open formats in recent years. File formats like Apache Parquet and memory formats like Apache Arrow have evolved while maintaining openness and compatibility with a wide variety of engines. Table formats like Apache Iceberg and Delta Lake, and metastores like Nessie, have likewise been embraced by a variety of engines.
- There’s no way to deliver security or governance when files are accessible. This is incorrect. Cloud data lakes are already in use at the world’s largest and most regulated companies. It’s easy to restrict access to files in S3 and ADLS, and in practice access is typically limited to the services and engines that need it. This argument also ignores the fact that most companies, even when using a data warehouse, store the same data in their cloud data lake as well. Furthermore, table formats like Apache Iceberg and Delta Lake provide an open source table-level abstraction on top of the files, making it trivial to apply policies at the table, column and row level.
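The access-control point in the last bullet above can be illustrated with a minimal sketch. The bucket name, account ID, and role names below are hypothetical; the pattern shown, a bucket policy that denies S3 access to everyone except an approved list of engine roles, is a common way to limit a lake to specific services:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowOnlyApprovedEngines",
      "Effect": "Deny",
      "NotPrincipal": {
        "AWS": [
          "arn:aws:iam::123456789012:role/dremio-engine",
          "arn:aws:iam::123456789012:role/spark-etl"
        ]
      },
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::my-company-lake",
        "arn:aws:s3:::my-company-lake/*"
      ]
    }
  ]
}
```

File-level policies like this handle coarse-grained access; the finer-grained table, column and row policies mentioned above are layered on top by the table format and the engines that honor it.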
Snowflake and other data warehouse vendors will continue to make bold arguments against open architectures because they know that companies are increasingly able to do everything they can do with a data warehouse (and a lot more, of course!) with a cloud data lake/lakehouse. Application development has transitioned from monolithic to open architectures, and data analytics is making the same transition. Snowflake is now on the defensive — because they know it’s wise to choose open.