Jacques Nadeau:
One of the things I’ve heard you talk about, and as someone else who works in open source I think I’ve experienced this a lot as well, is the “hey, everybody has to re-implement the same algorithm” problem. Right? And I think that neither of us is a huge fan of that, because we know that there are constrained resources for getting these open source initiatives built. And so what do you think the big opportunities are in the near term, in terms of starting to reduce that at different levels? Like I think right now, Arrow in its current state is best at interchange. Would you agree with that?

Wes McKinney:
We’ve focused on interchange.

Jacques Nadeau:
Yeah, yeah, that was the goal of Phase 1. These are my internal phases, not project phases. But I think that we’ve done that well. Do you think that there’s been some work beyond that? For example, I think we have some algorithms for dictionary encoding things.

Wes McKinney:
Right. And so, hey, how do I get my data, if it’s in an Arrow format, into a dictionary-encoded Arrow format very efficiently?
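For a concrete sense of that operation, here is a minimal sketch using pyarrow’s Python API, which exposes dictionary encoding via `Array.dictionary_encode` (the example values are made up):

```python
import pyarrow as pa

# A plain, densely stored string array.
arr = pa.array(["apple", "banana", "apple", "apple", "banana"])

# Dictionary-encode it: each value is replaced by an integer index
# into a small dictionary of the distinct values.
encoded = arr.dictionary_encode()

print(encoded.type)        # dictionary<values=string, indices=int32, ordered=0>
print(encoded.dictionary)  # the distinct values: "apple", "banana"
print(encoded.indices)     # the indices: 0, 1, 0, 0, 1
```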
Jacques Nadeau:
Right. What do you think’s going to happen? As we try to add more processing stuff to Arrow, do you think the existing systems will start to adopt it, or do you think it’s going to be the newer technologies that adopt those things, and it takes longer for the existing systems? I don’t know if you’ve thought about that.

Wes McKinney:
Yeah. There’s a lot to unpack there. I think I’ll get to the last question at the end, but one of the things that really motivated me about the project, beyond defining the Arrow columnar format and coming up with a standard for interchange, is the re-implementation, the wheel-reinvention problem. One of the things that really drove that home was looking at what has developed in the Pandas project. That code base is nearly ten years old now, and when you think about it, we’ve essentially built our own in-memory query engine, we have our own CSV parser, we have the interfaces to databases, we have our own data display and presentation system for Jupyter notebooks and for the console.

The scope of the Pandas project, the code that we’ve developed and the code that we own, has become massive. It’s a couple hundred thousand lines of code. But if you go look at a database: I had some experience working with the Apache Impala team, the folks at Cloudera, and I looked at Impala, and it’s like, gosh, they’ve implemented a lot of the same things. They’ve got their own CSV reader, and they’ve got their own I/O subsystem and ways to access all the different places where data comes from. They have their own query engine, they have their own front end; Pandas has its own front end, which is Python code.

And so there’s all this energy lost to people re-implementing the same things over and over again, and I think it would be easy to naively say, well, let’s stop re-implementing CSV readers, and let’s stop re-implementing hash joins, and sorts, and all these things. But the problem is that all of these implementations go back to the memory: the implementations are specialized, and they have to be highly specialized to where the data lands after you read it out of the CSV file, after you read it out of the Parquet file.

So I’d like to see the community work together to solve these problems really well, and then have libraries that can be used in many different projects over a long period of time. And I think what will happen, and what’s actually already happening as far as who’s using these libraries, is that existing projects in some cases will use the Arrow libraries to get access to data, strictly use the Arrow ecosystem’s high performance data access layers, and then they pay the cost of converting from Arrow to whatever format they’re using. This is already the case with the Pandas community; a lot of Pandas users are reading Parquet files via Arrow. As we develop more processing libraries, I think these communities will slowly begin to take advantage of Arrow-native processing.

But it will be difficult to displace a whole ecosystem of working code. It’s like, if it ain’t broke, don’t fix it.
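As a concrete illustration of the access pattern described here, a minimal sketch with pyarrow: Arrow’s Parquet reader does the I/O, and the Arrow-to-pandas conversion cost is paid explicitly at the end (the file name is hypothetical).

```python
import pyarrow.parquet as pq

# High-performance, Arrow-native Parquet read.
table = pq.read_table("data.parquet")

# The conversion cost is paid here, going from Arrow's columnar
# memory to pandas' internal representation.
df = table.to_pandas()
```

This is essentially what happens under the hood when pandas users call `read_parquet` with the pyarrow engine.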
So if you have to pay this conversion cost to use Arrow, that conversion cost might outweigh the benefits of the faster Arrow processing.

But I think the really interesting thing long term, and you can think on a ten- or twenty-year horizon, will be next-generation data processing systems that are designed from day one to be Arrow native, and are able to focus on much higher-order concerns, in terms of query optimization and distributed computing, and not have to re-implement all these really basic things, so they can spend their time on the higher-order big data optimizations that end up taking a lot of time.
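As a sketch of what Arrow-native processing looks like today, pyarrow’s compute kernels can operate directly on Arrow memory, with no conversion to another format (the column names and values here are made up):

```python
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"fruit": ["apple", "banana", "apple"],
                  "qty": [3, 5, 2]})

# Aggregate without ever leaving the Arrow format.
total = pc.sum(table["qty"])            # Int64Scalar: 10
counts = pc.value_counts(table["fruit"])  # distinct values with counts

print(total)
print(counts)
```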