Open-Source Evolution: Spark, Kafka, and More

Host Eric Kavanagh interviews several open-source experts - including Dremio CMO Kelly Stirman - about the current state of open-source development.

Listen to the entire podcast over at DM Radio.

Transcript

Eric Kavanagh:

You've heard AM, you've heard FM. Now tune in to DM Radio, the world's longest-running show about data. Each week host Eric Kavanagh interviews the brightest minds in the world of information management. Want to be on the show? Send an email to info@dmradio.biz. Now here's your host, Eric Kavanagh. All right, ladies and gentlemen, hello and welcome back. Once again it is in fact time for DM Radio. Yes indeed. Just cruising through year 11 here on the longest-running show on data in the world. I'm quite sure of that; if I'm wrong, someone send me an email and tell me, or tweet with the hashtag of DM Radio. And folks, the topic for today is just fantastic. It's really one that's near and dear to my heart. I've been tracking it very closely for about 13 years now.

It was 2005 when I first dove into the world of open source. In fact, I was writing about the after-effects of Katrina, the hurricane that ravaged Mississippi and Louisiana, of course, and led to all sorts of chaos and mayhem. What happened is I was reading an article that said the two senators from Louisiana, Vitter and Landrieu, had asked Congress for a quarter of a trillion dollars. Yes, 250 billion dollars to rebuild southern Louisiana. Well, I love New Orleans and Louisiana, but I can tell you right now, as soon as I heard that number I was like, oh man, we're going to need transparency into federal spending or that money is going to disappear. Because there are some professional politicians in these parts that are very good at making money go away. So I went on this soapbox cruise for a while and did a bunch of email marketing and a bunch of media relations. I mean, I was promoting the whole concept of open source being key to getting transparency into federal spending.

And the idea is that, if you remember, back in the, I guess, 2003 and 2004 timeframe, that's when the whole Enron fiasco was playing out and Sarbanes-Oxley came out demanding new standards for how boards operate and for transparency into the inner workings of businesses and processes and so forth. And I thought to myself, if corporate America has to do this stuff, why doesn't the government have to do it? And so I did a lot of research into open source and found the Apache Software Foundation. Of course, their first main project was the Apache Web Server, and back then in 2005 it had already grown to the point where something like 50% of all the sites were hosted on the Apache Web Server. And I thought to myself, that is fantastic information, right? So I got on the soapbox. Crazy things happened. I got in touch with a handful of people; really only like five or six journalists paid attention. One of them was Jonathan Alter from Newsweek. But another was a guy who worked at the Heritage Foundation, and he picked it up and ran with the concept of citizen auditors and went on to testify before Congress sometime after that.

And lo and behold, I guess it was 2006, the Federal Funding Accountability and Transparency Act hit, and now we actually do have some transparency into federal spending. How crazy is that? Good things happen, especially if you promote open source. That's what we're doing today on our show, and we have several great guests lined up for you. We're going to hear from Louis Bajuk of TIBCO. We also have Mark Shainman of Teradata and Kelly Stirman of Dremio on the program. So let's bring in our first guest, Louis Bajuk of TIBCO. Welcome back to DM Radio.

Louis Bajuk-Yor:

Thanks, Eric. Good to be here.

Eric Kavanagh:

Sure. So you're very involved in the open source movement. There's a programming language that's very popular these days, R. There's also another one called Python which does some similar things, but R seems to be the programming language of choice for data science and data scientists. And you're pretty active in that community, right? Why don't you tell us what you're doing for R and why that's important for the open source movement and for enterprise software?

Louis Bajuk-Yor:

Sure thing. So TIBCO has long integrated with R as part of our analytics and data science products, and as part of that, we've developed our own platform for the R language. As you said, R is the language of choice for data scientists. There's a million people out there using it for just the kind of projects that you were talking about, transparency. One of the things that really has made R popular is that anybody in the world, anywhere, can download it and use it and start doing data science. So you see it used a lot in different governmental organizations, especially in some of the less developed countries where, perhaps over the last 10 years, they have just kind of gotten started in data science and transparency, and it's used a lot there.

In terms of what we're doing, in addition to leveraging R and integrating with R in our products to make it easier for more people to use it, and providing an enterprise platform, our TERR engine, for running it, we're also founding members of the R Consortium. The R Consortium is a group of organizations, both vendors like ourselves who develop products around R and companies that use R heavily in their business, where we band together and provide funding to this organization. And this organization, in turn, distributes this funding to projects that promote both the technical and the people infrastructure of the R community. For example, we sponsor local user meetups around the world for R. We also help sponsor a worldwide organization called R-Ladies that's been around for a couple of years now and has had just phenomenal growth around promoting women in data science. And that really is an amazing worldwide organization.

Eric Kavanagh:

No, that's great stuff. And R, the fact that it's open source, the fact that anyone can download it, that is obviously what helped lead it to the forefront in terms of programming languages for data science, right? Because if you can get something for free, that makes it a lot more accessible to the world at large. And I think that's a big part of why the R community's become so prominent, right?

Louis Bajuk-Yor:

I mean, that's a huge part of it. It's both that anybody can download it and use it, and another thing that's really driven the development of R is that there was a frustration, say 20 years ago when R was just starting up, with some of the legacy data science companies like SAS being slow to respond to customer requests. And so with the R environment, the other thing that really helped develop that is not only is it free to download, but you have thousands of people out there who say, okay, I want to be able to develop this analysis, or I've created this analysis for my thesis work or for my business, and I want to share it with others. And due to that ability for everyone to contribute to it, there are now over, I think, 12,000 libraries or packages in the R community available on the central repository called CRAN. So there are just thousands and thousands of people out there who develop their own add-ons to R, their own extensions to it, and share those freely.

Eric Kavanagh:

And there's a real collaborative spirit too, right? I think that's one thing that makes open source so powerful: you do have these critical masses of developers who have that communal spirit, right? They're not trying to hide what they do or keep it proprietary. They want to share that, and so that spirit, I think, really pervades the software development movement now, and it's why open source is such a fundamental component of enterprise software today, right?

Louis Bajuk-Yor:

Absolutely. I mean, I think from the enterprise software point of view there are two things that really make open source valuable. One is what we touched on, the low cost of acquisition. It's easy for someone to download it and start using it right away for free and start getting exposed to the technology. And the other one is this collaborative spirit. It's sharing the burden of developing software, because developing software is hard. It's a lot of work, and developing quality software is even harder. And so one of the big things around open source software, whether it's in data science, whether it's in messaging, which is also a big part of what TIBCO does, or whether it's working with big data, is that in all these areas we can share the burden of developing these important frameworks that we're all going to use.

Eric Kavanagh:

Yes, that's exactly right. Collaboration is key. And like you say, software development is hard, especially if you want to create good stuff. One of my favorite quotes about open source as regards quality of software development is that bad code tends to go away. I mean, that's not always the case, obviously, but because you have so much visibility, because you have so much collaboration, that really helps drive out bugs and drive out bad code, right?

Louis Bajuk-Yor:

Absolutely. And you have many eyes. When you're developing proprietary software, I mean, a good software company will have things like code reviews, where you might have at least one other engineer sitting down and reviewing someone's code. With an open source project you might have hundreds of other engineers who are looking at your code and saying, I could do that better. I could optimize that. I could improve it. I could perhaps implement it in a different way that's going to be more efficient.

Eric Kavanagh:

That's a great point. That's a great point. And you know, there is some downside ... I'd be curious to hear your take on this. There is an aspect of open source development, well, I guess two angles, that I'll throw to you just for your perspective. One is the good-enough, just-get-it-out-the-door side of the equation, meaning it reaches a certain point and it doesn't get too much further, in part because there's not so much financial reward involved. And the other is just the sort of political fracases that occur in different movements, where you will have just good old-fashioned politics come into play and scuttle good ideas or move forward bad ideas. What do you think about the politics, and what do you think about that just-good-enough aspect?

Louis Bajuk-Yor:

Well, the just-good-enough aspect, I mean, I think it's always a challenge. Sometimes an organization will just push out some open source software to say that they've done it and not really support it moving forward, and that's called dumping on the open source community. And that's not being a good contributor. That's not being a good citizen of the open source community. And so certainly, in general, if something gets put out and there's no one there to pick it up, there's no one to keep going with it, then it can just get to a certain level of maturity and not get any farther. I was at the Linux Foundation Open Source Leadership Summit back in February. I went to a bunch of talks from different organizations on how they decided to open source something and how they built a community around it.

And it's fascinating because ... I hadn't realized just how much work is required to really build an open source community. If you want your project to be successful, if you want it to continue moving forward and develop and mature, you really need to invest in that people side of things, which touches on your second question, the politics. As the maintainer of an open source project, it's critical that you encourage contributions and that you help guide the project. One of the most interesting talks I went to was a talk titled Lead Open Source Projects Don't Control Them. And that's an interesting thought, because it's this kind of balance: how do you deal with the fact that people are people, and there will be politics and there will be hurt feelings and there will be pettiness because we're human, while balancing that with that kind of vibrancy, that energy that really helps drive things forward?

Eric Kavanagh:

That's a good point. Well, let's bring in our next guest here we've got Mark Shainman dialing in from Teradata. Welcome to DM Radio and tell us a bit about what you folks are doing for open source.

Mark Shainman:

Hi, Eric. Well, thank you. Actually, Teradata embraced open source a number of years ago. And we actually have leveraged a number of open source solutions as part of our overall unified data architecture, from Spark to Kafka to R as a programming language that can be leveraged. We actually were a major contributor to the open source Presto solution set, and we also open sourced a data lake management software solution last year called Kylo, which enables organizations to easily ingest and manage data into their data lake and leverages under the covers a number of open source solutions such as NiFi and Spark as well. So we're fully embracing numerous open source solutions in our infrastructure where they can add value.

Eric Kavanagh:

And Apache NiFi. Let's talk about these different projects, right, because the way it works, at least with the Apache Software Foundation, and obviously there are other developers outside of Apache, but Apache has become a sort of de facto leader of the movement, is that you have projects that people commit to. You have committers who work on projects. And Apache NiFi is a pretty interesting one. Can you give us just an overview of what NiFi is? And then you also mentioned Spark, which obviously is one of the most popular projects out there. Tell us a bit about each of those and what they do.

Mark Shainman:

Okay, sure. With Apache NiFi, a lot of the governance, not governance, but a lot of the engineering around NiFi is really done by Hortonworks. And what NiFi is, it's really an engine that allows you to set up pipelines. It has a connector architecture associated with it that allows you to connect to data sources, bring that data in and do multiple things to it, including splits. I would think of it almost in the manner of an ETL process, but different: it can grab data from multiple sources, it has multiple connectors, and it can bring that data through multiple pipeline stages, doing things such as normalization of the data, and then land it into, let's say you have a Hadoop environment, Hive tables. So it's very powerful in that. Traditionally we always had third-party vendors' proprietary software that was needed to basically connect to multiple data sources and ingest data. And now there's an open source solution out there that enables organizations to leverage that, especially for their data lakes.
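
As a rough illustration of the flow described here (connect to several sources, pass records through a normalization stage, land them in one destination), the following is a minimal sketch in plain Python. It does not use NiFi or its API; the sample data, the function names, and the in-memory list standing in for a Hive table are all assumptions made for the example.

```python
import csv
import io
import json

# Hypothetical inputs standing in for two different data sources.
CSV_SOURCE = "name,city\n Ada ,LONDON\nGrace,new york\n"
JSON_SOURCE = '[{"name": "Linus", "city": " Helsinki "}]'


def read_csv(text):
    """Connector 1: yield one dict per CSV row."""
    yield from csv.DictReader(io.StringIO(text))


def read_json(text):
    """Connector 2: yield one dict per element of a JSON array."""
    yield from json.loads(text)


def normalize(records):
    """Pipeline stage: trim whitespace and apply consistent casing."""
    for record in records:
        yield {
            "name": record["name"].strip(),
            "city": record["city"].strip().title(),
        }


def land(records, table):
    """Final stage: append the records to the destination table."""
    for record in records:
        table.append(record)
    return table


def run_pipeline():
    """Pull from every source, normalize, and land into one table."""
    table = []  # stands in for a Hive table or similar destination
    for source in (read_csv(CSV_SOURCE), read_json(JSON_SOURCE)):
        land(normalize(source), table)
    return table


if __name__ == "__main__":
    for row in run_pipeline():
        print(row)
```

Each stage is a generator, so records stream through one at a time rather than being staged in full between steps, which is the general shape of the pipeline idea rather than a claim about how NiFi is implemented.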

Eric Kavanagh:

This is cool. Thank-

Mark Shainman:

And then I guess the second one you asked-

Eric Kavanagh:

Go ahead. Yes, go ahead. Spark.

Mark Shainman:

Spark is a solution that's leveraged widely by data scientists for computational work, for doing deep analytics within their environment, and it has a huge community behind it. Then there's also a component called Spark SQL, where you can actually leverage the standard SQL query language to access data, bring it in and compute on it within your Spark environment as well. But Spark is what I would say is one of the analytical projects that really has a lot of legs, and a lot of organizations are leveraging it within their product sets.

Eric Kavanagh:

And really, I think one of the keys for what triggered open source to become such a powerful force was the desire to avoid vendor lock-in, right? That's kind of the old mentality of closed source software. And just for those who don't know what we're talking about here, a closed source software solution is where the source code written to create the application is not revealed. You use a compiler to make it ready for various platforms, like the Macintosh operating system or the PC Windows environment or iOS or whatever the case may be. And open source is where you can see the source code, right? So it seems to me open source is now so prevalent that it's really not likely to go away. I mean, you still do have a lot of closed source software out there. AWS is pretty tight-lipped about how they do their stuff. What do you think about that battle between open source and closed source? Do you think that open source has won the day, or what do you think?

Mark Shainman:

I really think that both of them have value and can coexist within customers' infrastructures. One thing to understand is that developing software is expensive, and so in many cases you might have a piece of open source software that does something that's infrastructure based, let's say ingesting data or doing data streaming. But then a vendor can actually add higher value to that open source solution, maybe by creating a self-service wrapper around it and putting in the engineering dollars for something that might take the open source community years to develop, and then make that available to the marketplace. So it's not saying that you have to have one or the other.

Organizations can have both coexisting within their infrastructure. For example, at Teradata, of course, we have the Teradata database solution, which is our proprietary software solution, a highly performing solution set that organizations leverage to analyze data within their infrastructure. But then in many cases that coexists with a data lake which is on a Hadoop solution, maybe an open source Hadoop solution, and leverages open source solutions such as Presto or Hive on top of it. An organization doesn't have to have just one or the other. They can coexist in an environment, and organizations can get benefit out of both of them. So there are points where proprietary software has value, and there's higher value from the engineering and cost that goes into that proprietary software. And then there's also value in the open source solutions that are out there and the community that can support them and drive them as well.

Eric Kavanagh:

No, it's a very good point. And it seems to me open source really has enabled this whole new movement of essentially standing on the shoulders of giants, because you have so much durable open source software out there forming the foundation of new applications, of scalable applications. I think that's really helping to spur innovation and keep everyone doing really cool stuff. And folks, we're going to continue this conversation right after our break. We've been talking to Louis Bajuk of TIBCO and Mark Shainman of Teradata. Next up, Kelly Stirman of Dremio. We'll be right back, folks.

Eric Kavanagh:

All right folks, fascinating indeed, talking all about the open source movement, saving us from the closed source movement, at least to a certain degree. I think it's just fantastic stuff. And next up we have Kelly Stirman of Dremio, and you guys have a new product release. I saw it come across the transom: Dremio 2.0. Kelly, welcome to DM Radio, and tell us a bit about what you folks are doing in the world of open source.

Kelly Stirman:

Hey, great, thanks for having us. Thanks for having me. Yes, at Dremio ... So TIBCO and Teradata are both very well-known companies that have been around a long time. Dremio is a new, young company. We came out of stealth last year, and we're introducing a new kind of product that really hasn't existed before that we call self-service data. And this is about the idea, I like to use this analogy, that most companies have thousands of what you might call data consumers in their business: users of data science tools like R and Python, which we talked a little bit about earlier, but also tools like Tableau and Power BI and other BI tools. And they need data to do their jobs, whatever that job happens to be. And the data is in systems like Teradata and data lakes and databases and MongoDB and Elasticsearch and in the cloud; it's all over the place. And Dremio is a new product that sits between the different tools and the different repositories of data and gives you accelerated access, gives you virtualized access to the data, so you're not making copy after copy, and wraps the whole thing in a self-service model so that the data consumer can do things for themselves instead of standing in the data breadline waiting for a data engineer to provision data for their particular need. That's what Dremio is all about. And it's open source.

Eric Kavanagh:

And so I love this whole concept of self-service, right, because it really is representative of what I would call a massive inflection point that we're going through, not just in the data world or the business world, but in life in general, and certainly in news media and all sorts of other things. And it kind of goes like this. We used to live in a decidedly push model. I'll just talk about media for a second, meaning there were a handful of media people who would push out to the general public what is news. Now it's much more of a pull model, where people go to Google and they just Google stuff and they pull something in to find out what's going on. And the same is true for data, because, let's say 20 years ago in the world of data warehousing, there was the data warehouse, and obviously it would pull from all these different systems, but it got pushed out to key stakeholders who were part of the inner team, if you will, and had the political clout to get access to that stuff.

Whereas now once again we're seeing this strong pull from the outside of users who don't want to have to navigate through politics for example who don't want to have to navigate through whatever situation may exist in their organization. They just want to get data themselves right. So I think that we're finally at a stage now where the scalability of these solutions, the power of the processors, the thickness of the pipes through which the data goes are all such that we can truly enable self-service data analysis. And to me, that's just great news for the business. But Kelly what do you think about all that?

Kelly Stirman:

It's an incredibly important point. And I think we all experience this with whoever you work for, whether you work for a bank or a pharmaceutical company or a software company. You have this funny disparity. In your personal life, data is your best friend. Like, everything you want to know, Google can answer in under a second. There's an app for pretty much anything you'd ever want to do. And then you get to work, and your experience with data is nothing like that. You are shackled, you are blocked, you're encumbered, you're unable to get access to the data that you want. And so what you end up doing is getting in the data breadline, and you're holding your little ticket with your number, hoping your number gets called by the data engineer who's going to give you what you need, and you end up waiting a long time.

And that's an enormous unrealized value in these very essential workers that are critical to every business function, who aren't doing their jobs because they're waiting to get the data they need to do their jobs effectively. And so with this idea of self-service, you actually brought up two really important points. One is that it suits the data consumer: the user of R and Python, the user of the BI tool who wants to be more effective and efficient. But that idea of self-service is also the fundamental paradigm around the success of open source software. Traditionally a technology decision was made at the top of an organization and then pushed down, mandated down into the business, and with open source you don't have those barriers. You have a pull model, as you said, where a software engineer can go and take a piece of open source software and stand it up without this bureaucracy and without having something mandated down through the company. There's actually two really important points at the same time with this pull and push model that you brought up.

Eric Kavanagh:

And it really is good news for the business, it seems to me, because it has a balancing effect and it also creates a positive level of tension. I remember it was probably ... Well, in fact, I can tell you almost exactly, it was back in 2004. I was working with a Canadian business consultant, a sales and marketing expert, and he was talking about the right level of tension in an organization. And here we'll talk about tension between, let's say, the end users, the people who are running lines of business, and IT, but also the sort of data team in the center of the organization. He says, "You don't want too little tension and you don't want too much tension." Because with too little, things get slack. They get lax. Maybe mistakes happen. With too much, people get frayed around the edges.

They get upset. You have battles and so forth. But the right amount of tension kind of keeps everybody talking to each other and everything moving smoothly. And I think that tension pulling out from the ends of the business, from the business people out in the field or the line-of-business managers, that's very good for keeping a natural tension and a solid line of communication to the people who are still the engineers, who are still responsible for delivering solutions that can give the data to the people who need it, right? So I think it's a very positive thing overall that we have this pulling from the outside now, and it is, to your point, in large part due to open source and due to solutions like what you guys are building, right?

Kelly Stirman:

Yes. The tension is an interesting topic, because one thing you might think from an idea like this is that, oh, the data engineers or the central IT function go away. We used to have operators for elevators. Well, once we put buttons in, you didn't need the elevator operator to get to the floor you were going to. We used to have travel agents. When's the last time you talked to a travel agent instead of just booking your own travel? I don't think the same is true in IT. I think that there's still a lot of complexity, a lot of need for governance, a lot of need for security and basic efficiency around the centralized functions in a business that are going to be with us for many, many years to come. But what we have to do is to make that function more efficient. And part of making that function more efficient is making big pieces of it more of a self-service model.

Eric Kavanagh:

Yes. That's right. I mean, you want self-service, but you want governed self-service, to your point. One of the cool concepts I've heard from folks in that space is the idea of guardrails, right? So you want to bake into the architecture of your solution some barriers around which people cannot go, to ensure that they don't get the wrong data sets. For example, we talk a lot about personally identifiable information, PII data, or just other data sets that may not be appropriate. So you still need your engineers and your IT team to create the proper infrastructure which enables self-service, and you want your business people, your stewards primarily, to be in control of the governance side, the curation side of that, right?

Kelly Stirman:

Yes. That's actually really up to them. So what would a guardrail look like for sensitive data? Well, the guardrail would be the central IT function, the engineering group, saying, okay, these are identified pieces of sensitive information, and we have established certain groups in our central security controls that should be able to see, for example, the Social Security Number of an employee. And maybe that's a privileged group in HR, versus everyone else, who should either not see that or see a masked representation where you have access to only the last four digits of the Social Security Number. And so the guardrail is putting that policy in place and ensuring that the different systems respect that policy, but then making it so that people can access the data and they don't accidentally get exposed to sensitive data; they only see the appropriate data at all times.
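
The masking policy described here can be sketched in a few lines of Python. This is only an illustration of the idea, not how Dremio or any particular security system implements it; the group name `hr_privileged`, the record layout, and the function names are hypothetical.

```python
# Hypothetical name for the privileged HR group from the example above.
PRIVILEGED_GROUPS = {"hr_privileged"}


def mask_ssn(ssn):
    """Return a masked form that shows only the last four digits."""
    digits = [ch for ch in ssn if ch.isdigit()]
    return "***-**-" + "".join(digits[-4:])


def view_employee(record, user_groups):
    """Return the record as a user in the given groups may see it."""
    visible = dict(record)  # copy so the stored record is never altered
    if not PRIVILEGED_GROUPS & set(user_groups):
        visible["ssn"] = mask_ssn(record["ssn"])
    return visible


if __name__ == "__main__":
    employee = {"name": "Pat", "ssn": "123-45-6789"}
    print(view_employee(employee, ["hr_privileged"]))
    print(view_employee(employee, ["analysts"]))
```

The point of putting the check in one function is the one made in the conversation: the policy is defined once, centrally, and every access path goes through it, so a self-service user never has to know the rule exists.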

Eric Kavanagh:

Well, you guys are doing something I think which is really a bellwether of where things are going in this industry and that is accessing data where it lives. We spent so many years moving data around and it's hard to do that. It takes time, it takes energy. We referred earlier I think Mark was referencing ETL that's the Extract Transform Load part of the process where you literally extract data from one system transform it into some suitable format and then load it into another system. What you really want to do is leave data where it sits and be able to analyze it in some sort of virtual layer right. Because when you move data, number one it's expensive. It's error-prone. It's brittle. A lot of times over the years big companies have developed such an incredible matrix of ETL processes that there's really no visibility into what's going where at a strategic level. So ideally you want to leave the data where it sits and be able to access it only as needed right?

Kelly Stirman:

Yes, absolutely. So you brought up several things: the cost, the brittleness. It's also the case that the copies create problems unto themselves. Copies can conflict with each other. Copies create governance challenges that are very, very hard to keep in control and keep in check. And so one of our core tenets is that everyone should be able to get exactly the slice of data that they want, the filtered representation, the shape of data that they want, from the data lake, from the data warehouse, from systems A to Z. But you can't make sure everybody gets what they need by giving everybody a copy, because then you end up with thousands of copies of data. We think the right way to do this is through a virtualized access mechanism where everyone can get exactly what they want but there's no copy of the data involved. And that lets the data consumer get exactly what they want and need without adding the burden of thousands of copies of data to central IT.
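
A familiar small-scale analogue of "a filtered slice without a copy" is a database view. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in; it is not Dremio's mechanism, and the table, view, and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "east", 10.0), (2, "west", 20.0), (3, "east", 5.0)],
)

# The view stores only the query, not a second copy of the rows.
conn.execute(
    "CREATE VIEW east_orders AS "
    "SELECT id, amount FROM orders WHERE region = 'east'"
)


def east_total(connection):
    """Sum the amounts visible through the east_orders view."""
    return connection.execute(
        "SELECT SUM(amount) FROM east_orders"
    ).fetchone()[0]


before = east_total(conn)
# A change to the one underlying table is visible through the view
# immediately, because there is no separate copy to fall out of date.
conn.execute("INSERT INTO orders VALUES (4, 'east', 7.5)")
after = east_total(conn)

if __name__ == "__main__":
    print(before, after)
```

Because the consumer's slice is defined as a query over the source rather than as extracted rows, there is nothing to conflict with the source and nothing extra for central IT to govern.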

Eric Kavanagh:

No, that's exactly right. And I think we could probably just go ahead and blend right into the roundtable now, and let's hear from our other guests. I really do love this concept of accessing data where it lives. Obviously, it doesn't make sense in all use cases. But maybe, Louis, I'll bring you back into the equation here, and thanks for your tweets, by the way. Everyone, tweet with the hashtag of DM Radio. I think that this is emblematic of the new way of doing things: ideally you're going to let data sit where it is and you'll pull only what's necessary to do your analysis. But Louis, what do you think about that?

Louis Bajuk-Yor:

Oh, absolutely. I mean, that's the direction that things are clearly going in. There's still a place for in-memory analytics, because when you're prototyping, when you're working with data, when you're pulling data from your Excel spreadsheets and all that, there's certainly a role for that. But in general, especially with all the data that we're collecting now and the big data systems that are out there, things like Spark, leaving the data in the data source is critical. I mentioned before the show that we had recently acquired Alpine Data Labs, a great product that sits on top of big data sources like Spark. And basically Alpine provides a way of dragging and dropping and creating the data pipeline steps you want to do. But that's just a visual representation. In that case, all the actual work is getting pushed down into Spark or into the database where the data sits, so that you get the best of both worlds. You've got the scalability of Spark without moving the data, in a much easier UI than programming [inaudible 00:35:30] very critical. There's a lot of people out there doing very interesting projects around keeping the data in the database.

Eric Kavanagh:

I'm just looking at all these acquisitions you guys have made: Jaspersoft, Alpine Data Labs, of course Composite Software just recently. TIBCO is really rounding out its stack, and I think that's again another straw in the wind about where things are going these days. You're seeing some of the venerable vendors out there really flesh out their offerings to be able to compete pound for pound across the whole data landscape, right?

Louis Bajuk-Yor:

Absolutely. And as one of the other speakers was saying a moment ago, there is definitely a place there for both proprietary and open source software. And what we're striving for is a blend there, so that we can provide a single platform where architects and developers and users can access a variety of capabilities, whether it's proprietary closed-source software or open source software, in a single platform. For example, messaging is a core part of what TIBCO has done for over 20 years. Major organizations like airlines and FedEx use TIBCO messaging as the backbone for their businesses. And we announced a couple of days ago that TIBCO messaging is going to be tightly integrating with Kafka, so that our customers will be able to use Kafka and MQTT, another protocol, and TIBCO messaging all within a single platform to serve the variety of different use cases that they want to tackle.

Eric Kavanagh:

I'm glad you brought up Kafka since we have it in the title of our show. We should talk about it a bit. That of course came out of LinkedIn and was the message bus basically that ran LinkedIn, and it has now been open sourced and has really had a tremendous amount of uptake across the industry, right? With Kafka Streams enabling you to access just really powerful streams of data and then mix and match and use it to kind of supercharge your information systems, right Louis?

Louis Bajuk-Yor:

Absolutely. I mean there's a tremendous amount of energy out there in the Kafka community. As you said, it came out of LinkedIn, so it certainly was built from the beginning to be very scalable, and now we've seen a lot of adoption out there. We've got a couple of offerings in the streaming space, both messaging and StreamBase, and both of those are integrating with Kafka because we see that energy there.

Eric Kavanagh:

And that really is one of the keys, I think, to open source success; it's the key in any industry, really. If you think back to the late 80s and into the early and mid-90s, well, Microsoft was the king, right? Microsoft reigned supreme. Microsoft was the standard operating system. Everybody wrote for Microsoft. And then of course what happened, and we've talked about it before on the show, is Linux came along and IBM was tired of getting the rug pulled out from underneath them. And so they invested a billion dollars building up Linux as a strong, sturdy, durable operating system. And wow, the world changed from that point forward. Thanks to Big Blue. Thanks to Linus Torvalds of Linux. And we'll keep this conversation going right after the break. Folks, stand by. Don't touch that dial. You're listening to DM Radio.

Eric Kavanagh:

The software development world, software development lifecycle, I think for the better. At the end of the day, and we were just talking about the whole pull versus push and the concept of leaving data where it lives as opposed to moving data somewhere like into a warehouse. So let me talk about this whole movement, and I would have to say, Mark Shainman, you made a good point there: really it's an architectural decision these days, right? It's not necessarily that you go federation versus traditional movement. You need to make a decision about what makes sense for your organization, and Dave Wills talks a lot about that. He talks about the importance of the proper architecture for your information. Tell us a bit about that.

Mark Shainman:

I mean one of the biggest things organizations have to realize is you have to first look at what are the business drivers? What is the problem you are trying to solve? And before you jump down into technology, which a lot of people like to do ("Oh, I want this new piece of open source technology. There's a lot of buzz around it."), you have to look at what your business needs are and then look at what architectural decisions you need to make to achieve those business needs. And you have to look at things like: what are my SLAs? So if I have very, very tight SLAs and some large number of concurrent users, then there is a value and benefit in a business decision to co-locate that data on a single platform versus trying to federate it.

Or if it is, "Hey, I have archived data that exists in another platform and I have data that exists within my main database or data warehouse," then let the data lay where it lies. Maybe it's in my Hadoop environment or in my traditional data warehouse, and I can query across those. Teradata has a solution by the name of QueryGrid that enables that kind of fabric to connect, so you can actually do that analysis across platforms and then bring the final result set back and join it and process it in the initial execution engine. But before organizations say, "Oh, I'm going to put in place a federated solution, I'm going to put in place something like a data fabric, like QueryGrid, or I'm going to co-locate the data," you have to ask what the business driver is and what the best architectural decision is to meet that business need.
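
The idea of querying across platforms while the data stays put can be illustrated with a toy example. This sketch is not QueryGrid and says nothing about how it works; it assumes two separate SQLite databases as stand-ins for a main warehouse and an archive platform, joined through SQLite's `ATTACH DATABASE`. The table and column names are hypothetical.

```python
import sqlite3

# Assumption: "main" stands in for the primary warehouse and "archive"
# for a second platform that holds older data. Both stay where they are.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS archive")

conn.execute("CREATE TABLE main.customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE archive.old_orders (customer_id INTEGER, amount REAL)")

conn.executemany(
    "INSERT INTO main.customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Bo")],
)
conn.executemany(
    "INSERT INTO archive.old_orders VALUES (?, ?)",
    [(1, 10.0), (1, 20.0), (2, 5.0)],
)

# One query spans both stores; only the joined result comes back
# to the engine that initiated the query.
rows = conn.execute(
    """
    SELECT c.name, SUM(o.amount)
    FROM main.customers AS c
    JOIN archive.old_orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()

print(rows)
```

Whether this pattern or co-location is the better fit still depends on the business driver Mark describes, such as how tight the service levels are and how many users query concurrently.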

Eric Kavanagh:

Now that's an excellent point, and I think the real key is to have that strategic view and, to your point, Mark, to understand the business case, right? This kind of gets us back to the topic of tension. Maybe, Louis, I'll bring you back in here. You want there to be the optimal level of tension between the business, for example, and IT and the developers and so forth, and you want to be able, from a senior executive perspective, to understand the big picture and know when and where to invest certain amounts of money to create a new federated environment. I do think that there are going to be certain movements, like we say, and I think we are moving more and more towards an environment where you are going to want to federate, but there's a long tail, I suppose is the key, to legacy systems and approaches, right? And it can be very dangerous to move too quickly to capture a new trend or to do things in a new way. You always want to make sure you're catering to the business users and you're doing what at least the critical people in your organization want you to do, right?

Louis Bajuk-Yor:

Absolutely. And I think that's part of where self-service and open source come in, in that you have that creative tension. And part of what drives that forward is that open source software, as well as products like Spotfire where you can do self-service analytics for a business user, allows individuals to try new things without too much control. It allows them to try new things and introduce new ways of looking at the data, new ways of moving data back and forth, and new ways of connecting processes. And so, on the one hand, you might have a centralized IT organization or a centralized business unit driving requirements from the top down. But self-service, whether it's open source or not, really allows new ideas to come into the system, and you get a survival of the fittest. If you've got multiple different things going on and multiple different approaches, some of those are going to ideally flow to the top in that creative tension that you're talking about, so that the organization can say, "Oh, actually this is a better way of doing it than I'm doing it now. Let me incorporate that, or at least incorporate the ideas of that, into our systems."

Eric Kavanagh:

And then you get right back to the value of collaboration and Kelly I'll throw this one over to you. To me, that's so critical that you have people working together talking to each other especially across departments. For example, managers talking to each other. You don't want just that top down or bottom up thread of communication. You want threads to be multi-dimensional. You want to hear from your partner, you want to hear from people in other departments of your organization how they do things because that's when you learn stuff. That's when you find out hey actually there's a shortcut to get this particular job done. This is how Judy does it in accounting. Wow, that's great. You want people sharing their perspectives and sharing their processes how they do things. That's why collaboration is so important right Kelly?

Kelly Stirman:

Oh yes. I mean there's many shoulders of giants that we can each be standing on instead of trying to pick ourselves up. And there are those peers in other departments that we stand to benefit from. But I actually think what you just said applies to open source in some interesting ways. Typically when we think about open source we think about one particular project and a group of people coming together to build that project. But increasingly there are some projects that span projects, and one that's very relevant to our conversation today is a project called Apache Arrow. And so Arrow is about this fact that you have different consumers of different tools like R and Python and Spotfire and Tableau and all these things that we've mentioned today who are all looking to access the same data. And traditionally there has been no standard for representation of that data for in-memory computing.

And Arrow is a standard that was created about two years ago so that all of these different processes, whether it's Spark or Python or a SQL execution engine or a BI tool, could consume the same data structures directly in memory instead of wasting 60 to 80% of the CPU converting the data between different structures. Arrow is a way for different projects to benefit enormously in terms of CPU and GPU efficiency to drive better experiences for users in how they're using the data. And there's greater efficiency of infrastructure overall. So it's a very exciting project that's been adopted in Spark and Dremio and a number of other open source projects, and it is the new standard for the way people are organizing data in memory.
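
The benefit Kelly describes, several consumers reading one in-memory layout instead of each converting it, can be shown in miniature. This is a loose analogy only: it does not use Arrow or its format, but Python's standard-library `array` and `memoryview` (the buffer protocol) to show two readers sharing one contiguous column without copying. The variable names are hypothetical.

```python
from array import array

# One contiguous column of 64-bit floats, laid out once in memory.
column = array("d", [1.5, 2.5, 4.0])

# Two hypothetical consumers each take a zero-copy view of the same buffer.
view_a = memoryview(column)
view_b = memoryview(column)

# A change in the underlying buffer is visible to both views,
# because neither one holds its own copy.
column[0] = 10.0
shared = (view_a[0], view_b[0])

# The alternative: each tool converts into its own structure,
# spending CPU and memory on a copy (here, a list of strings).
converted = [str(x) for x in column]

# A consumer can compute directly on the shared buffer.
total = sum(view_b)

print(shared, converted, total)
```

Arrow applies the same principle across processes and languages by standardizing the columnar layout itself, so a tool that understands the format can read the data without a per-tool conversion step.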

Eric Kavanagh:

That's a really cool point you just made there, and it kind of speaks to the value of just tracking the open source movement, and maybe Louis I'll bring you back into this. It's like a full-time job, right? I mean just to know what's going on even just in the Apache Software Foundation, there are so many projects. Some of them are in very early stages, some of them have been around for a while, but it really pays dividends to watch and see where these things are going, and to have the blessing of certain organizations to pay attention, so you know where to invest your time and where not to invest your time, because the level of innovation is so significant these days that any number of processes that you have in place in your organization right now can be really fundamentally disrupted in a very positive way. Right, Louis?

Louis Bajuk-Yor:

Absolutely. I mean there's a tremendous amount of innovation going on in the open source community across many different areas. I mean even just narrowing it down to the R community, as I mentioned earlier, there are thousands of libraries out there that package authors are developing, thousands of different ways of analyzing data, trying to tackle specific use cases, specific applications, in a slightly better way or a more streamlined or efficient way, getting better answers or deeper insight. And then if you expand that to machine learning and AI, which we haven't really touched on here but is a whole nother area of open source development, there are things like TensorFlow for really getting deep insight into your data and building automated systems that can do things like image recognition or voice recognition or chat bots or whatnot. There's a ton of technology out there for developing and just really transforming our business in every possible way.

Eric Kavanagh:

That's great stuff. And Mark Shainman, I'll bring you back in to ... You really need a team to be focused on the open source movement. It seems to me you need someone whose job it is to just track these different projects, where they stand, how well they work, and that gets us back to collaboration, right? I think one of the cooler things about some of these bigger projects is that you have fierce competitors providing committers on projects like the Hadoop project. So last word from you, Mark, on thoughts for why open source is so important.

Mark Shainman:

I think that even though we compete against each other, a lot of our companies, and numerous other companies along with them, are contributors to open source projects that benefit the community and software companies as a whole. So it is important. And one thing I did want to mention is that not every open source solution is an Apache project. For example, open source Presto is still governed by Facebook but has a large community behind it, made up of Netflix, Airbnb, Uber, FINRA, and others, and Teradata is a contributor to that as well. So one thing is it's great that a lot of even our traditional software vendors have gotten onboard and realize there's a value in open source, but it's also great when you see innovative companies like Facebook or Netflix contributing to that software as well, which is benefiting all the people in the community.

Eric Kavanagh:

That's right. All right, folks, we've burned through another hour here on DM Radio. Big thanks to all of our guests today. Look them up online: Teradata, TIBCO, and Dremio. Send an email to yours truly, info@dmradio.biz. We are always curious to know what you want to learn about. And we're doing the ed cal for the second half of the year right now, so don't be shy. Tweet with the hash tag of DM Radio. Let us know what you want to know. We'll talk to you next time, folks. You've been listening to DM Radio.