What Are Data Pipelines? from 2017-04-14
What are Data Pipelines? Data moves between systems. Data engineers create data pipelines to orchestrate the movement, transformation, validation, and loading of data, from source to final destination. We’ve published a new page - What Are Data Pipelines? - to help explain this critical concept and related technologies.
What Is A Data Warehouse? from 2017-03-20
What is a Data Warehouse? For over 30 years, we’ve approached data analytics the same way - copy the data, move it to a different system, spend a fortune, wait an eternity. This is building the data warehouse. We’re hard at work on a new alternative. We’ve published a new page - What Is A Data Warehouse? - to explain the basics of data warehousing and to summarize the leading technologies.
ETL Tools Explained from 2017-03-15
What is ETL? The ETL model has been in use for over 30 years - read data from different sources, apply transformations, then save the results in a different system for analytics. Modern hardware and distributed processing create new models for accessing data for analytics. We put together a survey of current ETL tools in the market to help users understand their options in terms of commercial products, open source projects, and cloud services.
What Is Data Engineering? from 2017-03-08
What is Data Engineering? There’s a relatively new role in many companies called Data Engineering. This team is responsible for making it easier for analysts, data scientists, and systems to access and analyze data. Because the role is new, there are lots of questions about what this team does, what tools they use, and how they work with data. We’ve published a new page - What Is Data Engineering? - to help explain the topic.
BI on Big Data: What are your options? from 2016-06-08
Deciding what combination of technologies will yield the best ‘BI on Big Data’ experience can be a major challenge for data professionals. This presentation, given by Dremio CEO Tomer Shiran at Strata + Hadoop World London, aims to shed some light on some of the solutions that are available in the space. In this session, three general classes ‘BI on Big Data’ solutions were investigated: ETL to Data Warehouse (custom scripts, Informatica, Talend, Pentaho) Monolithic All-in-one Solutions (Datameer, Platfora, Zoomdata) SQL-on-Big-Data (Apache Drill, Apache Impala, Presto, Hive, Spark, Kylin, AtScale) The following slides contain architectural descriptions, pros and cons, and a needs-based heuristic to assist those trying to settle on a BI on Big Data solution.
Introducing Apache Arrow: Columnar In-Memory Analytics from 2016-02-17
Apache Arrow establishes a de-facto standard for columnar in-memory analytics which will redefine the performance and interoperability of most Big Data technologies. The lead developers of 13 major open source Big Data projects have joined forces to create Arrow, and additional companies and projects are expected to adopt and leverage the technology in the coming months. Within the next few years, I expect the vast majority of all new data in the world to move through Arrow’s columnar in-memory layer.
Parsing EPA vehicle data for linear correlations in MPG ratings from 2016-02-11
In the previous day’s post I demonstrated how to code a custom aggregate function that can process two sets of data points into their corresponding Pearson’s r value, which is a useful indicator of variable correlation. Today I’m going to put that function to the test on this EPA data set that contains information about vehicles manufactured from model years 1984 to 2017. The two questions I’d like to answer are: 1.
What can LIGO see? Let's look at gravitational waves with SQL from 2016-02-11
It’s difficult to overstate how thrilling today’s news about gravity waves is. The scientific community has been waiting a long time for this, and verification of the phenomenon has wide reaching implications in the fields of both astrophysics and particle physics. Gravity is, after all, the biggest thorn in the side of modern theoretical particle physics. From a particle-centric standpoint gravity wave detection is identical to ‘graviton’ detection. This is important because gravitons amount to the ‘missing link’ between the currently disconnected realms of the very big (dictated by Einstein’s general relativity) and the very small (governed by quantum mechanics).
Calculating Pearson's r using a custom SQL function from 2016-02-10
Lately I’ve written a lot of custom functions to assist me in my example Drill analyses, but they’ve all been of the same fundamental type: They take one or more columns of a single row and process them into a single output. The Drill documentation calls these “simple” functions. However there’s another class of functions lurking out there—ones that can accept many rows of data as input. We call them “aggregate” functions.
Managing variable type nullability from 2016-02-09
One day you may find yourself with a custom function for Drill that’s very particular about the kind of variables that it accepts. In particular, it may hold strong opinions about whether or not a variable is allowed to express a NULL value. In fact it may even be you who wrote this unavoidably fussy function (SPOILER: This is exactly what happened to me earlier this week). Currently Drill lacks built-in functions to add or strip nullability from variables, but luckily it’s very easy to whip up a couple UDFs which do exactly that.
Smartest and dumbest subreddits as judged by submission title readability from 2016-02-06
Alright, time to put the readability UDF from my last post to work on some data! For today’s analysis, I’ll once again use this Reddit submission corpus, which contains submission data from from the years 2006-2015. The questions that motivate today’s analysis are simple, but fun: Which popular subreddits have the highest average submission title reading level? Which ones have the lowest? Or, more glibly, “Which subreddits are smart, and which are dumb?
Querying for reading level with a simple UDF from 2016-02-05
Today, just like in this post from the previous week, I’d like to discuss creating a simple custom SQL function for Drill that maps strings to float values. Except this week’s function is even more simple because it can fit within a single file and requires no instructions in the setup() method. In fact, this may be the simplest example of a Drill UDF I’ve ever seen, so if you’ve been struggling with how to go about writing your own, the source code I’m presenting today maybe a good way to get some traction.
The case of the stolen candy hearts: Advanced date parsing in SQL from 2016-02-02
The other day I had 12 years of San Francisco crime data loaded in Drill and I wanted to answer the following question: Which days from recent years have the highest incidences of crime? As it turns out, this isn’t that difficult to accomplish, but it did add some new functions to my repertoire, so I thought I’d share the process with you. Once I got a hold of the SF crime download, I renamed it to a file with a ‘.
Reddit hates George Bush more than Vladimir Putin from 2016-01-29
Just as I promised, today I’m going to show off that nifty sentiment analysis UDF for Apache Drill that I discussed in the last article. Today’s data is once again provided by this awesome dump of Reddit submissions that date from 2006 up through last summer. Basically I just ran the sentiment analyzer function through submission titles, examining a selection of politicians that I thought Reddit might feel strongly about.
Writing a custom SQL function for sentiment analysis from 2016-01-28
In the world of data analytics a ‘sentiment analysis’ is any technique that attempts to represent the feelings of users in a somewhat quantitative way. Implementations of this idea vary, but one of the simplest ones involves giving individual words a numeric score according to the strength of the positive or negative the emotions that they elicit. For instance we might assign a score of -2.3 to the word ‘disappointment’ and a score of 1.
Securing SQL on Hadoop, Part 2: Installing and configuring Drill from 2016-01-25
Today we’re going to pick up where we left off in Part 1 of my two-parter about setting up a CDH cluster to perform secure SQL queries on an HDFS store. As you recall, last time we had just finished using Cloudera Manager’s wizard to finalize a Kerberos configuration, and all of the cluster services had come back online using the new security system so our basic HDFS cluster set-up was good to go.
Securing SQL on Hadoop, Part 1: Installing CDH and Kerberos from 2016-01-22
Introduction In the Dremio Blog we’ve talked about pairing Drill with HDFS before, but never with the emphasis on security that so often comes hand-in-hand with enterprise applications. So today’s post marks the beginning of a two part series explaining how to set up a cluster environment that enables secure SQL queries to data stored on HDFS. For this article’s test ‘hardware’ we’ll use six instances provisioned from Amazon’s EC2 service, while the core software components will be provided by Cloudera’s Hadoop system, CDH, paired with Ubuntu 14.
Drill plays the classics: Querying musical compositions with SQL from 2016-01-21
Today in ‘Wow, I never expected to use SQL for that!,’ I’m going to show how you can use Drill (along with a simple command line utility) to analyze musical compositions for sound and style characteristics. The music we’ll be analyzing is a selection of piano pieces written by various classical composers. In particular, we’ll be looking at representations of this music encoded into MIDI files created by one Bernd Krueger, who hosts them on his site: www.
Using a SQL JOIN on two CSV files from 2016-01-19
One of Drill’s great strengths is its ability to unite sets of information found in different places. In today’s article I’ll provide a quick example of this functionality by showing you how to you use Drill to quickly improve the interface to some found data. We’ll start with the interesting but strangely formatted business-centered census data that we used in the last post (available on this site as a 14.3 MB zip file).
SQL on Kudu quickstart from 2016-01-13
Kudu is a new (currently in beta) distributed data storage system that attempts to unify the strengths of HBase and HDFS systems. Although initially developed at Cloudera, the software has recently been inducted into the Apache Incubator program. This means there’s a reasonable expectation of Kudu becoming the next big Next Big Thing in terms of data storage. As the widely varied topics of this blog demonstrate, Drill devs aren’t very willing to let a potential source of valuable data slip by unnoticed, so of course they’ve been hard at work on a new storage plugin.
Old and busted: Teasing formerly-fashionable websites from Reddit data from 2016-01-03
Anyone who spends even a little bit of time on the Internet knows how fickle and volatile the cultural scene is. And there’s perhaps no greater exemplar of this volatility than Reddit. For good or bad, Reddit communities often serve as tastemakers for the Internet at large. If your content is visible on Reddit, chances are that things are going great for you. And if not, well, maybe not… The topic for today’s post is pretty simple—I’m just going to show off a cool analysis related to this observation.
Predicting International Space Station solar transits using built-in SQL math functions from 2015-12-21
If you’re an astronomy or spaceflight nerd (in my case: ‘check’ and check’) there’s a good chance you’ve seen one of these really cool photos people have been taking lately where they capture an image of the International Space Station as it makes a transit through the disc of the Sun. Here’s one that was recently (September ‘15) featured on NASA’s popular Astronomy Picture of the Day site:
Tuning Parquet file performance from 2015-12-13
Today I’d like to pursue a brief discussion about how changing the size of a Parquet file’s ‘row group’ to match a file system’s block size can effect the efficiency of read and write performance. This tweak can be especially important on HDFS environments in which I/O is intrinsically tied to network operations. (Note: If you’d first like to learn about Parquet in a less ‘nuts and bolts’ manner, let me recommend this post in which I provide a simple demo of the format using Apache Drill).
Examining U.S. radio antenna locations with Tableau Desktop 9 from 2015-12-10
For today’s demo I’m going to parse a data.gov file that contains U.S. radio tower locations into a cool map figure. I’ll be using Apache Drill to handle reading in and manipulating the geographic coordinate fields in the file, and Tableau Desktop 9 to provide an easy way to generate a map based on the data. Attaching Tableau to Drill data in Microsoft Windows is pretty easy. Assuming you already have Drill (see this article) and Tableau installed, you just need to download and install this ODBC driver from MapR (choose the 64-bit .
Installing Apache Drill on Microsoft Windows from 2015-12-08
Many of my previous articles on the Dremio Blog have assumed that you’d be using Apache Drill from within some flavor of unix (namely either Linux or OS X), but it’s completely possible to install Drill on Microsoft Windows as well. In this post I’ll cover how to install and run a single-machine instance of Drill on Windows Server 2012 R2. Just like the unix versions, you’ll need to have the Java Development Kit installed to run Drill.
Performing basic statistics operations on Twitter data from 2015-12-03
I’ve used Drill’s AVG() function quite a bit previously, but in this post I’d like to point out some other basic statistics operations you can do in Drill. For my trial data I’m going to reference my trusty corpus of nearly 52,000 tweets pulled from Twitter’s streaming API, and with it I’m going to try to get a sense of how long people prefer to make their Twitter profile’s ‘real name.
Querying dates using SQL in a JSON document from 2015-12-03
In this article we’re going to use Apache Drill to bend TIME ITSELF to our every whim. Or wait, not TIME ITSELF. I meant, er… TIMESTAMPS. Don’t worry—that’s still pretty useful. So let’s say I have the results of a Twitter search in raw JSON and would like to know the range of dates that the statuses span. For this we’ll need the timestamp information, which is held in the ‘created_at’ key of each object.
White mana cards are cheap and stupid: Querying a database of every Magic card ever made from 2015-12-03
With over 10,000 unique cards and a decade of success behind it, Magic: The Gathering is still as relevant as ever to fans of the collectible card game genre. But how well has Magic managed to stick to its core design philosophy over the years? Today we’ll examine this question by analyzing a JSON dump of the text found on every card ever produced. For the uninitiated (and those who haven’t a touched a card since their days on the middle school cafeteria circuit), actions in a Magic game are enabled by a resource called “mana” that comes in five fundamental varieties: White, Blue, Black, Red, and Green.
Configuring Drill's storage plugins with the REST API from 2015-12-02
The Web Console of Apache Drill is an intuitive configuration tool, but it’s easy to imagine scenarios where connecting to a Drill instance via a browser isn’t desirable or even possible (e.g. within a script or in a high security environment). So if you find yourself needing to set up a Drill storage plugin in such a situation, what should you do? Well, in cases like this Drill’s built-in REST API comes to the rescue.
Setting up an SQL interface to a sharded MongoDB from 2015-12-02
Salutations, data fans! Today I’d like to demonstrate how to use Apache Drill to query a sharded MongoDB database running on a computer cluster. To begin with, I’m going to assume you already know how to do a sharded Mongo setup (the official documentation is very helpful), or you already have access to a system with a sharded MongoDB in place. With this out of the way, I can focus on the Drill setup.
Bless this mess: Working with complicated JSON structure in SQL from 2015-12-01
Data you find on the web comes in all formats. From simple CSV to some really creepy-crawly JSON. The data provided by the Twitter API to describe a tweet unfortunately falls within the latter category. In this article I’ll use Drill to provide an SQL interface to a MongoDB collection of about 52,000 tweets pulled from the Twitter streaming API. As I alluded to, tweet data presents itself in a way that manifests a lot of nested elements, and documents in the collection follow this JSON structure:
SQL and Parquet: A simple demo from 2015-12-01
For data wranglers wishing to store large amounts of text or numeric entries in an efficient fashion, the Apache Parquet file format makes for an obvious choice. As a storage medium, Parquet has two important benefits afforded by its columnar structure: 1.) Access to data columns can be made on an ‘as needed’ basis, increasing the overall speed of queries, and 2.) Since all values in a column are serialized and compressed together, Parquet files take up much less room than similar plain text or row-wise compressed files.
Finding corrupt JSON records in MongoDB from 2015-11-30
A few weeks ago I mangled a MongoDB collection I was working with by importing a file I’d corrupted with a manual edit. The change I introduced was small, but serious enough to cause my Drill queries to fail. In this post I’m going to simulate this situation with an intentionally besmirched JSON file imported into MongoDB. Today’s data set comes from some tweets that I pulled from Twitter’s streaming API.
SQL queries on CSV files with column headers from 2015-11-23
A new feature of Drill 1.3.0 is the ability to query CSV files using the column names listed in the file’s header. As an example, let’s look at this CSV formatted file of countries that participated in the 2012 Summer Olympics. Before we begin, rename the file so that it ends in ‘.csvh’ instead of ‘.csv’. The extension change tells Drill to parse the header of the CSV file, which means we can use the in-file identifiers for column names when we construct our queries.
Bootstrapping an interactive SQL environment on Amazon EMR from 2015-11-17
Amazon’s Elastic MapReduce (EMR) service is a great way to create on-demand Hadoop clusters for ad hoc data analysis. The web based interface for the service is simple to use, and at its most basic level creating an EMR cluster is as easy as specifying the number and type of machines to use. But you can also instruct each computer to execute something called a ‘bootstrap action’ just before coming online.
How to query S3 data using Amazon's S3a library from 2015-11-16
Starting with version 1.3.0, Drill has the ability to query files stored on Amazon’s S3 cloud storage using the S3a library. This is important, because S3a adds support for files bigger than 5 gigabytes (these were unsupported using Drill’s previous S3n interface). To enable Drill’s S3a support, first edit the file conf/core-site.xml in your Drill install directory, replacing the text ENTER_YOUR_ACESSKEY and ENTER_YOUR_SECRETKEY with your AWS credentials. <configuration> <property> <name>fs.s3a.access.key</name> <value>ENTER_YOUR_ACCESSKEY</value> </property> <property> <name>fs.
Querying Google Analytics JSON with a custom SQL function from 2015-10-30
Last time I wrapped up by showing you how to look at Google Analytics JSON using a nested SQL query in Apache Drill. This approach is fine, but by implementing a custom function in Drill we can talk to the same data using a much simpler query. To get started making user defined functions (UDFs), you first need to download and install Apache Maven. Once you have the tarball, move it to your home directory, and then do:
Using SQL to interface with Google Analytics data stored on Amazon S3 from 2015-10-20
Introduction In the realm of web analytics, Google’s presence remains formidable. And it’s no surprise—as a free service with a brain-dead easy setup and a wide range of collected user information, Google Analytics has a lot going for it. But if you’d like to move away from the standard Analytics web interface, what are your options? The intended audience for this article is those who might wish for a little more flexibility with their Google Analytics data.
Running SQL Queries on CSV Files Using Apache Drill from 2015-10-13
If you find a large data store on the web that you’re interested in, chances are good that it’ll be available for download as a dump to a CSV file. The CSV format definitely isn’t the most sophisticated data storage method around, but it’s common, relatively intuitive for humans to read, and simple to import and export from spreadsheet software like Excel and LibreOffice. Many of us, however, might want to quickly integrate a CSV data set with an existing database system (like Hadoop or MongoDB) that’s already in place.
Bringing SQL to MongoDB with Apache Drill from 2015-09-20
If you’re reading this post, then chances are you’ve found yourself wishing you could talk to your MongoDB data store using a standard SQL interface. There are lots of good reasons to do this: Maybe you’d like to port some existing code with a minimum of fuss, or maybe you’re dealing with a lot of different data storage types and would like to unify them all under a single interface (hint: Apache Drill is fantastic for this).