As different systems pop up that try to make ad hoc analysis of big data sets easy and fast, I thought it was time to compare Cloudera’s Impala with other projects for ad hoc analysis on HDFS. It turned out that there are simply no alternatives yet. Let me explain how I got there.
When I came up with the idea for the comparison, I had at least one project in mind to compare against, namely the Apache Incubator project Drill. So let’s start with Drill: what has it got right now? It has a very extensive proposal with lots of features on the list. All that’s implemented right now, however, as Henry Robinson pointed out, is some parser code. I looked that up for myself, and the code repository confirmed that it’s way too early for Drill to be a competitor for Impala.
So what alternatives are there? We could dig further into some MapReduce abstractions, such as Twitter’s Scalding, Cascalog or Akamai’s Trecul. But I don’t think we can call this ad hoc analysis anymore, and they cannot compete on performance, since they all just spawn MapReduce jobs. Trecul looks a bit more promising in terms of performance, and it may stand a chance at least for very complex queries, but its installation is a hassle.
So, in my opinion, Cloudera’s Impala is the only solution right now. Or does anybody know of a possible competitor?
For more information on Impala, I recommend watching the following interview from Strata + Hadoop World with the lead developer, Marcel Kornacker.