Last week we went to the Strata + Hadoop World 2012 conference in New York. The conference covers everything around Big Data: cutting-edge technologies, data science, business intelligence and so on. The conference itself is also pretty big: around 3500 participants, seven tracks, countless sponsors and lots of coffee.
Not even a full month after the Strata Conference in London, the conference in N.Y. made some of the trending topics in the Big Data world even clearer, such as the move from batch processing to interactive analysis. In the following review I will share some of what I learned at the conference from a technological perspective.
From Batch Processing to Ad Hoc Analysis
Multiple solutions are popping up that allow ad hoc analysis of data almost instantly instead of batch processing via MapReduce.
Cloudera revealed Impala, which allows ad hoc analysis of data and is around 4 to 30 times faster than Hive over MapReduce. It looks a lot like Apache Drill, but according to Henry Robinson it is much further along in development.
Another tool that sits somewhere between MapReduce and ad hoc analysis is Akamai’s Trecul, which has its own expression language, like Pig, but tries to outperform Hive in many ad hoc analysis cases. Unlike Hive, Trecul is implemented in C++ and its code gets compiled, so it is also a lot faster, especially for complex queries.
New NoSQL Stores
There are some cases where Hadoop just doesn’t fit well. Metamarkets, for example, just open-sourced Druid, a distributed in-memory datastore for realtime analytics. If you need web-suitable queries on large data volumes and want horizontal scaling, this store is for you.
Another interesting concept is the Datomic database, which has two main goals: consistency and time-based facts. It achieves these goals by storing data in an immutable fashion: facts are added over time and never overwritten. This model allows some exciting features such as ACID transactions and easy caching while being elastically scalable. Unfortunately the database isn’t open source, but a free edition is available for development and production.
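To make the idea of immutable, time-based facts a bit more concrete, here is a minimal sketch in Python. It is not Datomic's API or data model; `Fact`, `FactStore`, `assert_fact` and `as_of` are names I made up for illustration. It only shows the general principle: new facts are appended with a transaction number, nothing is updated in place, and older states stay queryable.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Fact:
    """One immutable fact (hypothetical shape, not Datomic's)."""
    entity: str
    attribute: str
    value: Any
    tx: int  # transaction number at which the fact was asserted


class FactStore:
    """Append-only log of facts; existing entries are never modified."""

    def __init__(self) -> None:
        self._log: List[Fact] = []
        self._next_tx = 1

    def assert_fact(self, entity: str, attribute: str, value: Any) -> int:
        """Append a new fact and return its transaction number."""
        tx = self._next_tx
        self._log.append(Fact(entity, attribute, value, tx))
        self._next_tx += 1
        return tx

    def as_of(self, entity: str, tx: Optional[int] = None) -> Dict[str, Any]:
        """Return the entity's attributes as they were at transaction tx.

        With tx=None the latest state is returned. The log is in
        transaction order, so later facts override earlier ones.
        """
        view: Dict[str, Any] = {}
        for fact in self._log:
            if fact.entity == entity and (tx is None or fact.tx <= tx):
                view[fact.attribute] = fact.value
        return view


store = FactStore()
t1 = store.assert_fact("user-1", "email", "old@example.com")
t2 = store.assert_fact("user-1", "email", "new@example.com")

print(store.as_of("user-1", t1))  # state as of the first transaction
print(store.as_of("user-1"))      # latest state
```

Because a past state never changes once written, any view computed "as of" a given transaction can be cached without invalidation, which is one way to read the "easy caching" claim.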
New Kinds of Batch Processing
I am happy to see that Hadoop’s YARN is starting to be adopted. Cloudera gave a presentation where they implemented a new paradigm, the so-called Iterative Reduce, on top of YARN. However, they stated that the documentation of YARN is still lacking and that it is pretty hard to write applications on top of YARN, as they needed at least 1000 LOC just to deal with it. We’ll see how this evolves in the future.
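I did not take away implementation details from the talk, so the following is only my own rough sketch of what an iterative reduce loop can look like, not Cloudera's code and not anything YARN-specific. The assumption is that workers each refine a shared model on their own data partition, a master averages the results, and this repeats for several rounds. The function names, the toy model y = w * x and the learning rate are all hypothetical.

```python
from typing import List, Tuple

Partition = List[Tuple[float, float]]  # (x, y) pairs held by one worker


def worker_step(w: float, part: Partition, lr: float) -> float:
    """One local gradient step for the model y = w * x on one partition."""
    grad = sum(2 * (w * x - y) * x for x, y in part) / len(part)
    return w - lr * grad


def iterative_reduce(parts: List[Partition], rounds: int, lr: float = 0.01) -> float:
    """Repeat: every worker updates the model locally, the master averages."""
    w = 0.0
    for _ in range(rounds):
        # Worker side: would run in parallel containers in a real system.
        local_models = [worker_step(w, part, lr) for part in parts]
        # Master side: combine the partial results into the next global model.
        w = sum(local_models) / len(local_models)
    return w


# Toy data following y = 3 * x, split across two "workers".
parts = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = iterative_reduce(parts, rounds=200)
print(round(w, 6))
```

The difference from plain MapReduce in this sketch is that the averaged model is fed back to the workers for the next round instead of each pass being an independent job.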
Distributed Graph Analytics
Graphs play an important role today, as they are omnipresent and essential to Data Mining and Machine Learning. Yet it is pretty hard to do distributed graph analytics. Intel will open-source their GraphBuilder, which allows you to construct large-scale graphs on top of Hadoop and process them with GraphLab, a distributed computation framework for Data Mining and Machine Learning.
HBase going Enterprise
HBase 0.96 will include a lot of enterprise-ready features (some of the features are also available in 0.94 and even 0.92). Just to name a few:
- Failure detection and recovery in seconds; it’s even possible to do writes during recovery
- New metrics available, such as per Column Family metrics, per Region metrics and more
- Wire Compatibility and Online Schema Change (experimental!) allow you to reduce planned downtimes
- Master-Master Replication
- Table Snapshots
- User Authentication and Authorisation
At the Strata + Hadoop World Conference N.Y. there were so many tracks, so many talks, so many people and so many companies. It is just amazing to see how the Big Data world evolves. The future looks bright, and continuing to work in the Big Data field feels like the right way to go.