HBase Split Visualisation – Introducing Hannibal!

Out of the box, HBase uses its built-in automatic split functionality to split large regions. After a major compaction, every region has exactly one StoreFile per column family. StoreFiles store their data internally in the HFile format. When a StoreFile exceeds hbase.hregion.max.filesize, the region server splits it in the middle (on block boundaries) and creates two new StoreFiles from it.
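The split rule described above can be sketched as a small model. This is not HBase code: the function names, the 1GB threshold and the sample sizes are assumptions made up for illustration, and the real split point is a row key near the middle of the file rather than an exact byte offset.

```python
# Simplified model of the size-based split check; not HBase code.
# All names and sample values here are hypothetical.

GB = 1024 ** 3
MB = 1024 ** 2


def needs_split(store_file_sizes, max_filesize):
    """Return True if any StoreFile of a region exceeds the threshold.

    store_file_sizes: size in bytes of each StoreFile (one per column
    family after a major compaction).
    max_filesize: value of hbase.hregion.max.filesize in bytes.
    """
    return any(size > max_filesize for size in store_file_sizes)


def split_in_middle(size):
    """Model a split as two roughly equal halves of the original size."""
    first = size // 2
    return first, size - first


max_filesize = 1 * GB            # assumed threshold for this example
region = [3 * GB // 2, 200 * MB]  # two column families

if needs_split(region, max_filesize):
    largest = max(region)
    print("split", largest, "into", split_in_middle(largest))
```

The point of the model is only that a single oversized column family is enough to trigger a split of the whole region.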

Due to our key design and the irregular growth of our data, we end up with a lot of regions per table, with region sizes ranging from 200MB to 40GB. This high region count makes things noticeably slower. The recommended number of regions per region server ranges from 20 to the low hundreds.
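To get a feeling for numbers like these, a few lines of Python are enough to summarise a list of regions. The sketch below assumes you have already collected (region server, table, size in MB) tuples somehow, for example by hand from the HBase web UI. The sample data and function names are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical sample: (region server, table, region size in MB).
regions = [
    ("rs1", "events", 200),
    ("rs1", "events", 40960),
    ("rs2", "events", 5120),
    ("rs2", "users", 800),
]


def regions_per_server(rows):
    """Count how many regions each region server hosts."""
    return Counter(server for server, _table, _size in rows)


def size_range_per_table(rows):
    """Return {table: (smallest region MB, largest region MB)}."""
    sizes = defaultdict(list)
    for _server, table, size_mb in rows:
        sizes[table].append(size_mb)
    return {table: (min(v), max(v)) for table, v in sizes.items()}


for server, count in sorted(regions_per_server(regions).items()):
    print(server, "hosts", count, "regions")
for table, (low, high) in sorted(size_range_per_table(regions).items()):
    print(table, "regions range from", low, "MB to", high, "MB")
```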

As stated in a previous post, one way to deal with this issue is to disable auto-splitting and manage splits manually. To turn off automatic splitting, just set hbase.hregion.max.filesize to a high value such as 40GB or even higher. That’s what we did.
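The property value is given in bytes, so 40GB has to be converted first. The helper below is a small sketch (the function name is made up) that does the arithmetic and prints a property entry of the kind that goes into hbase-site.xml. Treat the output as a starting point and check it against your own configuration.

```python
# hbase.hregion.max.filesize is specified in bytes.
GB = 1024 ** 3


def max_filesize_property(gigabytes):
    """Render an hbase-site.xml property entry for the given size in GB."""
    value = gigabytes * GB
    return (
        "<property>\n"
        "  <name>hbase.hregion.max.filesize</name>\n"
        f"  <value>{value}</value>\n"
        "</property>"
    )


print(max_filesize_property(40))
```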

But, how do I monitor my region growth and distribution on the cluster?

And even more importantly, how evenly sized are the regions of each table?

What is going on in particular regions? How long do compactions take?

As none of the tools shipped with HBase or Ganglia (the monitoring tool we use) could help us with these questions, we decided to build our own solution: Hannibal.

Hannibal is a web-based tool that visualises region sizes, their distribution across the cluster, and the compaction history. The visualisations help you make decisions about manual splitting.

[Screenshot: Region distribution on the server]

[Screenshot: Region splits]

[Screenshot: Region history]

Hannibal is open source and implemented in Scala. In its current version it supports HBase 0.90. Support for later versions is planned and will be added soon. Please have a look at the GitHub project, install Hannibal, play around with it, and let us know what you think: what you like, what you don’t, and which additional features you would like to see.

My talk “Operating HBase: Things you need to know” at ApacheCon Europe will cover the above topics and introduce Hannibal as well.

Which HBase region splitting mechanism do you use and how do you solve the problems that arise from it?

Update:

We just uploaded a video tutorial.

Let me hear what you think.

Update 2:

Got feedback or need help? Follow Hannibal on Twitter or subscribe to the mailing list.

  1. The tool looks GREAT, and it addresses the ‘next level’ of monitoring on top of (for example) Cloudera Manager. CM provides monitoring up to the Region Server level, but you are isolated from each and every region. For this CM sends you to the open-source monitoring screens.

    Another great feature to see here would be the number of requests/min hitting each region. This would let you see whether any regions are ‘hotspots’ in terms of reads and writes.

    But overall the tool is very effective, and a great addition to the production support arsenal!
