We are witnessing a paradigm shift from batch based data processing to real-time data processing using the Hadoop framework. Despite this progress it is still a challenge to process web-scale data in real-time. A lot of technologies can be used to create such a complete data processing system – but to choose the right tools, to incorporate and orchestrate them is complex and daunting.
Nathan Marz defines the most general data system as a system that runs arbitrary functions on arbitrary data. This leads to the following equation “query = function(all data)” which is the basis of all data systems. The Lambda Architecture defines a clear set of architectural principles for building robust and scalable data systems that obey the equation above. He is also currently writing the book “Big Data – Principles and best practices of scalable realtime data systems”.
The Lambda Architecture is based on three main design principles:
- human fault-tolerance – the system is unsusceptible to data loss or data corruption because at scale it could be irreparable.
- data immutability – store data in it’s rawest form immutable and for perpetuity.
- recomputation – with the two principles above it is always possible to (re)-compute results by running a function on the raw data.
In general the Lambda Architecture is composed of three layers: the batch layer, the serving layer and the speed layer.
The batch layer contains the immutable, constantly growing master dataset stored on a distributed file system like HDFS. With batch processing (MapReduce) arbitrary views – so called batch views are computed from this raw dataset. So Hadoop is a perfect fit for the concept of the batch layer.
The job of the serving layer is to load and expose the batch views in a datastore so that they can be queried. This serving layer datastore does not require random writes – but must support batch updates and random reads – and can therefore be extraordinarily simple (candidates could be ElephantDB or Voldemort).
This layer deals only with new data and compensates for the high latency updates of the serving layer. It leverages stream processing systems (Storm, S4, Spark) and random read/write datastores to compute the realtime views (HBase). These views remain valid until the data have found their way through the batch and serving layer.
To get a complete result, the batch and realtime views must be queried and the results merged together .
The Lambda Architecture is the first approach that handles the complexity of Big Data systems by defining a clear set of principles. Sentric adopted these architectural principles (or at least part of them) for our customers as they are great approach that can be applied to any Big Data system. Specifically immutability, human fault-tolerance and recomputation are really nice principles that can be easily adopted with the Hadoop platform.
Depending on realtime requirements, often enough the speed layer is not even needed. If omitted, it makes the whole system even less complex, but the beauty of the Lambda Architecture is that the speed layer can be integrated later on without a huge hassle.
In part II of our series we’ll write about the batch layer with an example case. So stay tuned!