Making Sense of it All

Building a well-designed, reliable and functional big data application that caters to a variety of end-user latency requirements can be an extremely challenging proposition. It can be daunting enough to just keep up with the rapid pace of technology innovation happening in this space, let alone building applications that work for the problem at hand. “Start slow and build one application at a time” is perhaps the most common advice given to beginners today. However, there are certain high-level architectural constructs that can help you mentally visualize how different types of applications fit into the big data architecture and how some of these technologies are transforming the existing enterprise software landscape.

Lambda Architecture

Lambda Architecture is a useful framework to think about designing big data applications. Nathan Marz designed this generic architecture addressing common requirements for big data based on his experience working on distributed data processing systems at Twitter.

Some of the key requirements in building this architecture include:

  • Fault-tolerance against hardware failures and human errors
  • Support for a variety of use cases that include low latency querying as well as updates
  • Linear scale-out capabilities, meaning that throwing more machines at the problem should help with getting the job done
  • Extensibility so that the system is manageable and can accommodate newer features easily


Overview of the Lambda Architecture

The Lambda Architecture as seen in the picture has three major components.
  1. Batch layer that provides the following functionality

    1. managing the master dataset, an immutable, append-only set of raw data

    2. pre-computing arbitrary query functions, called batch views.

  2. Serving layer—This layer indexes the batch views so that they can be queried in ad hoc with low latency.

  3. Speed layer—This layer accommodates all requests that are subject to low latency requirements. Using fast and incremental algorithms, the speed layer deals with recent data only.

    Each of these layers can be realized using various big data technologies. For instance, the batch layer datasets can be in a distributed filesystem, while MapReduce can be used to create batch views that can be fed to the serving layer. The serving layer can be implemented using NoSQL technologies such as HBase, while querying can be implemented by technologies such as Apache Drill or Impala. Finally, the speed layer can be realized with data streaming technologies such as Apache Storm or Spark Streaming.

Up to now, the description of the Lambda Architecture here makes use of the basic capabilities that are pretty much common to all distributions powered by Hadoop. There are somethings you can do, however, with a MapR cluster that improves the basic operation of the Lambda architecture.

For instance, most Storm topologies avoid the use of much persisted state. This is fast and easy, since tuples can be acknowledged as soon as their effect has been impressed on memory. In the Lambda Architecture, this is not supposed to be too big of a deal, since any in-memory state that is lost due to software version upgrades or failures will be repaired within a matter of hours or so as the affected time window ages out of the real-time part of the architecture.

When you have a MapR cluster underneath a Lambda Architecture, however, you can do a bit better than this, so that the times that failures are visible drops to seconds instead of hours.

One way that this works is that MapR allows high-speed streaming data to be written directly to the Hadoop storage layer, while allowing stream-processing applications such as Storm or Spark Streaming to run as an independent service within the cluster. The processing application now becomes more of a subscriber to the incoming data feed. If a failure occurs, and the original application goes down, a new instance of the application can pick up the data stream within seconds of where the original application instance dropped off. An added advantage of this architecture is the availability of streaming data for batch as well as the serving layers.

In addition, individual processing elements can delay their acknowledgement of incoming tuples until they have logged the tuple to a log file in the distributed file system. This log file need only persist until the state of the bolt is either persisted or a summary is sent down stream. At that moment, a new log is started.

In the case of failure or orderly exit of a topology, the new version of the bolt can read this log and reconstruct the necessary state of the bolt very quickly. Once the log is read, tuples coming from the spout can be processed as if nothing ever happened. Since all tuples that arrived *after* the last record in the log have not been acknowledged, the spout will replay them so the bolt will get a complete set of tuples.

For more detailed information on Lambda Architecture, read: