Real-time Learning: The Quick Without the Dirty

When people sit down to build a real-time big data reporting system, compromises very commonly creep into the design. These compromises result in a “quick and dirty” analysis, the thinking being that to get rapid results you must give up accuracy, consistency, or even any notion of what failure modes might exist. But Ted Dunning says that to get “quick” you don’t have to settle for “dirty.”

Would you like to analyze data seamlessly across up-to-the-minute real-time reporting and long-term aggregation, without having to reprocess temporary real-time estimates? Would you like to do that accurately and with a simple architecture? And would you like your CEO to stop finding nasty discrepancies in your metrics?

Last week at Berlin Buzzwords 2013, MapR’s Ted Dunning showed how to do this both with metrics and with many forms of machine learning in his fourth #bbuzz talk, titled “Real-time Learning for Fun and Profit,” presented to a packed room.
Quote from Ted Dunning’s talk regarding the challenge of accurately combining long-term aggregation with real-time reporting: “It’s not a problem. It’s an opportunity.”
Interest in machine learning is widespread and growing. This talk addressed that interest by looking at the real-time and long-time transition in the context of learning models.

Several key MapR features make the approach Ted described remarkably simple to implement. These features include NFS access to the MapR distributed storage, as well as reliable, small-footprint MapR snapshots. Under the covers, this approach employs a combination of replay logs, aggregation checkpoints, and snapshots to implement a real-time system with an analysis horizon that stretches from the present moment across years of data.
Figure: The strengths of Hadoop for batch processing and Storm for real-time analysis can be blended to provide a seamless view that includes real-time and long-time elements.
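
To make the combination of replay logs and aggregation checkpoints a little more concrete, here is a minimal sketch in Python. It is not code from Ted's talk and it does not use any MapR, Hadoop, or Storm API. The `Checkpoint` class, the `take_checkpoint` and `current_counts` functions, and the in-memory list standing in for the replay log are all hypothetical names chosen for illustration. The idea shown is only this: a checkpoint records the aggregated counts together with the log position they cover, so an exact current answer is the checkpoint plus a replay of the log entries that came after it.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """Aggregated counts plus the log offset they cover (hypothetical layout)."""
    counts: Counter = field(default_factory=Counter)
    offset: int = 0  # number of log entries already folded into counts


def take_checkpoint(previous: Checkpoint, log: list[str]) -> Checkpoint:
    """Fold every log entry after the previous offset into a new checkpoint."""
    counts = Counter(previous.counts)
    counts.update(log[previous.offset:])
    return Checkpoint(counts=counts, offset=len(log))


def current_counts(checkpoint: Checkpoint, log: list[str]) -> Counter:
    """Exact up-to-the-moment counts: checkpoint plus a replay of the log tail."""
    counts = Counter(checkpoint.counts)
    counts.update(log[checkpoint.offset:])
    return counts


# An in-memory list stands in for the replay log (assumption for illustration).
log = ["/home", "/docs", "/home"]
cp = take_checkpoint(Checkpoint(), log)   # long-term aggregate so far
log += ["/home", "/pricing"]              # events arriving after the checkpoint
live = current_counts(cp, log)            # same answer as counting the whole log
```

Because each log entry is counted either in the checkpoint or in the replayed tail, and never in both, the live result matches a full recount of the log, with no temporary estimate to correct later.
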
What does this approach mean to the business user?

If you need to collect and react to data as it arrives but also need to store data over a long time frame, this approach may help you. Traditionally, it has been difficult to do this accurately while keeping reporting up to the instant. Ted’s approach, applied via a MapR system, is exact, correct, and consistent as analysis moves smoothly from real-time to long-time.

The example that Ted focused on at Buzzwords was the problem of maintaining simple counts, such as how many page views there are for a particular site, but this same approach can be applied to any problem involving associative aggregations. This includes unique counts (e.g. how many unique visitors to a web site), finding heavy hitters or trending topics (things getting the highest number of hits) and even the co-occurrence counting required by recommendation engines.
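
As a rough illustration of why associativity matters here, the sketch below builds partial aggregates for separate batches of events and merges them. It is an assumption-laden toy, not the implementation from the talk: `summarize` and `merge` are hypothetical helpers, events are made-up `(page, visitor)` pairs, and unique visitors are tracked with exact Python sets purely for simplicity. Because the merge is associative, older batches can be combined ahead of time and the most recent batch folded in later, and the result is the same as aggregating everything at once.

```python
from collections import Counter


def summarize(events):
    """Build a partial aggregate for one batch of (page, visitor) events."""
    return {
        "views": Counter(page for page, _ in events),
        "visitors": {visitor for _, visitor in events},
    }


def merge(a, b):
    """Combine two partial aggregates; the grouping of merges does not matter."""
    return {
        "views": a["views"] + b["views"],
        "visitors": a["visitors"] | b["visitors"],
    }


# Made-up sample data: two older batches and one recent batch.
hour1 = [("/home", "ann"), ("/docs", "bob")]
hour2 = [("/home", "bob"), ("/home", "cat")]
recent = [("/docs", "ann")]

p1, p2, p3 = summarize(hour1), summarize(hour2), summarize(recent)

left = merge(merge(p1, p2), p3)    # pre-merge the old batches, add recent last
right = merge(p1, merge(p2, p3))   # a different grouping of the same merges
everything = summarize(hour1 + hour2 + recent)

top_page = left["views"].most_common(1)   # a simple "heavy hitter" query
unique_visitors = len(left["visitors"])
```

Page-view counts, unique visitors, and the top page all come out of the same merged structure, which is what lets one design serve both the long-term and the up-to-the-minute views.
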

If you would like to know how this approach could improve your business metrics and machine learning efforts, contact Ted at MapR or view his slides.
