Haystacks and Jet Packs – How Hadoop Changes Everything

I thought for sure we'd have flying cars and jet packs by the 21st Century. It turns out that the new tools for data analysis are far more important.

Only a few short years ago, the best tools available could only analyze small representative samples of big data sets. To figure out who will win the election, ask randomly selected people. To establish a correlation between behavior and outcome, do a study on a small group. This is an excellent way to find the common cases, but does nothing to turn up interesting anomalies--it's like looking for a needle in a haystack by searching one percent of the hay--or one thousandth of one percent.

The new tools for analyzing Big Data can search the entire haystack--and are scaling up to deal with bigger and bigger haystacks. You may have heard this by now, but Big Data is big. How big? The Large Hadron Collider produces over a petabyte per month. The Square Kilometre Array radio telescope, which is being developed over the next decade or so, will produce at least an exabyte per day of astronomical data. According to Google CEO Eric Schmidt, every two days the world produces as much data as it did in total up until 2003. The amount of data produced by the world nearly doubles every year--and the world produced one zettabyte in 2010 (according to IDC).

It's not just scientists, search engines, and social media sites that need to process this much data. Retailers, merchants, and credit card companies are looking for patterns in massive flows of transaction, click, and sentiment data to fine-tune marketing, prevent fraud, and optimize the customer experience. Here's an example. Let's say there's a transaction on your credit card indicating that you bought a tank of gas. An hour or so later, another transaction shows that you ate lunch. If the restaurant is within 30 miles or so of the gas station, then the speed you would have had to travel--the card velocity--is a plausible 30 miles per hour. But if the restaurant is, say, 500 miles away, then the card velocity is a good indicator of fraud, unless you also own a jet pack. Only by processing data about every single transaction and its location can valuable insights like card velocity be gained.

The New York Times reported that data measurement could be as important as the invention of the microscope. By taking a snapshot of the data at various points in time, of course, you can tune the microscope by preserving a history of previous results, which makes it possible to further refine data analysis models.

This is the true beginning of the 21st Century. The possibilities that will grow from this new technology can barely be imagined today. So as you start to search for the needle in your haystack or even the needle in your jet pack, we'd love to hear the ways MapR made it possible for you.

Streaming Data Architecture:

New Designs Using Apache Kafka and MapR Streams




Download for free