Getting Started with Apache Spark

Spark In-Depth Use Cases

As in Chapter 5, we will look at new and exciting ways to use Spark to solve real business problems. Instead of touching on simpler examples, it is time to get into the details. These are complicated problems that are not easily solved without today's big data technologies. It's a good thing Spark is such a capable tool set.

The first use case will show you how to build a recommendation engine. While the use case focuses on movies, recommendation engines are used all across the Internet, from applying customized labels to email, to providing a great book suggestion, to building a customized advertising engine against custom-built user profiles. Since movies are fun to watch and talk about and are ingrained in pop culture, it only makes sense to talk about building a movie recommendation engine with Spark.

The second use case is an introduction to unsupervised anomaly detection. We will explore data for outliers with Spark. Two unsupervised approaches for detecting anomalies will be demonstrated: clustering and unsupervised random forests. The examples will use the KDD'99 dataset, commonly used for network intrusion illustrations.

The final and perhaps most intellectually stimulating use case attempts to answer a half-century-old question: did Harper Lee write To Kill a Mockingbird? For many years, conspiracy buffs supported the urban legend that Truman Capote, Lee's close friend with considerably more literary creds, might have ghost-authored the novel. The author's reticence on that subject (as well as every other subject) fueled the rumors. This in-depth analysis digs into the question and uses Spark in an attempt to answer it once and for all.