The second publication in the O’Reilly Practical Machine Learning series, subtitled A New Look at Anomaly Detection by Ted Dunning and me, is being released this week. In the previous book, which focused on practical approaches to recommendation, we started with the idea that everyone thinks “I want a pony”. Here in the second book, what we want is to find the outlier, the zebra in a herd of ponies, the fish swimming against the school of fish, the rare event. In other words, the goal is to explore how to build a practical machine learning system that can detect anomalies. And in the spirit of the series, we do this by taking into account what is needed to make this work in practical settings.
Why use anomaly detection? The need for this methodology is widespread and growing. Anomaly detection provides a valuable solution to problems such as detecting security attacks, tracking abnormal changes in website traffic, providing appropriate alerts for medical device readings, and monitoring fluctuations in the performance of manufacturing equipment or in a wide variety of other sensor data in the rapidly expanding Internet of Things. In these examples, you won’t know exactly what the outlier will be, so you have to play the detective.
Anomaly detection is about finding what you don’t know to look for.
What would you do to start building a detector? Although the exact approach to an anomaly detection solution differs across a wide range of situations and settings, they all share a common starting point: In order to find what is different or anomalous, you must first figure out what is normal. And that can be a bit more challenging than you might think, especially in complex situations but also in surprisingly simple ones. To help with this challenge, we provide the reader with simple analogies that lead up to an understanding of how to build an adaptive, probabilistic model to discover “normal” and how to take the next step, to discover what is anomalous. Remember, you won’t know the exact description of the rare events for which you are watching. Instead, you’ll define the anomalies in contrast to what is normal.
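To make the idea of “model normal first, then score against it” concrete, here is a minimal sketch in Python. It is an illustration only, not the method described in the book: it assumes a single numeric signal whose normal behavior is roughly Gaussian, and the class name `AdaptiveNormalModel` and the smoothing parameter `alpha` are hypothetical names chosen for this example.

```python
import math


class AdaptiveNormalModel:
    """Exponentially weighted Gaussian model of "normal" (illustrative only)."""

    def __init__(self, alpha=0.05):
        self.alpha = alpha  # assumed smoothing factor; larger adapts faster
        self.mean = None    # running estimate of the normal level
        self.var = 0.0      # running estimate of the normal variance

    def score(self, x):
        """Return how many standard deviations x lies from the current mean."""
        if self.mean is None or self.var == 0.0:
            return 0.0  # not enough history to judge yet
        return abs(x - self.mean) / math.sqrt(self.var)

    def update(self, x):
        """Fold a new observation into the exponentially weighted estimates."""
        if self.mean is None:
            self.mean = x
            return
        delta = x - self.mean
        self.mean += self.alpha * delta
        self.var = (1 - self.alpha) * (self.var + self.alpha * delta * delta)


model = AdaptiveNormalModel(alpha=0.05)
readings = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 10.3, 9.9, 10.1, 25.0]
scores = []
for x in readings:
    scores.append(model.score(x))  # score first, against what was normal so far
    model.update(x)                # then let the model adapt
```

Each reading is scored before the model is updated, so an unusual value is judged against what was normal up to that point rather than against a model it has already influenced. In this toy series the final reading of 25.0 scores far higher than any of the earlier ones.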
Here’s a challenge: In this hypothetical situation, gray represents the normal pattern, the black line is a simple model of what is normal, and the x’s are anomalies. But at what value would you set the threshold for alerts without having too many false positives?
We describe some new approaches that improve a range of different types of anomaly detectors, from commonly used forms such as a manually set threshold for alerts to more complex approaches including detectors for sporadic events.
For example, the common threshold model as it is often implemented has some serious problems. A good first step to improve even this simple type of detector is to change the way that the threshold is set. We explain how to do that with a new approach known as the t-digest, which was developed and contributed to open source by MapR Chief Application Architect and co-author Ted Dunning. The t-digest is a way to accurately estimate extreme quantiles, which makes it very useful for setting the threshold for anomaly detection appropriately. The t-digest has been taken up by Apache Mahout, by Elasticsearch and others. It is available on GitHub, and we describe in detail how to use it.
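The idea of setting the threshold from an extreme quantile can be sketched in a few lines of Python. This sketch does not use the t-digest itself: as a stated assumption, it sorts the full history to compute an exact quantile, which stands in for the bounded-memory estimate that a t-digest would provide on streaming data. The function names `quantile_threshold` and `flag_anomalies` and the simulated data are hypothetical.

```python
import random


def quantile_threshold(values, q):
    """Return the empirical q-quantile of values, for 0 < q < 1.

    Assumption: exact sorting stands in here for a streaming sketch
    such as the t-digest, which estimates the same quantity.
    """
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    # Nearest-rank index, clamped to the last valid position
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]


def flag_anomalies(history, new_values, q=0.999):
    """Set the threshold from past scores, then flag new values above it."""
    threshold = quantile_threshold(history, q)
    return threshold, [v for v in new_values if v > threshold]


# Simulated anomaly scores observed during normal operation
rng = random.Random(42)
history = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

threshold, flagged = flag_anomalies(history, [0.5, 2.0, 6.0], q=0.999)
```

Choosing `q` directly expresses how often you are willing to be alerted during normal operation: with `q=0.999`, roughly one normal observation in a thousand exceeds the threshold. That is easier to reason about than a hand-picked absolute value.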
We also talk about the practical trade-offs between anomaly-driven and budget-driven projects and give useful pointers on how to build adaptive probabilistic models for anomaly detection in seasonally changing website traffic, in sensor data from systems such as public water systems, and in phishing attacks on a secure website.
Practical Machine Learning: A New Look at Anomaly Detection is available to download as an ebook. If you are going to Hadoop Summit in San Jose on June 3-5, stop by the MapR booth for a free print copy. The authors will be signing copies from 4:00 to 4:30 pm on Tuesday and on Wednesday of Hadoop Summit.
You can hear a presentation on anomaly detection by Ted Dunning at Hadoop Summit on Wednesday, June 4, at 2:35 pm called “How to Find What You Didn’t Know to Look For: Practical Anomaly Detection”.
Ted is also giving a fun presentation at 5:25 pm Wednesday at Hadoop Summit called “Hadoop and R Go to the Movies: Visualization in Motion”.
Ted’s third presentation at Hadoop Summit is on Thursday at 2:10 pm: “How to Determine Which Algorithms Really Matter”.
And if you missed the first publication in the series, you can download a copy of Practical Machine Learning: Innovations in Recommendation.