A while back, we made a list from A to Z of a few of our favorite things in big data and data science. We have made a lot of progress toward covering several of these topics. Here’s a handy list of these write-ups (as well as an added bonus on one of those topics at the end).
A – Association rule mining: described in the article “Association Rule Mining – Not Your Typical Data Science Algorithm.”
C – Characterization: described in the article “The Big C of Big Data: Top 8 Reasons that Characterization is ‘ROIght’ for Your Data.”
H – Hadoop (of course!): described in the article “H is for Hadoop, along with a Huge Heap of Helpful Big Data Capabilities.” To learn more, check out the Executive’s Guide to Big Data and Apache Hadoop, available as a free download from MapR.
K – K-anything in data mining: described in the article “The K’s of Data Mining – Great Things Come in Pairs.”
L – Local linear embedding (LLE): described in the blog post series "When Big Data Goes Local, Small Data Gets Big."
N – Novelty detection (also known as “Surprise Discovery”): described in the article “Outlier Detection Gets a Makeover - Surprise Discovery in Scientific Big Data.” To learn more, check out the book Practical Machine Learning: A New Look at Anomaly Detection, available as a free download from MapR. As an added bonus (and because Surprise Discovery is my most favorite of all data science things), we provide below a few more insights into this all-important discovery method in big data analytics applications.
P – Profiling (specifically, data profiling): described in the article “Data Profiling – Four Steps to Knowing Your Big Data.”
Q – Quantified and Tracked: described in the article “Big Data is Everything, Quantified and Tracked: What this Means for You.”
R – Recommender engines: described in two articles: “Design Patterns for Recommendation Systems – Everyone Wants a Pony” and “Personalization – It’s Not Just for Hamburgers Anymore.” To learn more, check out the book Practical Machine Learning: Innovations in Recommendation, available as a free download from MapR.
S – SVM (Support Vector Machines): described in the article “The Importance of Location in Real Estate, Weather, and Machine Learning.”
ZZ – Zero bias, Zero variance: described in the article “Statistical Truisms in the Age of Big Data.”
N is for Novelty Detection
Finally, we take another look here at N – Novelty Detection, which goes by many other names: outlier detection, anomaly detection, deviation detection, and (my favorite) surprise discovery! The goal of novelty detection is to find the rare things in your data collection: the items that differ from the rest of the data, and the features that fall outside the bounds of your normal (and/or statistical) expectations.
Outliers generally fall into one of four broad categories:

1. Statistically explainable data points that are several standard deviations from the mean of the data distribution. You would not expect these in small data collections, but they will start popping up within big data collections that have millions or billions of data points.
2. Data quality problems. These outliers are important indicators that some data cleaning is required.
3. Data pipeline errors. These outliers indicate that something is wrong with the processing, wrangling, or analytics tools that you are using.
4. Discoveries. These outliers are the truly novel, interesting, unexpected, surprising, and potentially most insightful features in your big data collection: the proverbial "needle in the haystack!"
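To make the first category concrete, here is a minimal sketch (plain NumPy, with a synthetic Gaussian sample standing in for a real dataset) showing why multi-sigma points are expected, not surprising, once a dataset gets big:

```python
import numpy as np

# Synthetic stand-in for a big data collection: one million
# samples from a standard normal distribution.
rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

# Flag points more than 4 standard deviations from the mean.
# For a Gaussian, P(|z| > 4) is about 6.3e-5, so in a million
# points we *expect* roughly 63 such "outliers" -- they are
# statistically explainable, not discoveries.
z = (data - data.mean()) / data.std()
outliers = data[np.abs(z) > 4]
print(f"{len(outliers)} points beyond 4 sigma")
```

In a sample of 100 points, a 4-sigma value would be a genuine red flag; in a million points, a few dozen of them are baked into the statistics, which is why the first category only becomes visible at big data scale.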
Note that novelty detection also applies to “interesting subgraphs” within a graph (network) database, such as social networks. A well-documented historical example of this is the anomalous (unexpected) network connections among the 9-11 terrorists.
Some people (including this author) would say that novelty detection is the best and most sought-after outcome of data science applications on big data. We anticipate that very large data collections carry enormous potential for surprising discoveries. Such discoveries will span the full spectrum of statistics: from rare one-in-a-million (or one-in-a-billion) types of objects or events (novelties), to the complete statistical specification of entire classes of objects (based upon millions of instances of each class), and every use case in between those two extremes.
The growth in data volumes from all aspects of science, government, healthcare, retail, financial services, telecommunications, etc. (including data from social media, sensors, monitoring systems, and simulations) requires increasingly efficient and effective knowledge discovery and extraction algorithms. These algorithms are often applied in big data computing environments, such as Hadoop clusters. Among them are a large variety of anomaly detection methods (for outlier/novelty/surprise discovery). Novelty detection algorithms enable data scientists to discover the most "interesting" objects, events, and behaviors embedded within large, high-dimensional datasets. These items are often labeled the "unknown unknowns."
Effective novelty detection in data streams (including the Internet of Things) is essential for the rapid discovery of potentially interesting and/or hazardous events. Unexpected emerging conditions in hardware, software, or network resources need to be detected, characterized, and analyzed as soon as possible for obvious system health and safety reasons. Similarly, unusual or anomalous variations in customer behaviors, social events, mechanical devices, transportation systems, financial networks, natural environments, etc. must also be detected, characterized, and assessed promptly in order to enable rapid decision support in response to such events.
We have developed a new algorithm for novelty detection (KNN-DD: K-Nearest Neighbor Data Distributions) that defines an outlier as a point whose behavior (i.e., whose location in parameter space) deviates in an unexpected way from the rest of the data distribution. Our algorithm evaluates the local data distribution around a test data point and compares that distribution with the data distribution within the sample defined by its K nearest neighbors. Since this KNN-DD thing is a bit sciencey, if you’re a practical-minded reader who is interested in Novelty Detection (and Surprise Discovery), please download a copy of a new book from MapR titled Practical Machine Learning: A New Look at Anomaly Detection, and let the discoveries begin!
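For readers who want a feel for the idea before diving into the book, here is a rough, illustrative sketch of the KNN-DD concept as described above. This is a simplification for exposition only, not the reference implementation: it compares the distances from a test point to its K nearest neighbors against the pairwise distances among those neighbors, using a two-sample Kolmogorov-Smirnov test.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist
from scipy.stats import ks_2samp

def knn_dd_score(point, data, k=10):
    """Illustrative sketch of the KNN-DD idea: compare the distances
    from a point to its K nearest neighbors with the pairwise distances
    among those neighbors, via a two-sample KS test. Returns the KS
    p-value: a small value suggests the point's local distribution
    deviates from its neighborhood, i.e., a candidate outlier."""
    point = np.asarray(point, dtype=float).reshape(1, -1)
    dists = cdist(point, data)[0]           # point -> every data point
    nn_idx = np.argsort(dists)[:k]          # indices of K nearest neighbors
    neighbors = data[nn_idx]
    d_point_to_nn = dists[nn_idx]           # point -> its K neighbors
    d_nn_pairwise = pdist(neighbors)        # distances among the neighbors
    _, p_value = ks_2samp(d_point_to_nn, d_nn_pairwise)
    return p_value

# Tight 2-D Gaussian cluster, plus one far-away test point:
rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 1.0, size=(500, 2))
p_inlier = knn_dd_score(cluster[0], cluster[1:])   # point from the cluster
p_outlier = knn_dd_score([8.0, 8.0], cluster)      # point far from the cluster
```

An inlier's distances to its neighbors look much like the distances among the neighbors themselves, so the p-value stays high; the far-flung point produces two very different distance distributions and a tiny p-value, flagging it as a candidate surprise.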