This is the first installment of a two-part series on the value of doing small data analyses on a big data collection. In this first article, we describe the applications and benefits of “small data” in general terms from several different perspectives. In Part 2 of this series, we’ll spend some quality time with one specific algorithm (Locally Linear Embedding) from a broader class of algorithms (Manifold Learning) that enable local subsets of data (i.e., small data) to be used in developing a global understanding of the full big data collection.
We often hear that small data deserves at least as much attention in our analyses as big data. While there may be as many interpretations of that statement as there are definitions of big data, there are at least two situations where “small data” applications are worth considering. I will label these “Type A” and “Type B” situations.
In “Type A” situations, small data refers to having a razor-sharp focus on your business objectives, not on the volume of your data. If you can achieve those business objectives (and “answer the mail”) with small subsets of your data mountain, then do it, at once, without delay!
In “Type B” situations, I believe that “small” can be interpreted to mean that we are relaxing at least one of the 3 V’s of big data: Velocity, Variety, or Volume. If we focus on a localized time window within high-velocity streaming data (in order to mine frequent patterns, find anomalies, trigger alerts, or perform temporal behavioral analytics), then that is deriving value from “small data.”
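To make the velocity case concrete, here is a minimal sketch of a localized time window over a stream. It assumes numeric readings and a deliberately simple anomaly rule (a reading more than `threshold` standard deviations from the mean of the last `window` readings). The function name, the parameters, and the sample readings are all hypothetical, and a production system would likely use a more robust detector.

```python
from collections import deque
from statistics import mean, pstdev


def windowed_anomalies(stream, window=5, threshold=3.0):
    """Yield (index, value) for readings far from the recent-window mean."""
    # The "small data": only the most recent `window` readings are kept.
    recent = deque(maxlen=window)
    for i, x in enumerate(stream):
        if len(recent) == window:
            mu = mean(recent)
            sigma = pstdev(recent)
            # Skip the test when the window is constant (sigma == 0).
            if sigma > 0 and abs(x - mu) > threshold * sigma:
                yield i, x
        recent.append(x)


# Hypothetical sensor readings with one obvious spike.
readings = [10, 11, 10, 12, 11, 10, 50, 11, 10, 12]
flagged = list(windowed_anomalies(readings))
print(flagged)
```

Because only the window is held in memory, the cost per reading stays constant no matter how long the stream runs.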
If we limit our analysis to a localized set of features (parameters) in our complex high-variety data collection (in order to find dominant segments of the population, or classes/subclasses of behavior, or the most significant explanatory variables, or the most highly informative variables), then that is deriving value from “small data.”
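One simple way to arrive at a localized set of features is to rank each variable by how strongly it tracks an outcome of interest and keep only the top few. The sketch below assumes absolute Pearson correlation as the ranking criterion, which is only one of many possible choices. The column names and values are made up for illustration.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def top_features(columns, target, k=2):
    """Return the names of the k columns most correlated (in absolute value) with target."""
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical customer features and a hypothetical outcome variable.
columns = {
    "visits": [1, 2, 3, 4, 5],
    "tenure": [5, 3, 4, 1, 2],
    "noise": [3, 1, 2, 1, 3],
}
spend = [2, 4, 6, 8, 10]
selected = top_features(columns, spend, k=2)
print(selected)
```

Correlation only captures linear relationships, so a variable that matters in a nonlinear way could be ranked low by this particular criterion.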
If we target our analysis on a tight localized subsample of entries in our high-volume data collection (in order to deliver one-to-one customer engagement, personalization, individual customer modeling, and high-precision target marketing, all of which still require use of the full complexity, variety, and high-dimensionality of the data), then that is deriving value from “small data.”
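As an illustration of the volume case, the sketch below picks out a tight local subsample around one customer: the `k` records closest to that customer, measured across every feature rather than a reduced set. It assumes Euclidean distance on numeric feature vectors that are already on comparable scales. The customer IDs and values are hypothetical.

```python
import heapq
import math


def local_neighborhood(records, target_id, k=2):
    """Return the IDs of the k records nearest to target_id in the full feature space."""
    target = records[target_id]
    others = ((cid, vec) for cid, vec in records.items() if cid != target_id)
    # nsmallest avoids sorting the whole collection when k is small.
    nearest = heapq.nsmallest(k, others, key=lambda item: math.dist(target, item[1]))
    return [cid for cid, _ in nearest]


# Hypothetical customers, each described by the same three features.
customers = {
    "a": (1.0, 2.0, 0.5),
    "b": (1.1, 2.1, 0.4),
    "c": (9.0, 8.0, 7.0),
    "d": (0.9, 1.8, 0.6),
    "e": (8.5, 8.2, 6.9),
}
neighbors = local_neighborhood(customers, "a", k=2)
print(neighbors)
```

A model fitted only on such a neighborhood sees very few rows but still uses the full dimensionality of each row, which is the point of this kind of small data.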
In both of the above situations (Type A and Type B), you might say that “less is more.” I would say, no matter which situation you are in, “When big data goes local, small data gets big!”
There is another way that focusing on small data can yield big results. This additional “localization” approach is applicable in the Type B situation. Specifically, it analyzes the properties of our data within localized subsets of the total data collection in a systematic, sequential manner, moving across the big data one “small data” step at a time. We can even run many of these local “small data” analyses in parallel (on a Hadoop cluster, for example), eventually blanketing massive swaths of our big data landscape. From these local results we can then infer the global properties of our entire data set.
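The local-to-global idea can be sketched with a much simpler statistic than the algorithm discussed in Part 2. In the example below, each chunk of the data is summarized independently (count, mean, and sum of squared deviations), and the local summaries are then merged into the global mean and variance without revisiting the raw data. A thread pool stands in for a cluster here, and the chunk size, function names, and data are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def local_summary(chunk):
    """Summarize one chunk: (count, mean, sum of squared deviations)."""
    n = len(chunk)
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return n, mean, m2


def merge(a, b):
    """Combine two local summaries into the summary of their union."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2


def global_mean_variance(data, chunk_size=4):
    """Infer the global mean and population variance from per-chunk summaries."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Each chunk is analyzed independently, so the work can run in parallel.
    with ThreadPoolExecutor() as pool:
        summaries = list(pool.map(local_summary, chunks))
    total = summaries[0]
    for summary in summaries[1:]:
        total = merge(total, summary)
    n, mean, m2 = total
    return mean, m2 / n


data = [float(x) for x in range(1, 11)]  # assumes a non-empty data set
g_mean, g_var = global_mean_variance(data)
print(g_mean, g_var)
```

Mean and variance merge exactly. Richer global properties, such as the shape of the data, need more careful stitching of the local views.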
The ability to infer global understanding from local views of the knowledge base is a powerful ally in the battle to manage, wrangle, and extract insights from massive data collections. We will examine one such localization approach (Locally Linear Embedding) in Part 2 of this series of articles. But first we offer here some words about the general importance of location in machine learning and analytics.
In an earlier article, “The Importance of Location in Real Estate, Weather, and Machine Learning,” we discussed the value of location, location, location in big data analytics and in data mining. Applying one meaning of location, we understand that geospatial context (customer location or sensor location) is an invaluable feature in many data science modeling efforts, because location provides significant contextual meaning to the behavior of the customer, the system, or the environment that is being monitored. Applying another meaning of location, we understand that the location of different data objects within our measured feature space provides major insights into the classification and categorization of those objects.
Location information lends powerful contextual meaning to individual entries in our data collection (e.g., customers, vehicles, sensors, weather, traffic, natural or human-caused disasters, and social media content). Focusing on an object’s location and what is local to that location is (once again) a very big deal in big data analytics projects across many different domains (e.g., health, security, transportation, finance, law enforcement, weather, and climate).
In Part 2 of this series, “When Big Data Goes Local, Small Data Gets Big,” we’ll take a deeper dive into Locally Linear Embedding as we explore the wonderful world of Topological Data Analysis and how it is becoming a powerful new discovery method for finding patterns, relationships, and correlations in large complex data sets.