When Big Data Goes Local, Small Data Gets Big - Part 2

We examined the value and utility of small data in Part 1 of this blog series. In that article, we defined “small” in terms of a localized subset of the whole big data collection, and we discussed several approaches and benefits of doing this.

In an earlier article, “The Importance of Location in Real Estate, Weather, and Machine Learning,” various meanings and applications of location-based knowledge discovery were described within the context of a powerful but strangely named machine learning algorithm: the Support Vector Machine (SVM).

In the remarks below, we summarize the significance and utility of another powerful but strangely named machine learning algorithm that focuses on “location”: Locally Linear Embedding (LLE). LLE is a specific example of the general category of Manifold Learning algorithms. (You might be wondering what percentage of machine learning algorithms have strange names like this, and you would be surprised and/or amused to discover that most of them do, as a quick perusal of the Journal of Machine Learning Research article titles reveals. In fact, we are not innocent in this regard: our own work on Novelty / Outlier / Anomaly Detection yielded our own contribution to the eclectic algorithm universe: KNN-DD = K-Nearest Neighbors Data Distributions.)

LLE localization infers the true global structure of the data by analyzing local segments of the data mountain. In some cases, LLE may be the only way to uncover truly complex interdependencies and interrelationships within high-dimensional data (for example, as shown in the LLE diagram here).

LLE helps us to solve a particular type of problem that occurs when we attempt to build predictive models: specifically, the awkward situation in which we discover that apparently the same set of inputs (independent variables) leads to completely different predicted output values for the dependent variable. To grasp how this could happen, take some time to examine the diagram on the LLE page here, which visualizes and resolves the apparent contradiction in the preceding sentence.

When we learn a predictive model f(x,y) from our data (for example, from data values {x,y}) such that the model predicts z=f(x,y), then that model function should output just one value of z for one set of variables (x,y). That is what we call a single-valued function. However, that is not true in the LLE diagram that we examined earlier. Why? Because that data distribution represents a multi-valued function: several different values of z correspond to the same pair of values {x,y}. This occurs simply because there is actually another independent variable (another feature, which may not be known yet, which is called a “hidden variable”) that corresponds with the location along the natural “hyperplane” (the curved surface, or manifold) that holds the data points.
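To make the multi-valued situation concrete, here is a minimal sketch in Python. It assumes a hypothetical S-shaped surface (our own illustration, not the surface in the LLE diagram), with a hidden variable t that measures position along the surface. The function name surface_point and the chosen t values are assumptions made for this example only.

```python
import math

def surface_point(t, y=0.0):
    """Hypothetical S-shaped surface: t is the hidden position along the surface."""
    x = math.sin(t)
    z = math.copysign(1.0, t) * (math.cos(t) - 1.0)
    return x, y, z

# Three different hidden positions that all produce the same measured (x, y).
hidden_ts = [-7 * math.pi / 6, math.pi / 6, 5 * math.pi / 6]
points = [surface_point(t) for t in hidden_ts]

xy_pairs = {(round(x, 6), round(y, 6)) for x, y, _ in points}
z_values = [round(z, 3) for _, _, z in points]

print(xy_pairs)   # a single (x, y) pair
print(z_values)   # three distinct z values for that pair
```

Viewed as a function of (x,y) alone, z is multi-valued. Viewed as a function of (t,y), each input yields exactly one z, which is the sense in which the hidden variable resolves the apparent contradiction.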

LLE is an example of a “topological” approach to data analytics, which is also used by the company Ayasdi. TDA (Topological Data Analysis) is a powerful discovery method for complex data. Discovering and making use of the natural “shape” of the data distribution is essential for effective analytics and data-driven decision-making.

So, in a nutshell, how does LLE work? Basically, it examines the structural distribution of data points in very localized regions in order to find the natural directions in which the data percolates away from that region. The percolation path will follow the natural surface of the data distribution, and will not “jump the gaps” (e.g., in the vertical direction in the LLE diagram discussed above).
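The two stages of this idea can be sketched in a few lines of NumPy. This is a simplified illustration under our own assumptions (a hypothetical S-shaped curve as test data, and hand-picked values for n_neighbors and reg), not a production implementation. For real work, scikit-learn provides sklearn.manifold.LocallyLinearEmbedding. The first stage describes each point as a weighted combination of its nearest neighbors only; the second stage looks for low-dimensional coordinates that preserve those local weights.

```python
import numpy as np

def lle(X, n_neighbors=8, n_components=1, reg=1e-3):
    """Minimal LLE sketch: local reconstruction weights, then a global embedding."""
    n = X.shape[0]
    # Pairwise squared distances; a point is never its own neighbor.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)
    neighbors = np.argsort(d2, axis=1)[:, :n_neighbors]

    W = np.zeros((n, n))
    for i in range(n):
        Z = X[neighbors[i]] - X[i]                      # neighbors centered on point i
        C = Z @ Z.T                                     # local Gram matrix (k x k)
        C += reg * np.trace(C) * np.eye(n_neighbors)    # regularize (k exceeds input dim)
        w = np.linalg.solve(C, np.ones(n_neighbors))
        W[i, neighbors[i]] = w / w.sum()                # weights for each point sum to 1

    # Bottom eigenvectors of (I - W)^T (I - W); the first one is constant, so skip it.
    I_minus_W = np.eye(n) - W
    M = I_minus_W.T @ I_minus_W
    _, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, 1:n_components + 1], W

# Hypothetical test data: 200 points along an S-shaped curve in two measured coordinates.
t = np.linspace(-1.5 * np.pi, 1.5 * np.pi, 200)
X = np.column_stack([np.sin(t), np.sign(t) * (np.cos(t) - 1.0)])

embedding, W = lle(X)
print(embedding.shape)
```

Because every weight is computed from a point's nearest neighbors only, nothing in the first stage can “jump the gaps” between distant layers of the curve. How cleanly the resulting coordinate follows the curve depends on the choice of n_neighbors and reg, so those values should be treated as tuning knobs rather than recommendations.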

An interesting aspect of the manifold (surface) learning process (in either LLE or TDA) is the fact that the semantically correct “distance” metric between two data points is the distance along the manifold (i.e., along the data surface) and is not the apparent distance in the (x,y,z) coordinate space of measured features.  

The true interdependencies, associations, and correlations within our data collection are traced out by the manifold (data surface) that LLE learns. Consequently, it is possible that two points A and B that are right on top of each other in (x,y,z) coordinates may in fact be very far apart on the natural surface of the data space. That means that any similarity metric that calculates the similarity between those two points A and B will need to give a very low value for the similarity. Likewise, any distance metric between A and B must reveal that there is a large distance between A and B (in the “hidden” natural coordinate space of the data).
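The gap between the two notions of distance can be illustrated with a small sketch. It assumes a hypothetical spiral (a Swiss-roll cross-section) as the data surface, and it approximates distance along the manifold by the shortest path through a k-nearest-neighbor graph. That graph-based shortcut is our stand-in for illustration (it is the idea behind the Isomap family of methods), not a step of LLE itself, and the function names are our own.

```python
import heapq
import math

def spiral(n=301):
    """Sample n points along a hypothetical spiral, t running from pi to 5*pi."""
    ts = [math.pi + 4.0 * math.pi * i / (n - 1) for i in range(n)]
    return [(t * math.cos(t), t * math.sin(t)) for t in ts]

def knn_graph(points, k=4):
    """Connect each point to its k nearest neighbors (local links only)."""
    graph = {i: [] for i in range(len(points))}
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for d, j in dists[:k]:
            graph[i].append((j, d))
            graph[j].append((i, d))   # keep the graph symmetric
    return graph

def manifold_distance(graph, src, dst):
    """Shortest path length through the neighbor graph (Dijkstra's algorithm)."""
    best = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > best.get(u, math.inf):
            continue
        for v, w in graph[u]:
            nd = d + w
            if nd < best.get(v, math.inf):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return math.inf

points = spiral()
a, b = 75, 225   # t = 2*pi and t = 4*pi: same direction, adjacent turns of the spiral
euclidean = math.dist(points[a], points[b])
along_surface = manifold_distance(knn_graph(points), a, b)
print(euclidean, along_surface)
```

Points A and B sit directly across one gap of the spiral, so their straight-line separation is small, while the path that stays on the data surface has to travel a full turn. A similarity metric built on the second number, not the first, is the one that reflects the structure described above.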

Since distance and/or similarity metrics are required in essentially all data mining clustering algorithms as well as in some classification algorithms (e.g., K nearest neighbors), it is imperative to discover the natural shape and manifold of the data in order to develop and apply correct and meaningful distance and similarity metrics.

In the end, the focus on very small local regions of the big data collection ultimately enables the correct clusters, segments, categorizations, and classifications of the data points to be assigned. That makes “small data” a very big deal in such complex big data distributions.

So, when we get local with our big data by concentrating on the behavior of objects in smaller localized units, we have the potential for significant discoveries from those small data subsets. If our application allows us to analyze those byte-sized “small data” chunks in parallel, then parallel computing environments like the quick-start Hadoop clusters from MapR would be a perfect match to such a problem. Therefore, don’t get distracted by all of the talk about big data’s big volume. You can go local with big data, and get big results from small data.
