Real estate experts like to say that the three most important features of a property are: location, location, location! Likewise, weather events are highly location-dependent. We will see below how a similar perspective is also applicable to machine learning algorithms.
Location, Location, Location
In real estate, the buyer is first and foremost concerned about location for at least three reasons: (a) the desirability of the surrounding neighborhood; (b) the proximity to schools, businesses, services, etc.; and (c) the value of properties in that area. Similarly, meteorologists tell us that all weather is local. Location is significant in weather for at least three reasons as well: (a) specific weather events are almost impossible to predict because of the massive complexity of micro-scale interactions among atmospheric phenomena that are spread over macro-scales of hundreds of miles; (b) the specific outcome of a weather prediction may occur only in highly localized areas; and (c) the minute details of a location (topography, hydrology, structures) are too specific to be included in regional models, and yet they are very significant variables in micro-weather events. Side note: we might have a good start here on generating some predictive models (for real estate sales or for weather), if we could parameterize the above location-based features and score them appropriately.
Another aspect of “location” is the boundary region between different areas. This boundary region can affect real estate sales, especially if a desirable area is adjacent to an undesirable area. While conditions (prices, market factors, resale values) may be well understood deep within each of the two areas, there is more uncertainty in the boundary region. The same is true for the weather, as was especially evident in the big snow and ice storms that swept across the United States on March 3, 2014. Those of us in the Baltimore-Washington region were expecting significant snow, freezing rain, and sleet. What we received was a moderate amount of snow across most of the region and not much else. This “less significant” weather event occurred partly because cold dry air from the north won the battle against warm wet air from the south, pushing drier air into the region than was expected. Wrong predictions for massive snowfalls are not unusual in this part of the country, primarily because this latitude often sits within the boundary region between the northern and southern weather circulation patterns. It is difficult to predict reliably which pattern will win the battle in the boundary region during any particular storm.
Location in Machine Learning
Location is also very important in many machine learning algorithms. The simplest classification (supervised learning) algorithms in machine learning are location-based: classify a data point based on its location on one side or the other of some decision boundary (Decision Tree), or classify a data point based on the classes of its nearest neighbors (K-nearest neighbors = KNN). Furthermore, clustering (unsupervised learning) is intrinsically location-based, using distance metrics to ascertain similarity or dissimilarity among intra-cluster and inter-cluster members. All of this is a natural consequence of the fact that humans place things into different categories (or classes) when we see that different categories of items are clearly separated in some feature space (i.e., occupying different locations in that space). The challenge to data scientists is to find the best feature space for distinguishing, disentangling, and disambiguating different classes of behavior. Sometimes (though not often) those “best” features are the ones that we measured at the beginning, but we can usually discover improved classification features as we explore different combinations (linear and nonlinear) of the initial measured attributes. Those improved features then represent a phase space in which a previously unseen data item will receive an accurate classification simply based on that item’s location within the improved feature space.
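To make the location-based idea concrete, here is a minimal pure-Python sketch of a k-nearest-neighbor classifier; the points, labels, and choice of k below are made-up for illustration:

```python
from collections import Counter
import math

def knn_classify(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query point to every training point
    dists = sorted(
        (math.dist(query, p), label)
        for p, label in zip(train_points, train_labels)
    )
    # Majority class among the k closest neighbors
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D feature space: class "A" clusters near the origin, "B" near (5, 5)
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_classify(points, labels, (0.5, 0.5)))  # a point located in A's region
print(knn_classify(points, labels, (5.5, 5.5)))  # a point located in B's region
```

The prediction depends on nothing but where the query point sits relative to its labeled neighbors, which is exactly the "location, location, location" principle at work.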
Another challenge to data scientists is to explore increasingly higher-dimensional parameter spaces in an attempt to discover new subclasses of known classes (unknown knowns). Those subclasses may project (in lower dimensions) on top of one another in some feature space (hence the initial incorrect assignment of all the data items in those subclasses to a single class), but the subclasses may separate from one another when additional dimensions are added. This may lead to discoveries of new properties of a physical system, new customer behaviors, new threat vectors in cybersecurity, improved diagnoses in medical practice, fewer false positives in testing systems, or improved precision in information retrieval of documents.
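A tiny made-up example of this projection effect: two subclasses whose values overlap completely in one feature, yet separate cleanly once a second feature (dimension) is added:

```python
# Two subclasses that look like ONE class when projected onto feature x alone,
# but separate cleanly once a second measured feature y is added.
subclass_1 = [(1.0, 0.1), (2.0, 0.2), (3.0, 0.1)]   # low y values
subclass_2 = [(1.1, 5.0), (2.1, 5.2), (2.9, 5.1)]   # high y values

# Projection onto x alone: the two x-ranges overlap -> one apparent class
x1 = [p[0] for p in subclass_1]
x2 = [p[0] for p in subclass_2]
overlap_in_x = max(min(x1), min(x2)) < min(max(x1), max(x2))

# In the full (x, y) space a simple rule (y < 2.5) separates them perfectly
separable_in_xy = (all(p[1] < 2.5 for p in subclass_1)
                   and all(p[1] >= 2.5 for p in subclass_2))

print(overlap_in_x, separable_in_xy)  # True True
```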
Boundary Cases in Machine Learning
In machine learning, the most difficult items to classify correctly (or to place robustly into a specific cluster) are those within (or near) the boundary region between classes (or clusters). These items may not be accurately distinguishable by the set of decision rules inferred in a decision tree model, or they may have roughly equal numbers of nearest neighbors from the different possible classes (leading to poor KNN performance), or they may have equal affinity to two different clusters. As a consequence, one could conclude that these particular data items are not at all useful in cluster-building or in constructing an accurate classifier. We may say to ourselves: “how can these items be useful if we cannot even place them within a category with better than 50% accuracy or repeatability?” (This is similar to how I react to weather forecasts for major snow events in my area – cautiously uncertain!) While this attitude is understandable, it is actually wrong. In fact, the items in the boundary region are golden!
In the field of supervised machine learning, one of the most powerful and successful classification algorithms is SVM (Support Vector Machines). This is a strange name for an algorithm. It also refers to the boundary region data points in a strange way – as support vectors! So, what are these support vectors? They are the data items in the boundary region! They are precisely the labeled (classified) data items in the training set that provide the most powerful means to distinguish, disentangle, and disambiguate different classes of behavior. These are the data points that carry the most vital information about what distinguishes items on either side of a decision boundary. These are the items at the front lines of the battle for classification accuracy. They are the standard-bearers for their class, the ones whose classification rules are most critical in building the most accurate classifier for your data collection. Their feature vectors are therefore the “support vectors” for their class.
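The article does not prescribe an implementation, but the idea can be sketched with a simple linear SVM trained by Pegasos-style sub-gradient descent on made-up data. The training points whose final margin is at (or near) 1 are the support vectors; the tolerance of 1.1 below is an arbitrary choice to absorb the approximate optimization:

```python
import random

def train_linear_svm(points, labels, lam=0.01, epochs=2000, seed=0):
    """Pegasos-style sub-gradient descent for a linear SVM (a sketch, not a
    production solver). Labels are +1 / -1; a constant feature is appended
    so the bias term is learned along with the weights."""
    rng = random.Random(seed)
    data = [(x + (1.0,), y) for x, y in zip(points, labels)]
    w = [0.0] * len(data[0][0])
    t = 0
    for _ in range(epochs):
        for x, y in rng.sample(data, len(data)):   # shuffled pass over the data
            t += 1
            eta = 1.0 / (lam * t)                  # decaying learning rate
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            w = [(1 - eta * lam) * wi for wi in w]            # regularization shrink
            if margin < 1:                                    # hinge-loss violation
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

def support_vectors(points, labels, w, tol=1.1):
    """The training points on or inside the (approximate) margin: those with
    y * (w . x) <= ~1 are the ones that carry the boundary information."""
    svs = []
    for x, y in zip(points, labels):
        score = y * sum(wi * xi for wi, xi in zip(w, x + (1.0,)))
        if score <= tol:
            svs.append(x)
    return svs

# Two separable classes along a diagonal; the points nearest the gap,
# (2, 2) and (6, 6), should emerge as the support vectors.
pts = [(0, 0), (1, 1), (2, 2), (6, 6), (7, 7), (8, 8)]
ys = [-1, -1, -1, 1, 1, 1]
w = train_linear_svm(pts, ys)
print(support_vectors(pts, ys, w))
```

Notice that the points deep inside each class drop out entirely: only the "front line" items near the gap determine the decision boundary, which is why they alone are called support vectors.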
This is all great! But, unfortunately, the boundary region (like any “war front” or snowstorm weather front) is a messy place. There is much confusion. The boundary lines are usually not straight – a simple linear classification boundary is unlikely to be realistic or accurate. Consequently, SVM is invoked to discover the complex nonlinear decision surface (a hyperplane in a transformed feature space) that separates most cleanly (with maximum margin) the support vectors representing the different classes. This is no easy task. SVM in this general form is nonlinear, typically requiring a kernel transformation from your measured features (data attributes) to some other, more complex feature space. Discovering these transformation rules is computationally intensive: the training set must be explored in depth, examining all pairwise combinations of data items in order to find the support vectors (the data items within and around the boundary region between the classes), so the cost scales as N-squared, where N is the number of data items in the training set. If N is large, as in most big data projects these days, then executing the SVM algorithm can become computationally prohibitive. However, large N (i.e., Big Data) is actually a powerful ally in SVM: in a small data collection you cannot be certain that you will actually have instances of data items in the boundary region, so you are not guaranteed to have many (or any) good support vectors with which to train your model. With a massive big data sample, it is much more likely that you will have sufficient examples of support vectors to build an accurate predictive model.
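The N-squared cost can be seen directly in the kernel (Gram) matrix, which holds one kernel evaluation for every pair of training points. A small sketch using an RBF kernel (the points and the gamma value below are made-up):

```python
import math

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel: the similarity of two points in an implicit,
    higher-dimensional feature space, without computing that space directly."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)

points = [(0, 0), (1, 0), (0, 1), (5, 5)]
n = len(points)

# The Gram matrix holds one kernel evaluation for EVERY pair of points:
# n * n entries, which is where kernel SVM's N-squared cost comes from.
gram = [[rbf_kernel(points[i], points[j]) for j in range(n)] for i in range(n)]

print(sum(len(row) for row in gram))  # 16 kernel evaluations for N = 4
```

At N = 4 this is trivial, but at N = 10 million the same matrix would require 100 trillion kernel evaluations, which is exactly the bottleneck the next section addresses.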
MapReduce and Hadoop to the Rescue
The MapReduce programming model, implemented in Hadoop systems, can help crush the computational N-squared bottleneck in SVM algorithms. Divide and conquer is a good strategy on this battle front. In particular, the large-N training set can be subdivided into many small-N subsets. SVM can then be applied to each of those subsets in parallel, on a Hadoop cluster, to search for and identify support vectors at much greater speed than on the full dataset (since the N-squared cost now applies to the small N on each cluster node rather than to the large N overall). The results from each of those preliminary SVM models can be combined into a master set of potential support vectors. Another round of SVM can be executed using new, different subsets of the training set, again with increased performance compared to the full dataset. Combining and comparing the results of these multiple SVM runs should converge to a final set of optimum support vectors and thus lead efficiently to a solution for the maximum-margin separating hyperplane, as described in greater detail in this article: “Support Vector Machines and Hadoop: Theory vs. Practice”. Finally, the result will be an accurate (location-based) SVM classifier of complex (large-variety) data, even in the uncertain boundary region between different classes of behavior.
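The divide-and-conquer pattern described above can be sketched in plain Python (no Hadoop required for illustration). As a stand-in for a full per-subset SVM solver, the map step below uses a simple heuristic, keeping the points of each class that lie closest to the opposite class as support-vector candidates; the reduce step pools those candidates and runs one final, much smaller extraction pass:

```python
def nearest_to_other_class(chunk, k=2):
    """Keep the k points of each class that lie closest to the opposite
    class: a simple stand-in for the support-vector candidates that a
    real SVM run would extract from this chunk."""
    pos = [x for x, y in chunk if y == 1]
    neg = [x for x, y in chunk if y == -1]
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    def closest(side, other):
        return sorted(side, key=lambda p: min(sq_dist(p, q) for q in other))[:k]
    return [(p, 1) for p in closest(pos, neg)] + [(p, -1) for p in closest(neg, pos)]

# Toy training set: class -1 near the origin, class +1 near (10, 10)
data = [((i, j), -1) for i in range(4) for j in range(4)] + \
       [((10 - i, 10 - j), 1) for i in range(4) for j in range(4)]

# "Map": split the training set into 4 subsets (as if on 4 cluster nodes)
# and extract candidate support vectors from each subset in isolation
chunks = [data[i::4] for i in range(4)]
candidates = [nearest_to_other_class(c) for c in chunks]

# "Reduce": pool the per-chunk candidates, then run one final extraction
# pass over the pooled set, which is far smaller than the original data
pooled = [item for cand in candidates for item in cand]
final = nearest_to_other_class(pooled)

print(len(data), len(pooled), len(final))  # 32 16 4
```

Each map step only ever compares points within its own small chunk, so the expensive pairwise work is done on small-N pieces in parallel; the final pass touches only the pooled candidates, not the full training set.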
MapR has recently been named a leader in Big Data Hadoop Solutions by Forrester Research Inc. Using MapR’s Hadoop solution on big data analytics problems is an excellent way to handle the complex challenges of location, location, location, by enabling computationally intensive algorithms like SVM to make discoveries in big data collections efficiently with high throughput.