The K’s of Data Mining – Great Things Come in Pairs

In a previous post, I presented a “Big Data A to Z Glossary of my Favorite Data Science Things”.  I would like to focus here on one of those favorite things, which also happens to be the entry associated with my favorite letter.  

When I learned to read and write (in first grade), my favorite letter of the alphabet soon became the letter K.  Why?  Well, it was primarily because my first name has a pair of them. In fact, 50% of my first name consists of K’s. Furthermore, very few of my classmates could boast even one K in their name! K was special to me.

Years later, when I was a PhD student in Astronomy at Caltech, I began to work in an exciting new field of astrophysics: studying the dynamical evolution of galaxies through their collisions and mergers with each other! In order to understand what happens when two galaxies crash into each other, I needed to investigate a sample of galaxy pairs (i.e., two galaxies in close proximity to one another). As it turns out, one of the early pioneers in the field had just compiled a catalogue of double galaxies – his catalogue provided detailed data for several hundred examples of galaxies in pairs.  That researcher’s name was Igor Karachentsev.  So, when I began publishing research papers on my computational simulations and new observations of galaxy pairs, I used Karachentsev’s catalogue identifiers to label the specific pairs that I was investigating, including K99, K175, K564, etc. It was not long after this that I was confronted by another astronomer who accused me of being very self-centered for naming these galaxies after myself: K-this and K-that!  Of course, he had misunderstood the origin of the K in the galaxy identifiers.  I was amused by that coincidence – and so my interest in the letter K continued.

As I migrated from astrophysics research to data mining research, and ultimately data science, starting about 15 years ago, I began to see an interesting pattern – the occurrence of algorithms and concepts that began with the letter K.  How kould this be happening? Kould it be just a koincidence? (Yes, those misspellings are intentional.)

Here are some examples of what I am talking about – notice the interesting pairing of a K-word with its corresponding data mining concept or technique:

  • KDD (Knowledge Discovery in Databases) = data mining
  • KNN (K-Nearest Neighbors) = classification algorithm
  • K-means = clustering algorithm
  • K-itemset = the unit of data in association rule mining
  • KNN-DD (K-Nearest Neighbors Data Distributions) = outlier detection algorithm
  • KD-trees = hierarchical (divide and conquer) indexing scheme for rapid search of high-dimensional data

In all of these cases except the first one (KDD), the K refers either to a group of K data entries (KNN, K-itemset, KNN-DD), to K groupings in the data space (K-means), or to the K dimensions of the data space being partitioned (KD-trees). Most of the K-algorithms listed above are among the most representative, popular, and straightforward methods in their respective categories of machine learning (classification, clustering, and outlier detection = supervised, unsupervised, and semi-supervised learning, respectively).
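As a quick illustration of that pairing, here is a small scikit-learn snippet showing KNN classification (supervised) and K-means clustering (unsupervised) side by side – the toy data, labels, and parameter choices are simply assumptions made for this example:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier  # KNN: supervised classification
from sklearn.cluster import KMeans                   # K-means: unsupervised clustering

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))               # 200 items, 2 attributes each
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # toy labels for the supervised case

# KNN: classify a new item by a vote among its K nearest labeled neighbors.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print("KNN prediction for (0.5, 0.5):", knn.predict([[0.5, 0.5]]))

# K-means: partition the same items into K clusters without using any labels.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("K-means cluster sizes:", np.bincount(km.labels_))
```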

Each of the above algorithms has at least one parallel implementation, which can be run on Hadoop clusters, in the cloud, and/or using MapReduce techniques. For example, let's consider the ubiquitous (perhaps overused) K-means clustering algorithm. K-means begins with the arbitrary specification of the number of clusters (number = K) and an assigned initial value for each cluster’s centroid (i.e., the mean of the attribute values of that cluster’s members). These initial values are just guesses, though hopefully they are intelligent and informed guesses.  After this step, the algorithm is run – scan the database, one item at a time, assigning each item to the cluster whose centroid is closest to it (i.e., most similar, using some distance or similarity metric).  After the first full scan of the database, all of the data items have been assigned to one of the K clusters. At that point, the real centroids of the clusters are calculated (by calculating the mean values of the cluster attributes for all of the cluster members).  Then, the database is scanned again, one item at a time, assigning each item again to the cluster whose centroid is closest to that item.  Centroids are re-calculated, and the database scan and item-cluster assignments are repeated, until the cluster centroids converge (either because the centroids do not change from one iteration to the next, or because the changes are smaller than some pre-specified threshold).  The final result of K-means clustering is a set of K clusters with fairly homogeneous memberships – the items in one cluster are more similar to one another than they are to the members of other clusters.

How do we know if our initial choice of K was appropriate? And how can we determine how “good” the clustering is (i.e., how compact the clusters are relative to their separations)?  Fortunately, there are some simple but useful cluster evaluation metrics that can be used to validate the strength (purity and isolation) of the clusters, including this pair of metrics: the Dunn Index and the Davies-Bouldin Index.
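To make those steps concrete, here is a minimal sketch of the K-means loop described above in plain NumPy – the value of K, the convergence threshold, and the random initialization are all illustrative assumptions – with scikit-learn’s davies_bouldin_score applied at the end as one of the two evaluation metrics just mentioned (lower values indicate more compact, better-separated clusters):

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score  # one of the two metrics mentioned above

def kmeans(X, K, tol=1e-4, max_iter=100, seed=0):
    """Plain K-means: assign items to the nearest centroid, recompute centroids, repeat."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # initial "informed guesses"
    for _ in range(max_iter):
        # Scan the data set: assign each item to the cluster with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-calculate each cluster's centroid as the mean of its current members.
        new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                  else centroids[k] for k in range(K)])
        # Converged when the centroids move less than the pre-specified threshold.
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids

X = np.random.default_rng(1).normal(size=(300, 2))
labels, centroids = kmeans(X, K=3)
print("Davies-Bouldin Index (lower is better):", davies_bouldin_score(X, labels))
```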

So, where is the parallelization in the above sequence of steps?  It can actually be implemented in a pair of different places.  First, it can be used very effectively in the database scans that take place at each step of the cluster iteration process. The data items in the database can be partitioned across different nodes in the parallel computing environment, where the assignment of each data item to one of the K clusters is determined.  That can be done in an embarrassingly parallel manner, with no reference to other data items.  Similarly, parallelization can be applied in another stage of the K-means clustering algorithm: the calculation of the mean (centroid) of a massive set of data items (e.g., within individual clusters) can also be computed quickly using a MapReduce approach, which carries out its task by invoking the “Map” function on (key, value) pairs of data entries.
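As a rough single-machine sketch of that idea (not Hadoop’s or MapR’s actual API), the assignment and centroid-update steps of one K-means iteration can be expressed as a map function and a reduce function; in a real MapReduce job the mapper calls would be spread across nodes and the shuffle/reduce would be handled by the framework:

```python
import numpy as np

def map_assign(point, centroids):
    """Map step: emit (cluster_id, (point, 1)) for the centroid nearest to this item."""
    k = int(np.argmin([np.linalg.norm(point - c) for c in centroids]))
    return k, (point, 1)

def reduce_centroids(key_value_pairs):
    """Reduce step: sum the points and counts per cluster key, then divide to get new centroids."""
    acc = {}
    for k, (point, count) in key_value_pairs:
        if k not in acc:
            acc[k] = [np.zeros_like(point, dtype=float), 0]
        acc[k][0] += point
        acc[k][1] += count
    return {k: total / n for k, (total, n) in acc.items()}

X = np.random.default_rng(2).normal(size=(10, 2))
centroids = X[:3]                                  # three initial centroid guesses
mapped = [map_assign(p, centroids) for p in X]     # this loop is embarrassingly parallel
new_centroids = reduce_centroids(mapped)
print(new_centroids)
```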

So, whether you are Hadoop’ing and data mining some big data with the MapR M7 Enterprise Database Edition for Hadoop or just exploring the rich kollection of data science koncepts, spending some quality time with the letter K is a good place to start.

