Apache Mahout is a powerful, scalable machine-learning library that runs on top of Hadoop MapReduce.
Machine learning is a discipline of artificial intelligence that enables systems to learn based on data alone, continuously improving performance as more data is processed. Machine learning is the basis for many technologies that are part of our everyday lives. Some examples of applied machine learning algorithms include:
- Recommendation engines: Many websites make recommendations to users based on their past behavior and the behavior of other users. Netflix, for example, can recommend a movie to a user based on its similarity to other movies that user has enjoyed.
- Spam filtering: Nearly every modern email provider can automatically tell a spam message from a legitimate one and present only the latter to the user. These filtering engines use machine-learning algorithms such as clustering and classification.
- Natural language processing: Many of us have smartphones that understand what we mean when we ask "When are the niners playing next?" Making a computer understand this phrase is no simple task: it has to know that "niners" is slang for the San Francisco 49ers, that the 49ers are an American football team, and that it therefore needs to consult the National Football League's schedule to provide the answer. All of this was made possible by applying machine-learning algorithms to vast sets of language data to make these connections.
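To make the recommendation example above more concrete, here is a minimal sketch of item-based similarity scoring in plain Python. It does not use Mahout's API; the `ratings` data, the movie titles, and the function names are all hypothetical and chosen only for illustration. Each item is represented by the ratings users gave it, and unseen items are ranked by how similar they are to items the user already rated.

```python
from math import sqrt

# Hypothetical ratings (user -> {movie: rating}); illustrative data only.
ratings = {
    "alice": {"Alien": 5, "Aliens": 4, "Amelie": 1},
    "bob": {"Alien": 4, "Aliens": 5},
    "carol": {"Amelie": 5, "Alien": 1},
    "dave": {"Alien": 5},
}


def item_vector(item):
    """Return {user: rating} for every user who rated the item."""
    return {user: r[item] for user, r in ratings.items() if item in r}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(user):
    """Rank unseen items by similarity to the user's rated items,
    weighting each similarity by the rating the user gave."""
    seen = ratings[user]
    all_items = {item for r in ratings.values() for item in r}
    scores = {}
    for candidate in all_items - set(seen):
        cand_vec = item_vector(candidate)
        scores[candidate] = sum(
            cosine(cand_vec, item_vector(item)) * rating
            for item, rating in seen.items()
        )
    return sorted(scores, key=scores.get, reverse=True)


print(recommend("dave"))
```

With this toy data, "Aliens" ranks ahead of "Amelie" for `dave`, because the users who liked "Alien" also tended to like "Aliens". A production system would work on far larger, distributed data sets, which is the problem a library like Mahout is meant to address.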
Until recently, data scientists had to implement machine-learning algorithms by hand and adapt them to whatever computing framework they were using, which meant a significant amount of work. Now, with Hadoop and Mahout, they can write MapReduce jobs that call a number of predefined algorithms to build these kinds of applications much more easily.
Below is the current list of machine-learning algorithms exposed by Mahout:
- Collaborative Filtering
- Item-based Collaborative Filtering
- Matrix Factorization with Alternating Least Squares
- Matrix Factorization with Alternating Least Squares on Implicit Feedback
- Naive Bayes
- Complementary Naive Bayes
- Random Forest
- Canopy Clustering
- k-Means Clustering
- Fuzzy k-Means
- Streaming k-Means
- Spectral Clustering
- Dimensionality Reduction
- Lanczos Algorithm
- Stochastic SVD
- Principal Component Analysis
- Topic Models
- Latent Dirichlet Allocation
- Frequent Pattern Matching
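As an illustration of one entry in the list above, the following is a minimal single-machine sketch of k-means clustering (Lloyd's algorithm) in plain Python. It is not Mahout's implementation and does not use its API; the function name, parameters, and sample points are assumptions made for this example. The algorithm alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points.

```python
def kmeans(points, centroids, iterations=10):
    """Run Lloyd's algorithm on 2-D points.

    `centroids` is the caller-supplied list of initial centroids
    (a simplifying assumption; real systems choose them more carefully).
    """
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [
                (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids
            ]
            clusters[distances.index(min(distances))].append(p)

        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                mean_x = sum(x for x, _ in cluster) / len(cluster)
                mean_y = sum(y for _, y in cluster) / len(cluster)
                new_centroids.append((mean_x, mean_y))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid

        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, [(1, 1), (8, 8)])
print(centroids)
```

On this small sample the points separate into two groups of three. The appeal of a distributed library is running the same assignment and update steps over data sets too large for one machine.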
The next major version, Mahout 1.0, will bring significant changes to Mahout's underlying architecture, including:
- Scala: In addition to Java, Mahout users will be able to write jobs in the Scala programming language. Scala makes math-intensive applications much easier to program than Java does, so developers will be more productive.
- Spark & H2O: Mahout 0.9 and earlier relied on MapReduce as the execution engine. With Mahout 1.0, users can choose to run jobs on either Spark or H2O, resulting in a significant performance increase.
More details about the next release of Mahout can be found in this blog post.