Advances in Apache Mahout: Highlights for the 0.9 Release

Scalable machine learning for Apache Hadoop-based systems got a boost recently when the Apache Mahout PMC approved release of the 0.9 version of Mahout. This release is the second in less than a year, and it’s another step toward a stable, mature scalable machine learning library. The open source Apache Mahout community has been very active in the last year, with new releases, active discussions on the user and developer mailing lists, new publications and engagement via Twitter. This activity reflects the growing interest in machine learning on large data sets among a wider audience including those in real-world business settings.

What’s new with Mahout 0.9? This release was coordinated by release manager and committer Suneel Marthi. It has three overarching goals:

  • Bug clean-up
  • Features new for 0.9
  • Groundwork for the fully mature 1.0 version

In keeping with the large changes introduced in version 0.8 in July 2013, the current release takes further steps to streamline Mahout so that it focuses on the features and algorithms that are most effective and most widely used. In addition, 0.9 introduces new capabilities that strengthen the project in preparation for full maturity, when stability and well-rounded functionality will be particularly important. Some highlights of the new features and changes for the 0.9 version of Mahout are described here.

New Mahout Feature: Scala Support

The 0.9 release of Mahout has added support for Scala, whose name is short for “scalable language”. Scala is interoperable with Java and runs on the JVM. The “scalable” in the name refers to the language's design, which combines object-oriented and functional language concepts.

Why include Scala with the Mahout project? Scala provides a more concise way to write mathematical programs. Inclusion of Scala DSL bindings for Mahout Math linear algebra is another step in strengthening and expanding the Mahout Math Library. The Math Library is valuable not only as a tool chest for Mahout projects, including anomaly detection, recommendation, clustering and more, but also as a tool for non-Mahout projects. The Scala support is one more way in which Mahout is looking forward to its goals for maturity. The Scala support contribution was made by Apache Mahout committer Dmitry Lyubimov. See the release notes and JIRA for details [1].

New Mahout Feature: Recommenders Using Search Technology

An innovative approach to recommendation is to exploit search technology to deploy a recommendation engine. The output of recommender learning models, such as those built at scale using Mahout algorithms, can be converted to a form that can be indexed and queried using text-based search technologies. Using search to deploy a recommendation system greatly simplifies implementation and has considerable benefits for production-level recommendation in real-world business settings. Pat Ferrel worked up an example of this approach using search with recommendation and contributed a description of it to Mahout [1].
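To make the idea concrete, here is a minimal sketch in plain Java. It is not Mahout or any search engine API: the class name, the item names and the indicator lists are all hypothetical, and a small in-memory inverted index stands in for a real search engine. It assumes that the learning step has already produced, for each item, a list of "indicator" items that tend to co-occur with it. Each item is then indexed like a document, and the user's recent history is used as the query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy stand-in for a search index (hypothetical, for illustration only).
// Each item is a "document" whose indicator field lists co-occurring items.
class SearchRecommenderSketch {
    // Inverted index: indicator item -> items whose indicator field contains it.
    private final Map<String, List<String>> postings = new HashMap<>();

    // Index one item together with its indicator items.
    void index(String itemId, List<String> indicators) {
        for (String indicator : indicators) {
            postings.computeIfAbsent(indicator, k -> new ArrayList<>()).add(itemId);
        }
    }

    // The query is the user's history; the score is the number of matching
    // indicators. A real search engine would apply its own relevance weighting.
    List<String> recommend(List<String> history, int limit) {
        Set<String> seen = new HashSet<>(history);
        Map<String, Integer> scores = new HashMap<>();
        for (String h : seen) {
            for (String item : postings.getOrDefault(h, Collections.<String>emptyList())) {
                if (!seen.contains(item)) {
                    scores.merge(item, 1, Integer::sum);
                }
            }
        }
        List<String> ranked = new ArrayList<>(scores.keySet());
        // Highest score first; ties broken alphabetically for stable output.
        ranked.sort((a, b) -> {
            int byScore = Integer.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(limit, ranked.size())));
    }

    public static void main(String[] args) {
        SearchRecommenderSketch index = new SearchRecommenderSketch();
        index.index("hobbit", Arrays.asList("lotr", "narnia"));
        index.index("dune", Arrays.asList("foundation", "lotr"));
        index.index("emma", Arrays.asList("persuasion"));
        System.out.println(index.recommend(Arrays.asList("lotr", "narnia"), 2));
    }
}
```

In a deployment along these lines, the indexing and scoring would be delegated to the search engine itself, which is where the simplification comes from.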

A short publication that explains recommenders and search was recently published by O’Reilly: Practical Machine Learning: Innovations in Recommendation by Ted Dunning and Ellen Friedman (Feb 2014) [2]. It is presently available for download courtesy of MapR: http://bit.ly/1owrtpe.

New Mahout Feature: Groundwork for Neural Networks

The classification options in Mahout are good but have been somewhat less broadly strong than the offerings for recommendation. That is now changing, with improvements in the form of early neural network support.

What are neural networks?

Originally inspired by biological systems like the human visual cortex, neural networks are non-linear learning systems arranged in layers that are ultimately capable of generalization. Each layer transforms the output of the layer below it, with the final output being a high-level abstraction if the learning is successful.
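The layer-by-layer transformation can be sketched in a few lines of plain Java. This is an illustration only and not Mahout's MLP code: the class and method names are hypothetical, and the weights are picked by hand rather than learned. They are chosen so that a network with one hidden layer computes XOR, a pattern that no single linear layer can represent.

```java
// Minimal forward pass through a layered network (illustrative sketch,
// not Mahout's API). The weights below are hand-picked, not trained.
class LayeredNetworkSketch {
    // WEIGHTS[l][j][i]: weight from unit i of layer l to unit j of layer l + 1.
    static final double[][][] WEIGHTS = {
        { {20, 20}, {-20, -20} },   // hidden layer: roughly OR and NAND
        { {20, 20} }                // output layer: roughly AND of the two
    };
    static final double[][] BIASES = {
        {-10, 30},
        {-30}
    };

    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // One layer: out[j] = sigmoid(bias[j] + sum_i weights[j][i] * in[i]).
    static double[] layer(double[] in, double[][] weights, double[] bias) {
        double[] out = new double[weights.length];
        for (int j = 0; j < weights.length; j++) {
            double sum = bias[j];
            for (int i = 0; i < in.length; i++) {
                sum += weights[j][i] * in[i];
            }
            out[j] = sigmoid(sum);
        }
        return out;
    }

    // Each layer transforms the output of the layer before it.
    static double[] forward(double[] input, double[][][] weights, double[][] biases) {
        double[] activation = input;
        for (int l = 0; l < weights.length; l++) {
            activation = layer(activation, weights[l], biases[l]);
        }
        return activation;
    }

    static double xor(double a, double b) {
        return forward(new double[] {a, b}, WEIGHTS, BIASES)[0];
    }

    public static void main(String[] args) {
        // Outputs are close to 0 for equal inputs and close to 1 otherwise.
        for (double a : new double[] {0, 1}) {
            for (double b : new double[] {0, 1}) {
                System.out.println(a + " xor " + b + " -> " + xor(a, b));
            }
        }
    }
}
```

In a trained network the weights would be learned from example data instead of being written down by hand.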

Neural networks have developed from simple systems in the 1950s, through multiple phases of euphoria and disillusionment, to the current state in which they have recently been used in the most advanced voice recognition and image analysis software available. The neural network approach is one way to do a powerful form of multi-layer machine learning known as deep learning.

Mahout neural networks

Why include neural networks in Apache Mahout?

The 0.9 release of Mahout has added the Multilayer Perceptron (MLP) classifier as a first step toward providing neural networks and improving the offerings for classification. Neural networks are a way of building a powerful classifier that can represent very complex patterns, a much-needed form of classification in many circumstances. What’s important about neural nets for this purpose is that, despite their complexity, the models are trainable from example data, unlike many other complex approaches. The MLP in the 0.9 release is an early version, provided to get feedback before it is fully integrated into Mahout to work with Mahout vectors. This contribution was made by Yexi Jiang.

New Mahout Feature: Online Algorithm to Compute Accurate Quantiles

A new Mahout feature with widespread applicability is an online algorithm that uses one-dimensional clustering to compute very accurate extreme quantiles. Other approaches would require storing huge amounts of data in order to get a very accurate estimate of extreme quantiles in the range 99 – 99.9999%. The new approach provided in Mahout, called the t-digest, can accurately estimate quantiles as extreme as 99.995% or more using streaming data, so sorting of the data is not required.
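The flavor of quantile estimation by one-dimensional clustering can be shown with a much simplified sketch in plain Java. This is not the t-digest implementation shipped with Mahout: the class name, the compression parameter and the exact size limit used here are assumptions made for illustration, and the sketch omits the compaction a production version would need. The idea it illustrates is that clusters near the extremes are kept very small, which keeps extreme quantiles accurate, while clusters near the median are allowed to grow.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Simplified streaming quantile sketch based on 1-D clustering
// (illustrative only; NOT Mahout's t-digest implementation).
class StreamingQuantileSketch {
    private final double compression;   // assumed tuning knob: larger = more clusters
    private final List<double[]> centroids = new ArrayList<>(); // {mean, weight}, sorted by mean
    private double total = 0;

    StreamingQuantileSketch(double compression) {
        this.compression = compression;
    }

    void add(double x) {
        int pos = 0;
        double before = 0;   // weight of all centroids with mean < x
        while (pos < centroids.size() && centroids.get(pos)[0] < x) {
            before += centroids.get(pos)[1];
            pos++;
        }
        // Pick the nearer of the two neighbouring centroids, if any.
        int nearest = pos;
        double weightBefore = before;
        if (pos == centroids.size()
                || (pos > 0 && x - centroids.get(pos - 1)[0] <= centroids.get(pos)[0] - x)) {
            nearest = pos - 1;
            if (nearest >= 0) {
                weightBefore = before - centroids.get(nearest)[1];
            }
        }
        total++;
        if (nearest >= 0) {
            double[] c = centroids.get(nearest);
            double q = (weightBefore + c[1] / 2) / total;
            // Size limit shrinks toward the extremes (q near 0 or 1).
            double limit = 4 * total * q * (1 - q) / compression;
            if (c[1] + 1 <= limit) {
                c[1] += 1;
                c[0] += (x - c[0]) / c[1];   // running mean; order is preserved
                return;
            }
        }
        centroids.add(pos, new double[] {x, 1});
    }

    // Interpolate between the centers of neighbouring centroids.
    double quantile(double q) {
        if (centroids.isEmpty()) {
            throw new IllegalStateException("no data added yet");
        }
        double target = q * total;
        double seen = 0;
        double prevCenter = 0;
        double prevMean = centroids.get(0)[0];
        for (double[] c : centroids) {
            double center = seen + c[1] / 2;
            if (target <= center) {
                double t = (target - prevCenter) / (center - prevCenter);
                return prevMean + t * (c[0] - prevMean);
            }
            prevCenter = center;
            prevMean = c[0];
            seen += c[1];
        }
        return prevMean;
    }

    int clusterCount() {
        return centroids.size();
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int n = 20000;
        double[] data = new double[n];
        StreamingQuantileSketch sketch = new StreamingQuantileSketch(100);
        for (int i = 0; i < n; i++) {
            data[i] = rnd.nextDouble();
            sketch.add(data[i]);
        }
        Arrays.sort(data);   // exact answer, only for comparison
        System.out.println("exact 99%:    " + data[(int) (0.99 * n)]);
        System.out.println("estimate 99%: " + sketch.quantile(0.99));
        System.out.println("clusters kept: " + sketch.clusterCount());
    }
}
```

The sketch keeps only the cluster means and weights instead of the raw values, so the sorted copy in the usage example exists purely to check the estimate.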

There are widespread uses for accurate estimates of extreme quantiles, such as setting a threshold for alarms in automated anomaly detection systems. The t-digest algorithm to accurately estimate extreme quantiles was developed by Mahout committer and MapR Technologies Chief Application Architect Ted Dunning.

Streamlining Mahout

As with the previous release, one “feature” of the 0.9 version is the removal of less useful or less popular features: the release streamlines Mahout by deprecating a selection of algorithms that were either underperforming or not widely used. A list of the algorithms that were removed is included in the release notes provided on the Apache Mahout website by Suneel [1].

For more information or to get involved with the open source Apache Mahout project, consider these resources:

[1] Apache Mahout project website, including the 0.9 release notes: https://mahout.apache.org/

[2] Practical Machine Learning: Innovations in Recommendation by Ted Dunning and Ellen Friedman (Feb 2014). Download courtesy of MapR: http://bit.ly/1owrtpe

Follow the project and community on Twitter: @ApacheMahout https://twitter.com/ApacheMahout
