Making Mahout Fast and Easy

Big changes are underway for the open source machine learning project Apache Mahout, and there’s a lot of excitement over this new work. Mahout is a library of highly scalable machine learning algorithms that is part of the MapR Distribution for Hadoop. These sweeping changes will make Mahout run much faster and make it much easier to use. The new development will add support for Scala, Apache Spark and h2o, making the next release, Mahout 1.0, a strong solution for machine learning at scale. Users will have new options for how they build and run their machine learning projects.

Dmitriy Lyubimov, Apache Mahout committer and AgilOne Data Scientist, is taking the initiative with the development of Scala and Apache Spark support for Mahout. MapR Chief Application Architect Ted Dunning, also a Mahout committer, is working with developers from 0xdata to connect Mahout support with h2o. Here’s what both of these efforts will mean for Apache Mahout users:

With the new development to support Scala, Spark and h2o, Mahout will become easy to use and fast. The impact will be felt across all of Mahout’s capabilities in recommendation, classification and clustering, and the Mahout Math library will see big differences, too. No wonder there’s a lot of buzz over it.

Mahout has been a leader in scalable machine learning from the start. With releases 0.8 and the current release 0.9, the project jettisoned some less-used portions and built up its strengths in recommendation and in clustering. Highlights include the addition of very fast k-means clustering algorithms and a substantially expanded Mahout Math library that is a useful tool in its own right. With innovations such as using search technology to deploy a recommendation model, Mahout recommendation is already strong. But now, with the bindings for Apache Spark and h2o, recommendation will be easier and faster, coding math will be much easier thanks to Scala, clustering algorithms will also run much faster and be easier to approach, and Mahout classification is at last really going to shine.

Apache Mahout Diagram

Part of the reason for the overall improved performance that these changes will bring is the change in computational framework. Until now, each Mahout capability had its own interface to MapReduce, and the APIs differed in many ways. The re-coding of algorithms will provide a clean overall design on top of the newly supported computational frameworks.

We are receiving enthusiastic feedback about the decision to make these changes, both from Mahout users and from people newly interested in machine learning, via the Mahout user mailing list and @ApacheMahout on Twitter. The comments swirling around on Twitter reflect excitement and are mostly accurate, but there is some confusion on an important point. To clarify, here’s a summary of some key questions:

Will Mahout ML be easier?  Yes, it will be cleaner and easier to write math, thanks in part to Scala.

Will it be faster? Yes, very fast.

Will users have new choices of computational frameworks? Yes, Apache Spark and h2o will both be supported.

Is the project moving away from MapReduce? Yes, as part of the re-coding to make Mahout faster.

Is the project moving away from Hadoop? No, and that’s important.

Remember that Hadoop distributions serve several purposes, including computation using MapReduce and distributed storage, which is HDFS in some distributions and MapR FS in the MapR distribution. Large scale machine learning built using Apache Mahout needs a reliable and efficient way to store the raw data, a data platform on which to do pre-processing and feature extraction to make the data ready for the machine learning algorithms, and a good data platform to store the output of the model. For all of these, a good Hadoop system is an excellent solution. It is only for iterative mathematical computations, such as those in Mahout, that MapReduce as implemented in Hadoop is unsuitable.
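To make the point about iterative computation concrete, here is a minimal sketch of k-means in plain Python. It is not Mahout code and uses no Mahout, Spark or h2o API; the function name kmeans_1d and the sample data are hypothetical, chosen only for illustration. What it shows is that every iteration scans the full data set again. With MapReduce, each such pass is typically a separate job that re-reads its input from distributed storage, which is why an in-memory framework can help with this kind of algorithm.

```python
def kmeans_1d(points, centers, max_iters=20):
    """Lloyd's k-means on 1-D data (illustrative only, not Mahout's implementation).

    Returns the final centers and the number of full passes over the data.
    """
    passes = 0
    for _ in range(max_iters):
        passes += 1  # every iteration scans the whole data set again

        # Assignment step: group each point with its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)

        # Update step: move each center to the mean of its assigned points.
        # An empty cluster keeps its previous center.
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]

        if new_centers == centers:  # nothing moved, so we have converged
            break
        centers = new_centers
    return centers, passes


# Hypothetical sample data: two obvious groups, near 1 and near 9.5.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centers, passes = kmeans_1d(points, centers=[0.0, 5.0])
print(centers, passes)
```

Even this tiny example needs more than one pass over the data before the centers stop moving. Real data sets usually need many more, and the cost of re-reading the input on each pass grows with the size of the data.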

You can also join a discussion on the new changes in Mahout via the Mahout mailing lists or in person at the Apache Mahout meetup in the Bay Area, hosted at Intuit in Mountain View on April 17, 2014.

Apache Mahout meetup April 17 RSVP here

Follow on Twitter @ApacheMahout 

Apache Mahout project website

Download a free copy of Practical Machine Learning: Innovations in Recommendation by Friedman and Dunning (pub. by O’Reilly Feb 2014)

Details of the 0.9 release were reported in the MapR blog here

