Ted Dunning is Chief Application Architect at MapR Technologies and a committer and PMC member of the Apache Mahout, Apache ZooKeeper, and Apache Drill projects. Ted has been very active in mentoring new Apache projects and is currently serving as vice president of incubation for the Apache Software Foundation. Ted was the chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems. He built fraud detection systems for ID Analytics (later purchased by LifeLock), and he has 24 patents issued to date and a dozen pending. Ted has a PhD in computing science from the University of Sheffield. When he’s not doing data science, he plays guitar and mandolin. He also bought the beer at the first Hadoop user group meeting.
Improvements in Machine Learning: Apache Mahout 0.8 Release
August 29, 2013
Machine learning with the open source project Apache Mahout just got better with the much-anticipated Mahout version 0.8, released on July 25, 2013. It’s leaner, with less-used features removed and some powerful new ones added, including improved recommendation and a super-fast new clustering algorithm.
Apache Mahout is a scalable library of learning algorithms useful with Hadoop and non-Hadoop big data systems. Mahout algorithms are especially useful in three large areas of machine learning: recommendation, clustering and classification. It’s a well-developed project, in use for several years and with a growing community of users as well as core developers. And now, after a lot of focused, hard work on the part of the Mahout community, Mahout 0.8 is ready for the public.
Apache Mahout version 0.8 is the first new release in a year, and an important step on the way to the fully mature 1.0 version. Because Mahout is an open source project, work proceeds at an uneven rate, with contributions made as time permits by people worldwide who are part of the vibrant Mahout community. But the effort really heated up this summer, with the final run-up to the July release.
One of the final steps before the release of version 0.8 was an absolute flood of bug fixes. In early June this big surge of bug fixes was given a boost by the Berlin Buzzwords conference, which brought together a large group of core Mahout committers a few days before the conference. It’s impressive what can be accomplished through a combination of determination, focus, beer and having people in the same time zone, indeed in the same room, as happened in Berlin on the eve of Buzzwords.
I was pleased to be among the group of core committers working together in Berlin. These included Robin Anil – Google (Chicago), Grant Ingersoll – LucidWorks (Raleigh, N.C.), Isabel Drost-Fromm – Nokia (Berlin), Dan Filimon – Google (Romania) and Sebastian Schelter – Technische Universität (Berlin) but temporarily at IBM (San Jose).
In addition to extensive updates throughout the Mahout code, one of the big changes to Mahout has to do with clustering. Clustering is a form of machine learning that helps find patterns in big data by grouping similar items together. With the release of 0.8, Mahout has a new super-fast k-means clustering algorithm that, I find, is attracting a lot of attention.
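For readers new to k-means, here is a minimal sketch of the basic idea in plain Python. It is the textbook iterative version (assign each point to its nearest centroid, then move each centroid to the mean of its points), not Mahout's new implementation and not its API. The function names and the naive "first k points" initialization are assumptions made purely for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iters=20):
    """Textbook k-means sketch (illustrative only, not Mahout code).

    Assumption: centroids start as the first k points, which keeps the
    example deterministic but is a poor choice for real data.
    """
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, new_assignments) if a == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = [sum(xs) / len(members) for xs in zip(*members)]
        if new_assignments == assignments:
            break  # nothing changed, so the clustering has converged
        assignments = new_assignments
    return centroids, assignments


# Two obvious groups of 2-D points, one near (0, 0) and one near (10, 10).
points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
centroids, assignments = kmeans(points, k=2)
```

Every pass here rescans all the points, which is the cost that makes faster k-means variants attractive on big data.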
Figure: Bug fixes in the final push for release of Apache Mahout 0.8
Other updates in the 0.8 release include better performance through extensive improvements to vectors and matrices in the Math library and to the recommender implementation. Mahout 0.8 also has better support for SVD++ and for SGD matrix factorization for rating prediction with user and item biases. And because tests now run in parallel, Mahout builds are faster.
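To make "SGD matrix factorization with user and item biases" a little more concrete, here is a small sketch of the general technique in plain Python. It predicts a rating as global mean + user bias + item bias + the dot product of user and item factor vectors, and it nudges each term after every observed rating. This is not Mahout's code or API. All names, hyperparameters and the toy ratings are assumptions chosen for illustration.

```python
import math
import random


def train_biased_mf(ratings, n_users, n_items, n_factors=2,
                    lr=0.01, reg=0.02, n_epochs=200, seed=0):
    """Biased matrix factorization trained by SGD (illustrative sketch).

    ratings is a list of (user_index, item_index, rating) triples.
    Hyperparameter defaults are arbitrary assumptions, not tuned values.
    """
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    bu = [0.0] * n_users                               # user biases
    bi = [0.0] * n_items                               # item biases
    P = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    order = list(ratings)
    for _ in range(n_epochs):
        rng.shuffle(order)  # visit ratings in a fresh random order each epoch
        for u, i, r in order:
            dot = sum(pf * qf for pf, qf in zip(P[u], Q[i]))
            err = r - (mu + bu[u] + bi[i] + dot)
            # Gradient steps with L2 regularization on biases and factors.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return mu, bu, bi, P, Q


def predict(model, u, i):
    """Predicted rating of item i by user u under the trained model."""
    mu, bu, bi, P, Q = model
    return mu + bu[u] + bi[i] + sum(pf * qf for pf, qf in zip(P[u], Q[i]))


def rmse(model, ratings):
    """Root mean squared error of the model on the given ratings."""
    return math.sqrt(
        sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings) / len(ratings)
    )


# Toy data: 4 users, 3 items, ratings on a 1-5 scale (made up for the example).
toy_ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1),
               (2, 1, 2), (2, 2, 5), (3, 0, 5), (3, 2, 2)]
model = train_biased_mf(toy_ratings, n_users=4, n_items=3)
```

The bias terms absorb effects such as a user who rates everything high or an item that is broadly liked, leaving the factor vectors to capture the remaining user-item interaction.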
To find out more on this new version of Apache Mahout, view the slides and video from a talk by MapR’s Ted Dunning on the day of the release, presented at the Twin Cities HUG in Saint Paul, Minn.
Visit the Apache Mahout website: http://bit.ly/15lvl2x
Follow the Mahout community on Twitter: @ApacheMahout