Mahout on Spark: What’s New in Recommenders—Part 2

In the last post we talked about creating a co-occurrence indicator matrix for a recommender. The goals for a recommender are many but first they must be fast and must make "good" personalized recommendations. We'll start with the basics and improve on the "good" part as we go. As we saw last time Cooccurrence or Item-Based recommender are described by:
rp = hp[AtA]

Calculating [AtA]

We created a demo web app using the technique described here. It's a discovery app for online videos called Input was collected for many users in the form


For now we will ignore the "dislikes" for training purposes, we'll filter them out (but we'll revisit this later.) The Mahout 1.0 version of spark-itemsimilarity can read these files directly with the following command line:

mahout spark-itemsimilarity -i root-input-dir \
-o output-dir \
--filter1 like -fc 1 -ic 2 \ 
#filter out all but likes \
#indicate columns for filter and items \
This will give us output like this:
the_hunger_games_catching_fire<tab>holiday_inn 2_guns superman_man_of_steel five_card_stud district_13 blue_jasmine this_is_the_end riddick ...
law_abiding_citizen<tab>stone ong_bak_2 the_grey american centurion edge_of_darkness orphan hausu buried ...
munich<tab>the_devil_wears_prada brothers marie_antoinette singer brothers_grimm apocalypto ghost_dog_the_way_of_the_samurai ...
private_parts<tab>puccini_for_beginners finding_forrester anger_management small_soldiers ice_age_2 karate_kid magicians ...
world-war-z<tab>the_wolverine the_hunger_games_catching_fire ghost_rider_spirit_of_vengeance holiday_inn the_hangover_part_iii ...

This is a tab-delimited file with a video itemID followed by a string containing a list of similar videos. Don't expect to see items that seem to be on similar subjects or are even from the same genre. Here, similar means that they were liked by the same people. We'll use another technique to narrow the items down to ones of the same genre later.

Notice that both the input and the output use the unique IDs defined by the application.

Search Engine as Similarity Engine

In the Guide demo app we want to make good recommendations even to new users. In order to do this we need to do the background training phase to calculate [AtA] using whatever data we have previously collected as shown above. We need to capture hp by recording new user preferences and then somehow performs the rp = hp[AtA].

Remember in the first post we described using similarity in place of a matrix (or vector) multiply. We can use the same trick with one of the highly optimized and scalable similarity engines available in open source. At the core search engines are really NoSQL similarity engines. They index data in the form of documents of text (matrix of token vectors) and take text as the query (a token vector). Another way to look at this is that search engines find by example-they are optimized to find a collection of items by similarity to the query. We will use the search engine to find the most similar indicator vectors in [AtA] to our query hp thereby producing rp.

To index the indicator matrix from the output of spark-itemsimilarity we need to decide how we want to integrate with the search engine. In the Guide demo we chose to create a catalog of items in a database and to use Solr to index columns in the database. Both Solr and Elasticsearch have highly scalable fast engines.

We have a schema like this

itemID foreign-key      genres        indicators
123    world-war-z      sci-fi action the_wolverine …
456    captain_phillips action drama  pi when_com…

We must now set the search engine to index the indicators. This integration is usually pretty simple and depends on what database you use. As an alternative you can store the entire catalog in files that Solr can index, this will leave the database out of the loop. Once you've triggered the indexing of your indicators we are ready for the query. The query will be a preference history vector consisting of the same tokens you see in the indicators-application specific video IDs. For a known user these should be logged and available, perhaps in the database, but for a new user we'll have to find a way to encourage preference feedback.

New Users

The demo site asks a new user to create an account and run through a trainer that collects important preferences from the user. How we determine "important preferences" is another blog post but it involves clustering items and taking popular ones from each cluster so that the users are more likely to have seen them.

From this we see that the user likes:
argo django iron_man_3 pi looper …

Using that as a query on the indicator field of the database returns recommendations even though the new user's data was not used to train the recommender. Here's what we get:

The first line shows the result of the search engine query for the new user. The trainer on the demo site has several pages of examples to rate and the more you rate the better the recs become, as one would expect but these are pretty good (I used my own taste to rate so my opinion counts even if the "eyeball test" is typically not very reliable).

The second line of recommendations and several more below it are calculated using a genre and this begins to show the power of the search engine method. In the trainer I picked movies where the #1 genre was "drama". If you have the search engine index both indicators as well as genres you can combine indicator and genre preferences in the query.

To produce line 1 the query was:

indicator field: "argo django iron_man_3 pi looper …"
To produce line 2 the query was:
indicator field: "argo django iron_man_3 pi looper …"
genre field: "drama"; boost: 5

The boost is used to skew results towards a field. In practice this will give you mostly matching genres but is not the same as a filter, which can also be used if you want a guarantee that the results will be from "drama".


Combining a search engine with Mahout created a recommender that is extremely fast and scalable but also seamlessly blends results using collaborative filtering data and metadata. Using metadata boosting in the query allows us to skew result in some direction that makes sense.

Using multiple fields in a single query gives us even more power than this example shows. It allows us to mix in different actions. Remember the "dislike" action that we discarded? One simple and reasonable way to use that is to filter results by things the user disliked and the demo site does that. But we can go even further; we can use dislikes in a cross-action recommender. Certain of the user's dislikes might even predict what they will like but that requires us to go back to the original equation so we'll leave it for another post.


Streaming Data Architecture:

New Designs Using Apache Kafka and MapR Streams




Download for free