An Inside Look at the Components of a Recommendation Engine

Recommendation engines help narrow your choices to those that best meet your particular needs. In this post, we take a closer look at how the different components of a recommendation engine work together, using collaborative filtering on movie ratings data to recommend movies. The key components are a collaborative filtering algorithm in Apache Mahout, which builds and trains a machine learning model, and search technology from Elasticsearch, which simplifies deployment of the recommender.

What is Recommendation?

Recommendation is a class of machine learning that uses data to predict a user's preference for, or rating of, an item. Recommender systems are used in industry to recommend:

  • Books and other products (e.g. Amazon)
  • Music (e.g. Pandora)
  • Movies (e.g. Netflix)
  • Restaurants (e.g. Yelp)
  • Jobs (e.g. LinkedIn)

Netflix recommendation engine

The recommender relies on the following observations:

  1. Behavior of users is the best clue to what they want.
  2. Co-occurrence is a simple basis that allows Apache Mahout to compute significant indicators of what should be recommended.
  3. There are similarities between the weighting of indicator scores in the output of such a model and the mathematics that underlie text retrieval engines.
  4. This mathematical similarity makes it possible to exploit text-based search to deploy a Mahout recommender using a search engine like Elasticsearch.

recommendation engine architecture

Architecture of the Recommendation Engine

The architecture of the recommendation engine is shown below:

architecture of a recommendation engine

  1. Movie information data is reformatted and then stored in Elasticsearch for searching.
  2. An item-similarity algorithm from Apache Mahout is run on user movie ratings data to create recommendation indicators for movies. These indicators are added to the movie documents in Elasticsearch.
  3. Searching for a user's preferred movies among the indicators of other movies returns a list of new films sorted by relevance to the user's taste.

Collaborative Filtering with Mahout

A Mahout-based collaborative filtering engine looks at what users have historically done and tries to estimate what they are likely to do in the future, given the chance. It does this by looking at a history of which items users have interacted with, and in particular at how items co-occur in user histories. Suppose that Ted likes movies A, B, and C, Carol likes movies A and B, and Bob likes movie B. To recommend a movie to Bob, we can note that Ted and Carol also like movie B and that both of them like movie A, so movie A is a possible recommendation. Of course, this is a tiny example. In real situations, we would have vastly more data to work with.

recommendation grid

In order to get useful indicators for recommendation, Mahout’s ItemSimilarity program builds three matrices from the user history:

1. History matrix:  contains the interactions between users and items as a user-by-item binary matrix.

history matrix

2. Co-occurrence matrix:  transforms the history matrix into an item-by-item matrix, recording which items co-occur or appear together in user histories.

co-occurrence matrix

In this example movie A and movie B co-occur once, while movie A and movie C co-occur twice.  The co-occurrence matrix cannot be used directly as recommendation indicators because very common items will tend to occur with lots of other items simply because they are common.  

3. Indicator matrix: retains only the anomalous (interesting) co-occurrences that will serve as clues for recommendation. Some items (in this case, movies) are so popular that almost everyone likes them, so they co-occur with almost every other item, which makes those co-occurrences uninteresting (not anomalous) for recommendations. Co-occurrences that are too sparse to be meaningful are also not treated as anomalous and thus are not retained. In this example, movie A is an indicator for movie B.

indicator matrix
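As a rough illustration of the first two matrices, here is a minimal Python sketch that builds a history matrix and a co-occurrence matrix from the Ted, Carol, and Bob histories described earlier. It is not Mahout code; the names `histories`, `history`, and `cooccurrence` are invented for this example, and the counts it produces belong to this toy data only and need not match the figures.

```python
from itertools import combinations

# Toy user histories from the Ted/Carol/Bob example in the text (assumed data).
histories = {
    "Ted": ["A", "B", "C"],
    "Carol": ["A", "B"],
    "Bob": ["B"],
}

# Fixed column order for the matrices.
items = sorted({item for liked in histories.values() for item in liked})

# History matrix: one row per user, one column per item, 1 = user interacted.
history = {
    user: [1 if item in liked else 0 for item in items]
    for user, liked in histories.items()
}

# Co-occurrence matrix: for each pair of distinct items, count how many
# users have both items in their history.
cooccurrence = {a: {b: 0 for b in items} for a in items}
for liked in histories.values():
    for a, b in combinations(sorted(set(liked)), 2):
        cooccurrence[a][b] += 1
        cooccurrence[b][a] += 1

for user, row in history.items():
    print(user, row)
for item in items:
    print(item, cooccurrence[item])
```

On real data the same idea is usually expressed as a sparse matrix product of the history matrix with its transpose, which is the kind of work Mahout distributes across a cluster.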

Mahout runs multiple MapReduce jobs to calculate the co-occurrences of items in parallel (Mahout 1.0 runs on Apache Spark). Mahout’s ItemSimilarityJob uses the log-likelihood ratio (LLR) test to determine which co-occurrences are sufficiently anomalous to be of interest as indicators. The output gives pairs of items whose similarity is greater than the threshold you provide.
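To make the LLR test more concrete, the sketch below computes the standard log-likelihood ratio (G²) statistic for a 2x2 contingency table in plain Python. It is an illustration under assumptions, not Mahout's own code: the function names are invented here, and the four counts are assumed to mean users with both items (`k11`), with the first item only (`k12`), with the second item only (`k21`), and with neither (`k22`).

```python
import math


def x_log_x(x):
    """Return x * ln(x), treating 0 * ln(0) as 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio score for a 2x2 table of co-occurrence counts."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    column_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp at zero to absorb floating-point round-off.
    return max(0.0, 2.0 * (row_entropy + column_entropy - matrix_entropy))


# Two items that appear together no more often than chance would predict.
independent = llr(10, 10, 10, 10)
# Two items that always appear together and never apart.
associated = llr(100, 0, 0, 100)
print(independent, associated)
```

A score near zero means the co-occurrence is about what chance would produce, while a large score marks an anomalous pair. Keeping only pairs whose score exceeds a chosen threshold is the filtering step the paragraph above describes.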

For each item, the output of the Mahout ItemSimilarity job lists the other items whose co-occurrence with it is interesting, that is, the items that indicate a recommendation. For example, the Movie B row shows that Movie A is indicated, which means that liking Movie A is an indicator that you will like Movie B.

indicator matrix

Elasticsearch Search Engine

Elasticsearch is an open-source search engine built on top of Apache Lucene™, a full-text search engine library. Full-text search uses precision and recall to evaluate search results:

  • Precision = proportion of top-scoring results that are relevant
  • Recall = proportion of relevant results that are top-scoring
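The two definitions above can be worked through on a small example. This is a minimal sketch in Python; the function name, the document ids, and the cutoff `k` (how many results count as "top-scoring") are all assumptions made for illustration.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Compute precision and recall over the top-k ranked results."""
    top_k = ranked_ids[:k]
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    # Precision: share of the top-k results that are relevant.
    precision = hits / len(top_k) if top_k else 0.0
    # Recall: share of all relevant documents that made it into the top k.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search results, best first, and a hypothetical relevant set.
ranked = ["m1", "m2", "m3", "m4", "m5"]
relevant = {"m1", "m3", "m9"}

precision, recall = precision_recall_at_k(ranked, relevant, k=3)
print(precision, recall)
```

Here two of the top three results are relevant, and two of the three relevant documents appear in the top three, so both values are 2/3.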

Elasticsearch stores documents, which are made up of different fields. Each field has a name and content. Fields can be indexed and stored to allow documents to be found by searching for content found in fields.

For our recommendation engine, we store movie metadata such as id, title, and genre, along with the movie recommendation indicators, in a JSON document:

    {
      "id": "65006",
      "title": "Electric Horseman",
      "year": "2008",
      "genre": ["Mystery", "Thriller"]
    }

The output row from the indicator matrix that identified significant or interesting co-occurrence is stored in the Elasticsearch movie document indicator field. For example, since Movie A is an indicator for Movie B, we will store Movie A in the indicator field in the document for Movie B. That means that when we search for movies with Movie A as an indicator, we will find Movie B and present it as a recommendation.
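The step described above can be sketched in Python as a simple merge of indicator rows into movie documents. Everything here is illustrative: the field name `indicators`, the helper `attach_indicators`, and the two toy movie documents are assumptions, and the sketch only builds the documents without sending them to Elasticsearch.

```python
import json

# Toy movie documents keyed by id (hypothetical data).
movies = {
    "A": {"id": "A", "title": "Movie A"},
    "B": {"id": "B", "title": "Movie B"},
}

# One row of the indicator matrix per movie: Movie A is an indicator for Movie B.
indicator_rows = {"B": ["A"]}


def attach_indicators(movies, indicator_rows, field="indicators"):
    """Return copies of the movie documents with an indicator field added."""
    docs = {}
    for movie_id, doc in movies.items():
        enriched = dict(doc)  # copy so the input documents stay unchanged
        enriched[field] = list(indicator_rows.get(movie_id, []))
        docs[movie_id] = enriched
    return docs


docs = attach_indicators(movies, indicator_rows)
print(json.dumps(docs["B"], indent=2))
```

Movies with no retained indicators simply get an empty list, so they are never returned by an indicator search.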

recommendation matrix

Search engines are optimized to find the documents whose fields are most similar to a query. We will use the search engine to find the movies whose indicator fields are most similar to a query built from the movies a user already likes.
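One way such a query might look is sketched below in Python, which only builds the request body as a dictionary and prints it as JSON. It uses the `bool`, `term`, and `terms` queries from the Elasticsearch Query DSL, but the field names `indicators` and `id`, the helper name, and the assumption that indicators are indexed as exact-match values are illustrative choices, not details from this post.

```python
import json


def build_recommendation_query(liked_ids, field="indicators", size=10):
    """Build a search body that ranks movies by how many liked ids they list as indicators."""
    liked = list(liked_ids)
    return {
        "size": size,
        "query": {
            "bool": {
                # Each liked movie found in a document's indicator field adds to its score.
                "should": [{"term": {field: movie_id}} for movie_id in liked],
                "minimum_should_match": 1,
                # Do not recommend movies the user has already liked.
                "must_not": [{"terms": {"id": liked}}],
            }
        },
    }


body = build_recommendation_query(["A", "C"])
print(json.dumps(body, indent=2))
```

The resulting body could be sent to the search endpoint of the movie index. How scores are computed in practice depends on the field mapping and the Elasticsearch version in use.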
