Building a Classification Model Using Spark

In this tutorial, we are going to build a real-time classification model using Spark on the MapR Converged Data Platform. The details of this example are described in previous blog posts. The first blog, "Real-Time User Profiles with Spark, Drill and MapR-DB," describes the scenario and dataset, with instructions on how to compute some basic statistics and build the training model. The second blog, "Classifying Customers with MLLib and Spark," describes the models used to classify the customers.

We’ll be using a hypothetical music streaming service in this tutorial: customers log in to the service and listen to music tracks. We will use Python, PySpark, and MLLib to compute some basic statistics and a simple training model that serves as input to the classifier.


To follow this tutorial, you will need a machine that meets these minimum requirements:

  • 8GB RAM, multi-core CPU
  • 20GB minimum HDD space
  • Internet access



  • Files with the data and Python scripts to use for this tutorial can be found here:
    • The following data files are provided:

      • tracks.csv - a collection of events, one per line, where each event is a client listening to a track. This data is approximately 1M lines and contains simulated listener events over several months.
      • cust.csv – information about individual customers
      • music.csv – information about music tracks
      • clicks.csv – history about clicking on advertisements

      Once you have all the necessary software installed, log in to the MapR Sandbox as the user mapr (the password is mapr). Copy the folder spark_music_demo to the /user/mapr folder (cp -R spark_music_demo /user/mapr).


      The tutorial consists of the following steps:

      1. Create MapR-DB tables
      2. Load data into MapR-DB tables
      3. Compute summary information
      4. Classify customers

      Step 1. Create MapR-DB tables
      Create three tables – cust_table, agg_table and live_table. Each table has one column family – cdata, adata and ldata, respectively. You can create the tables with their column families using the HBase shell, maprcli, or the MapR Control System (MCS). The instructions here use the HBase shell.

      1. Launch the HBase shell: $ hbase shell
      2. Each of the following commands creates the specified table in /user/mapr with the corresponding column family.
        hbase> create '/user/mapr/cust_table', {NAME=>'cdata'}
        hbase> create '/user/mapr/agg_table', {NAME=>'adata'}
        hbase> create '/user/mapr/live_table', {NAME=>'ldata'}
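
      As an optional sanity check (not required by the tutorial), you can confirm that a table was created and inspect its column family from the same HBase shell, for example:

        hbase> describe '/user/mapr/cust_table'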

      Step 2. Load data into MapR-DB tables
      The cust_table is populated with the data in the cust.csv file using the provided script (a minimal sketch of such a loader follows the steps below).

      1. Change directories to /user/mapr/spark_music_demo.
      2. Edit the file if necessary to adjust any parameters such as path.
      3. Run the script: $ python

      You should see output indicating the records were loaded.
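
      The tutorial's loader script is not reproduced here, but the following is a minimal sketch of the same idea. It assumes the MapR-DB table is reachable through an HBase Thrift gateway using the happybase package, and that cust.csv has a header row with a CustID column – all assumptions to adjust to the actual file and script:

        import csv
        import happybase

        # Connect through the HBase-compatible Thrift gateway (host/port are assumptions).
        connection = happybase.Connection('localhost', 9090)
        table = connection.table('/user/mapr/cust_table')

        with open('cust.csv') as f:
            reader = csv.DictReader(f)
            for row in reader:
                # Use the customer id as the row key (CustID is an assumed column name)
                # and store every CSV column under the cdata column family.
                table.put(row['CustID'],
                          {'cdata:%s' % col: str(val) for col, val in row.items()})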

      Step 3. Compute summary information for each customer
      We use Spark to compute the summary information for each customer, as well as some basic statistics about the user base. The script computes a summary profile for each user containing the average number of tracks listened to during each period of the day (morning, afternoon, evening, and night), the total number of unique tracks that user listened to, and the total number of tracks that user listened to on a mobile device. The per-user data is written back to the live_table and the aggregated data to the agg_table created in Step 1. The script also generates a training file called “features.txt” that contains all of the points for training one or more classifiers. A sketch of this computation is shown below.
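
      The following is a minimal sketch, not the tutorial's actual script, of how such per-customer summaries can be computed with PySpark. The tracks.csv column order assumed here (event id, customer id, track id, timestamp, mobile flag) and the hour ranges used for the periods of the day are illustrative assumptions:

        from pyspark import SparkContext

        sc = SparkContext(appName="CustomerSummary")

        def parse_track(line):
            # Assumed layout: event id, customer id, track id, "YYYY-MM-DD HH:MM:SS", mobile flag
            fields = line.split(",")
            hour = int(fields[3].split(" ")[1].split(":")[0])
            return (fields[1], (hour, int(fields[4]), fields[2]))

        def summarize(events):
            # Count tracks per period of day plus unique and mobile totals for one customer.
            morning = afternoon = evening = night = mobile = 0
            tracks = set()
            for hour, is_mobile, track_id in events:
                tracks.add(track_id)
                mobile += is_mobile
                if 5 <= hour < 12:
                    morning += 1
                elif 12 <= hour < 17:
                    afternoon += 1
                elif 17 <= hour < 22:
                    evening += 1
                else:
                    night += 1
            return [morning, afternoon, evening, night, len(tracks), mobile]

        summaries = (sc.textFile("/user/mapr/tracks.csv")
                       .map(parse_track)
                       .groupByKey()
                       .mapValues(summarize))

        print(summaries.take(5))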

      1. Edit the script to modify the paths to the tables and .csv files as needed. (Note that this script looks for the .csv files in the /user/mapr folder; either copy the files there or modify the paths in the script.)
      2. $ /opt/mapr/spark/spark-1.2.1/bin/spark-submit ./
      3. Once the script runs, view the summary. The information shown here is the average value of each statistic across all users.
      4. You should also see the training file features.txt created under the directory specified in the script. This file serves as input to the prediction of which customers are likely to click on the ad based on their past behavior.

        The 0 or 1 at the beginning is the label and indicates whether this customer clicked on the particular ad; the remainder of the line lists the values for each feature.
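
        For illustration only (these numbers are made up, not taken from the dataset), a line of features.txt could look like:

          1 0.25 0.30 0.20 0.25 0.45 0.65

        Here 1 is the click label, followed by the six feature values described in Step 4 below.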

      Step 4. Classify customers
      We can now run our two different models. To summarize, our data has the following features:

      • Listening behaviors: the share of overall tracks listened to during the morning, afternoon, evening and night hours
      • Mobile tracks listened (share of overall)
      • Unique tracks listened (share of overall)

      The script trains the two models on the training set, classifies customers in the test set, and measures how many were correct. You often use the same data to train a few different classifiers, and then compare results before choosing which one (or group) to use in production. We do that here with two: LogisticRegressionWithSGD and LogisticRegressionWithLBFGS, both of which are provided by the pyspark.mllib package.
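
      As a rough sketch of what such a script does (not the tutorial's actual code; the path to features.txt and the 70/30 split are assumptions based on this tutorial), the two classifiers can be trained and compared like this:

        from pyspark import SparkContext
        from pyspark.mllib.regression import LabeledPoint
        from pyspark.mllib.classification import (LogisticRegressionWithSGD,
                                                  LogisticRegressionWithLBFGS)

        sc = SparkContext(appName="ClassifyCustomers")

        def parse_point(line):
            # First value is the 0/1 click label; the rest are the behavioral features.
            values = [float(x) for x in line.split()]
            return LabeledPoint(values[0], values[1:])

        data = sc.textFile("/user/mapr/features.txt").map(parse_point)
        training, test = data.randomSplit([0.7, 0.3], seed=42)
        training.cache()

        for name, algo in [("SGD", LogisticRegressionWithSGD),
                           ("LBFGS", LogisticRegressionWithLBFGS)]:
            model = algo.train(training)
            # Fraction of test points the model labels incorrectly.
            predictions = test.map(lambda p: (p.label, model.predict(p.features)))
            error = predictions.filter(lambda lp: lp[0] != lp[1]).count() / float(test.count())
            print("%s test error: %f" % (name, error))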

      1. Edit the script /user/mapr/spark_music_demo/ to modify the path to the training file (features.txt) if needed.
      2. Run the script
        $ /opt/mapr/spark/spark-1.2.1/bin/spark-submit ./

      You should see the results as shown below.

      Since we had 5000 total customers in the database, a 70/30 split ended up being 3500 customers for training and 1500 for testing. You can see that LBFGS performed somewhat better than SGD, with a lower overall error rate.

      In this tutorial, we used Spark on the MapR platform to build a classification model. We computed customer behavior data with Spark and stored it in MapR-DB tables, then used MLLib to classify our customers so we can fine-tune the experience on our platform. This lets us make decisions quickly and creates an opportunity for a new revenue stream. This simple example shows what is possible with Spark on MapR, and how you can start to use classifications of Hadoop data sources to make better decisions.
