Machine Learning Blog Posts

Posted on January 10, 2017 by Mathieu Dumoulin

This series of blog posts details my findings as I bring to production a fully modern take on Complex Event Processing, or CEP for short. In many applications, ranging from financials to retail and IoT, there is tremendous value in automating tasks that require taking action in real time. Putting aside the IT systems and frameworks that would support it, this is clearly a useful capability.

Posted on January 9, 2017 by Mathieu Dumoulin

This post is a detailed account of a project I carried out to integrate an open source business rules engine with a modern stream messaging system in the Kafka style. The goal of this approach, better known as Complex Event Processing (CEP), is to enable real-time decisions on streaming data, such as in IoT use cases.

Posted on January 6, 2017 by Ranjit Lingaiah

There has been a lot of research in document image processing over the past 20 years, but not much research has been done in terms of parallel processing. Some of the solutions proposed for parallel processing have been to create threads of execution for each image, or to use GNU Parallel.

Posted on January 5, 2017 by Carol McDonald

The first post discussed creating a machine learning model using Apache Spark’s K-means algorithm to cluster Uber data based on location. This second post will discuss using the saved K-means model with streaming data to do real-time analysis of where and when Uber cars are clustered.

Posted on November 28, 2016 by Carol McDonald

According to Gartner, by 2020, a quarter of a billion connected cars will form a major element of the Internet of Things. Connected vehicles are projected to generate 25GB of data per hour, which can be analyzed to provide real-time monitoring and apps, and will lead to new concepts of mobility and vehicle usage.

Posted on October 17, 2016 by Carol McDonald

In this blog post, I’ll help you get started using Apache Spark’s Logistic Regression for predicting cancer malignancy. The goal of Spark’s machine learning library is to provide a set of APIs on top of DataFrames that help users create and tune machine learning workflows or pipelines.

Posted on August 30, 2016 by Dong Meng

Apache PredictionIO is an open source machine learning server. In this article, we integrate Apache PredictionIO with the MapR Converged Data Platform 5.1 as a backend. Specifically, we use MapR-DB (1.1.1) for event data storage, ElasticSearch for metadata storage, and MapR-FS for model data storage.

Posted on August 8, 2016 by Carol McDonald

This post is the first in a series where we will review examples of how Joe Blue, a Data Scientist in MapR Professional Services, assisted MapR customers in identifying new data sources and applying machine learning algorithms in order to better understand their customers. The first example in the series is an advertising customer 360°; the next examples will be banking and healthcare customer 360°.

Posted on July 12, 2016 by Carol McDonald

Random forests are one of the most successful machine learning models for classification. In this blog post, I’ll help you get started using Apache Spark’s Random forests for classification of bank loan credit risk.

Posted on April 20, 2016 by Nick Amato

One of the most useful things to do with machine learning is to inform assumptions about customer behaviors. This has a wide variety of applications: everything from helping customers make superior choices (and often, more profitable ones) to making them contagiously happy about your business and building loyalty over time.

Posted on April 7, 2016 by William Cairns

Having participated in a number of fantasy sports leagues and being a Data Scientist at MapR gives me a unique perspective on my approach to choosing who I think will most likely “win.” The six players, ranked in order, who I predict will most likely finish in 10th place or better this year (and hopefully 1st), based on my statistical modeling, are:

Posted on March 22, 2016 by Ben Sadeghi

Churn prediction is big business. It minimizes customer defection by predicting which customers are likely to cancel a subscription to a service. Though originally used within the telecommunications industry, it has become common practice across banks, ISPs, insurance firms, and other verticals.

Posted on February 22, 2016 by Carol McDonald

Decision trees are widely used for the machine learning tasks of classification and regression. In this blog post, I’ll help you get started using Apache Spark’s MLlib machine learning decision trees for classification.

Posted on February 9, 2016 by Tugdual Grall

Streaming data is of growing interest to many organizations, and most applications need to use a producer-consumer model to ingest and process data in real time. Many messaging solutions exist on the market today, but few of them have been built to handle the challenges of modern deployments related to IoT, large web-based applications, and related big data projects.

Posted on January 7, 2016 by Dong Meng

XGBoost is a library that is designed for boosted (tree) algorithms. It has become a popular machine learning framework among data science practitioners, especially on Kaggle, which is a platform for data prediction competitions where researchers post their data and statisticians and data miners compete to produce the best models.

Posted on August 17, 2015 by Ted Dunning

Ted Dunning, Chief Applications Architect for MapR, talks about some newer streaming algorithms such as t-digest and streaming k-means.

Posted on August 3, 2015 by Carol McDonald

Recommendation systems help narrow your choices to those that best meet your particular needs, and they are among the most popular applications of big data processing. In this post we are going to discuss building a recommendation model from movie ratings. We’ll be using an iterative algorithm and parallel processing with Apache Spark MLlib.

Posted on May 27, 2015 by Nick Amato

In this demo we are using Spark and PySpark to process and analyze the data set, calculate aggregate statistics about the user base in a PySpark script, persist all of that back into MapR-DB for use in Spark and Tableau, and finally use MLlib to build logistic regression models.

Posted on May 12, 2015 by Nick Amato

In this post, I’ll give an example of how we can make predictions that enable us to maximize revenue and ensure the best customer experience. We'll do this using the output of the Spark code from our last adventure.

Posted on May 6, 2015 by Joseph Blue

Building a good classification model requires leveraging the predictive power from your data and that’s a challenge whether you’re looking at four thousand records or four billion; in machine learning parlance, this step is referred to as “feature extraction”.

Posted on April 9, 2015 by Carol McDonald

Recommendation engines help narrow your choices to those that best meet your particular needs. In this post, we’re going to take a closer look at how all the different components of a recommendation engine work together. We’re going to use collaborative filtering on movie ratings data to recommend movies. The key components are a collaborative filtering algorithm in Apache Mahout to build and train a machine learning model, and search technology from Elasticsearch to simplify deployment of the recommender.

Posted on October 20, 2014 by Kirk Borne

This is the first installment of a two-part series on the value of doing small data analyses on a big data collection. In this first article, we describe the applications and benefits of “small data” in general terms from several different perspectives. In Part 2 of this series, we’ll spend some quality time with one specific algorithm (Local Linear Embedding) from a broader class of algorithms (Manifold Learning) that enable local subsets of data (i.e., small data) to be used in developing a global understanding of the full big data collection.

Posted on September 23, 2014 by Kirk Borne

A project manager once told me that “any job worth doing is worth doing poorly.” I understood exactly what she meant, and she knew that I would understand, especially when she preceded our conversation with these words: “I wouldn’t say this to everyone, but I know you will understand what I mean.”  The message was clear to me because I was a perfectionist (and hopefully I have learned over the years to be less of a perfectionist thanks to my project manager’s wise counsel).

Posted on August 12, 2014 by Pat Farrel

There are big changes happening in Apache Mahout. For years it’s been the go-to machine learning library for Hadoop. It contained most of the best-in-class algorithms for scalable machine learning, which means clustering, classification, and recommendation. But it was written for Hadoop and MapReduce. Today a number of new parallel execution engines show great promise in speeding calculations by as much as 10-100x (Spark, H2O, Flink). That means instead of buying 10 computers for a cluster, a single one may do. That should get your manager’s attention.

Posted on July 25, 2014 by Karen Whipple

The recent Skytree and MapR webinar “Predictive Analytics with Machine Learning and Hadoop” proved to be highly interactive and engaging. As promised, Nitin and Jin have provided answers to questions that we were not able to get to during the webinar:

Posted on March 31, 2014 by Ellen Friedman

Big changes are underway for the open source machine learning project Apache Mahout, and there’s a lot of excitement over this new work. Mahout is a library of very scalable machine learning algorithms that is part of the MapR Distribution for Hadoop. Now these sweeping changes will make Mahout run enormously faster and be much easier to use.

Posted on March 3, 2014 by Ellen Friedman

Does it make sense for me to have a car? If so, which one is the best choice for my needs: a gasoline, hybrid, or electric car? And should I buy or lease? In order to make an effective decision, I need to understand key issues about the design, performance, and cost of cars, regardless of whether or not I actually know how to build one myself.

Posted on February 20, 2014 by Kirk Borne

Ted Dunning (Chief Application Architect at MapR) and Ellen Friedman have written a new O’Reilly Media book on Practical Machine Learning – Innovations in Recommendation (released in January 2014). This book examines one of the most interesting, fun, and powerful data science applications in the big data universe: recommendation systems.

Posted on February 19, 2014 by Ellen Friedman

Scalable machine learning for Apache Hadoop-based systems got a boost recently when the Apache Mahout PMC approved release of the 0.9 version of Mahout. This release is the second in less than a year, and it’s another step toward a stable, mature scalable machine learning library. The open source Apache Mahout community has been very active in the last year, with new releases, active discussions on the user and developer mailing lists, new publications and engagement via Twitter.

Posted on November 11, 2013 by Ellen Friedman

Ted Dunning, MapR's Chief Applications Architect, recently presented an invited talk titled "Which Algorithms Really Matter?" at the CIKM conference in San Francisco on October 30th, and it's generated a lot of discussion. In less than a week after the talk, over 4500 people viewed the slides posted online. Why is there so much interest?

Posted on October 17, 2013 by Ted Dunning

I was recently asked if I was aware of any papers that talk about systematic exploration of parameters for Matrix Factorization (MF).

Automatic grid search is pretty standard in this sort of area. In general, however, offline evaluation of recommender algorithms is extremely dangerous and potentially quite misleading. Modern practitioners believe that real-time testing is producing better results.

Posted on August 29, 2013 by Ted Dunning

Machine learning with the open source project Apache Mahout just got better with the much anticipated new Mahout version 0.8, released on July 25, 2013. It’s leaner, with less-used features removed and some powerful new ones added, including improved recommendation and a super-fast new clustering algorithm.

Posted on August 21, 2013 by Ted Dunning

We are often asked by potential customers if Apache Mahout™ integrates well with the MapR M7 Edition. The quick answer is, “Yes!” Mahout itself is extremely portable, and it easily connects with M7 where appropriate. The advantage of running Mahout on MapR has more to do with development simplicity, speed and reproducibility.

Advantage #1: You can easily mix and match modes without having to move data assets back and forth.

Posted on April 1, 2013 by Ellen Friedman

It’s not just how you store big data but what you can do with it – and that was apparent as Java developers took part in Devoxx conferences in London and Paris last week. Participants had a lot to say about the international presenters, and among those was MapR Chief Application Architect and Apache Mahout committer Ted Dunning.
