There has been a lot of research in document image processing over the past 20 years, but not much research has been done in terms of parallel processing. Some of the solutions proposed for parallel processing have been to create threads of execution for each image, or to use GNU Parallel.
Machine Learning Blog Posts
The first post discussed creating a machine learning model using Apache Spark’s K-means algorithm to cluster Uber data based on location. This second post will discuss using the saved K-means model with streaming data to do real-time analysis of where and when Uber cars are clustered.
According to Gartner, by 2020, a quarter of a billion connected cars will form a major element of the Internet of Things. Connected vehicles are projected to generate 25GB of data per hour, which can be analyzed to provide real-time monitoring and apps, and will lead to new concepts of mobility and vehicle usage.
In this blog post, I’ll help you get started using Apache Spark’s spark.ml Logistic Regression for predicting cancer malignancy. Spark’s spark.ml library goal is to provide a set of APIs on top of DataFrames that help users create and tune machine learning workflows or pipelines.
Apache PredicitonIO is an open source machine learning server. In this article, we integrate Apache PredictionIO with the MapR Converged Data Platform 5.1 as a backend. Specifically, we use MapR-DB (1.1.1) for event data storage, ElasticSearch for metadata storage, and MapR-FS for model data storage.
This post is the first in a series where we will review examples of how Joe Blue, a Data Scientist in MapR Professional Services, assisted MapR customers in identifying new data sources and applying machine learning algorithms in order to better understand their customers. The first example in the series is an advertising customer 360°; the next example in the series will be banking and healthcare customer 360° examples.
Random forests are one of the most successful machine learning models for classification. In this blog post, I’ll help you get started using Apache Spark’s spark.ml Random forests for classification of bank loan credit risk.
One of the most useful things to do with machine learning is inform assumptions about customer behaviors. This has a wide variety of applications: everything from helping customers make superior choices (and often, more profitable ones), making them contagiously happy about your business, and building loyalty over time.
Having participated in a number of fantasy sports leagues and being a Data Scientist at MapR gives me a unique perspective on my approach to choosing who I think will most likely “win” the tournament...my predictions for the six players, ranked in order, who I predict will most likely to finish in 10th or better place this year (and hopefully 1st) based on my statistical modeling are:
Churn prediction is big business. It minimizes customer defection by predicting which customers are likely to cancel a subscription to a service. Though originally used within the telecommunications industry, it has become common practice across banks, ISPs, insurance firms, and other verticals.
Decision trees are widely used for the machine learning tasks of classification and regression. In this blog post, I’ll help you get started using Apache Spark’s MLlib machine learning decision trees for classification.
Streaming data is of growing interest to many organizations, and most applications need to use a producer-consumer model to ingest and process data in real time. Many messaging solutions exist today on the market, but few of them have been built to handle the challenges of modern deployment related to IoT, large web based applications and related big data projects.
XGBoost is a library that is designed for boosted (tree) algorithms. It has become a popular machine learning framework among data science practitioners, especially on Kaggle, which is a platform for data prediction competitions where researchers post their data and statisticians and data miners compete to produce the best models.
Ted Dunning, Chief Applications Architect for MapR, talks about some newer streaming algorithms such as t-digest and streaming k-means.
Recommendation systems help narrow your choices to those that best meet your particular needs, and they are among the most popular applications of big data processing. In this post we are going to discuss building a recommendation model from movie ratings. We’ll be using an iterative algorithm and parallel processing with Apache Spark MLlib.
In this demo we are using Spark and PySpark to process and analyze the data set, calculate aggregate statistics about the user base in a PySpark script, persist all of that back into MapR-DB for use in Spark and Tableau, and finally use MLlib to build logistic regression models.
Building a good classification model requires leveraging the predictive power from your data and that’s a challenge whether you’re looking at four thousand records or four billion; in machine learning parlance, this step is referred to as “feature extraction”.
Recommendation engines help narrow your choices to those that best meet your particular needs. In this post, we’re going to take a closer look at how all the different components of a recommendation engine work together. We’re going to use collaborative filtering on movie ratings data to recommend movies. The key components are a collaborative filtering algorithm in Apache Mahout to build and train a machine learning model, and search technology from Elasticsearch to simplify deployment of the recommender.
This is the first installment of a two-part series on the value of doing small data analyses on a big data collection. In this first article, we describe the applications and benefits of “small data” in general terms from several different perspectives. In Part 2 of this series, we’ll spend some quality time with one specific algorithm (Local Linear Embedding) from a broader class of algorithms (Manifold Learning) that enable local subsets of data (i.e., small data) to be used in developing a global understanding of the full big data collection.
A project manager once told me that “any job worth doing is worth doing poorly.” I understood exactly what she meant, and she knew that I would understand, especially when she preceded our conversation with these words: “I wouldn’t say this to everyone, but I know you will understand what I mean.” The message was clear to me because I was a perfectionist (and hopefully I have learned over the years to be less of a perfectionist thanks to my project manager’s wise counsel).
The recent Skytree and MapR webinar ”Predictive Analytics with Machine Learning and Hadoop” proved to be highly interactive and engaging. As promised, Nitin and Jin have provided answers to questions that we were not able to get to during the webinar:
Big changes are underway for the open source machine learning project Apache Mahout, and there’s a lot of excitement over this new work. Mahout is a library of very scalable machine learning algorithms that is part of the MapR Distribution for Hadoop. Now these new sweeping changes will make Mahout run enormously faster and much easier to use.
Ted Dunning (Chief Application Architect at MapR) and Ellen Friedman have written a new O’Reilly Media book on “Practical Machine Learning – Innovations in Recommendation" (released in January 2014). This book examines one of the most interesting, fun, and powerful data science applications in the big data universe: recommendation systems.
Scalable machine learning for Apache Hadoop-based systems got a boost recently when the Apache Mahout PMC approved release of the 0.9 version of Mahout. This release is the second in less than a year, and it’s another step toward a stable, mature scalable machine learning library. The open source Apache Mahout community has been very active in the last year, with new releases, active discussions on the user and developer mailing lists, new publications and engagement via Twitter.
Automatic grid search is pretty standard in this sort of area. In general, however, offline evaluation of recommender algorithms is extremely dangerous and potentially quite misleading. Modern practitioners believe that real-time testing is producing better results.
Advantage #1: You can easily mix and match modes without having to move data assets back and forth.
Blog Sign Up
Sign up and get the top posts from each week delivered to your inbox every Friday!