Apache Spark Blog Posts

Posted on January 17, 2017 by Mathieu Dumoulin

Debugging a real-life distributed application can be a pretty daunting task. Most common Google searches don't turn out to be very useful, at least at first. In this blog post, I will give a fairly detailed account of how we managed to accelerate an Apache Kafka/Spark Streaming/Apache Ignite application by almost 10x, turning a development prototype into a useful, stable streaming application that eventually exceeded the performance goals set for it.

Posted on January 6, 2017 by Ranjit Lingaiah

There has been a lot of research in document image processing over the past 20 years, but not much research has been done in terms of parallel processing. Some of the solutions proposed for parallel processing have been to create threads of execution for each image, or to use GNU Parallel.

Posted on January 5, 2017 by Carol McDonald

The first post discussed creating a machine learning model using Apache Spark’s K-means algorithm to cluster Uber data based on location. This second post will discuss using the saved K-means model with streaming data to do real-time analysis of where and when Uber cars are clustered.

Posted on November 28, 2016 by Carol McDonald

According to Gartner, by 2020, a quarter of a billion connected cars will form a major element of the Internet of Things. Connected vehicles are projected to generate 25GB of data per hour, which can be analyzed to provide real-time monitoring and apps, and will lead to new concepts of mobility and vehicle usage.

Posted on November 3, 2016 by Tugdual Grall

Apache Spark can use various cluster managers to execute applications (such as Standalone, YARN, and Apache Mesos). When you install Apache Spark on MapR, you can submit an application in Standalone mode or by using YARN.

This article focuses on YARN and dynamic allocation, a feature that lets Spark add or remove executors dynamically based on the workload.
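As a sketch of what that setup can look like (property names are from the standard Spark configuration documentation; the values here are illustrative, not recommendations), dynamic allocation is typically enabled in spark-defaults.conf together with the external shuffle service, which lets executors be removed without losing their shuffle output:

```properties
# Enable dynamic allocation of executors on YARN
spark.dynamicAllocation.enabled              true
# The external shuffle service preserves shuffle files
# written by executors that are later removed
spark.shuffle.service.enabled                true
# Illustrative bounds on the executor count
spark.dynamicAllocation.minExecutors         2
spark.dynamicAllocation.maxExecutors         20
# Release an executor after it has been idle this long
spark.dynamicAllocation.executorIdleTimeout  60s
```

With these in place, Spark grows the executor pool when tasks queue up and shrinks it when executors sit idle, instead of holding a fixed allocation for the life of the application.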

Posted on October 17, 2016 by Carol McDonald

In this blog post, I’ll help you get started using Apache Spark’s spark.ml Logistic Regression for predicting cancer malignancy. The goal of Spark’s spark.ml library is to provide a set of APIs on top of DataFrames that help users create and tune machine learning workflows or pipelines.

Posted on September 6, 2016 by Carol McDonald

This post will help you get started using Apache Spark Streaming for consuming and publishing messages with MapR Streams and the Kafka API. Spark Streaming is an extension of the core Spark API that enables continuous data stream processing.

Posted on August 23, 2016 by Nicolas Perez

The Apache Spark community is thriving, and they have put a lot of effort into extending Spark. Recently, we have been interested in transforming an XML dataset into something that's easier to query. Our main interest is being able to do data exploration on top of billions of transactions that we get every day. In this blog post, I'll walk you through how to use an Apache Spark package from the community to read any XML file into a DataFrame.

Posted on August 17, 2016 by Vinay Bhat

In this week’s Whiteboard Walkthrough, Vinay Bhat, Solution Architect at MapR Technologies, takes you step-by-step through a widespread big data use case: data warehouse offload and building an interactive analytics application using Apache Spark and Apache Drill. Vinay explains how the MapR Converged Data Platform provides unique capabilities to make this process easy and efficient, including support for multi-tenancy.

Posted on August 15, 2016 by Justin Brandenburg

PySpark is a Spark API that allows you to interact with Spark through the Python shell. If you have a Python programming background, this is an excellent way to get introduced to Spark data types and parallel programming. PySpark is a particularly flexible tool for exploratory big data analysis because it integrates with the rest of the Python data analysis ecosystem, including pandas (DataFrames), NumPy (arrays) and Matplotlib (visualization).

Posted on August 8, 2016 by Carol McDonald

This post is the first in a series where we will review examples of how Joe Blue, a Data Scientist in MapR Professional Services, assisted MapR customers in identifying new data sources and applying machine learning algorithms in order to better understand their customers. The first example in the series is an advertising customer 360°; the next example in the series will be banking and healthcare customer 360° examples.

Posted on July 28, 2016 by Nicolas Perez

Logging in Apache Spark is very easy to do, since Spark offers access to a log object out of the box; only some configuration setup is needed. In a previous post, we looked at how to do this while identifying some problems that may arise. However, the solution presented might cause some problems when you are ready to collect the logs, since they are distributed across the entire cluster.
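For context, the configuration side usually amounts to editing Spark's conf/log4j.properties. This is a generic sketch of that file, not the exact setup from the post, and the application package name is a hypothetical placeholder:

```properties
# Default logging level for the root logger, sent to the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Quiet down chatty Spark internals
log4j.logger.org.apache.spark=WARN
# Keep our own application's logs verbose (hypothetical package name)
log4j.logger.com.mycompany.myapp=INFO
```

Because each executor writes to its own local log through this configuration, the collection problem described above remains: the output still has to be gathered from every node in the cluster.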

Posted on July 14, 2016 by Nick Amato

Sooner or later, if you eyeball enough data sets, you will encounter some that look like a graph, or are best represented as a graph. Whether it's social media, computer networks, or interactions between machines, graph representations are often a straightforward choice for representing relationships among one or more entities.

Posted on July 13, 2016 by Philippe Cuzey

As a data analyst who primarily used Apache Pig in the past, I eventually needed to program more challenging jobs that required the use of Apache Spark, a more advanced and flexible framework. At first, Spark may look a bit intimidating, but this blog post will show that the transition to Spark (especially PySpark) is quite easy.

Posted on July 12, 2016 by Carol McDonald

Random forests are one of the most successful machine learning models for classification. In this blog post, I’ll help you get started using Apache Spark’s spark.ml random forests for classification of bank loan credit risk.

Posted on July 8, 2016 by Craig Warman

In this blog post, I’ll describe how to install Apache Drill on the MapR Sandbox for Hadoop, resulting in a "super" sandbox environment that essentially provides the best of both worlds—a fully-functional, single-node MapR/Hadoop/Spark deployment with Apache Drill.

Posted on June 29, 2016 by Carol McDonald

This post will use Apache Spark SQL and DataFrames to query, compare and explore S&P 500, Exxon and Anadarko Petroleum Corporation stock prices.

Posted on June 28, 2016 by Martijn Kieboom

This post describes, step by step, how to deploy Mesos, Marathon, Docker, and Spark on a MapR cluster, and how to run various jobs and Docker containers using this deployment.

Posted on May 10, 2016 by Nicolas Perez

Streaming data is a hot topic these days, and Apache Spark is an excellent framework for streaming. In this blog post, I'll show you how to integrate custom data sources into Spark.

Posted on May 3, 2016 by Carol McDonald

In this post, we are going to discuss building a real-time solution for credit card fraud detection.

Posted on April 27, 2016 by Mathieu Dumoulin

We experimented on a 5-node MapR 5.1 cluster running Spark 1.5.2, and will share our experience, difficulties, and solutions in this blog post.

Posted on April 22, 2016 by Carol McDonald

This post will show how to integrate Apache Spark Streaming, MapR-DB, and MapR Streams for fast, event-driven applications.

Posted on April 20, 2016 by Nick Amato

One of the most useful things to do with machine learning is inform assumptions about customer behaviors. This has a wide variety of applications: everything from helping customers make better (and often more profitable) choices, to making them contagiously happy about your business, to building loyalty over time.

Posted on March 23, 2016 by Nicolas Perez

In my last post, we explained how we could use SQL to query our data stored within Hadoop. Our engine is capable of reading CSV files from a distributed file system, auto-discovering the schema from the files, and exposing them as tables through the Hive metastore. All this was done so that standard SQL clients could connect to our engine and explore our dataset without manually defining the schema of our files, avoiding ETL work.

Posted on March 22, 2016 by Ben Sadeghi

Churn prediction is big business. It minimizes customer defection by predicting which customers are likely to cancel a subscription to a service. Though originally used within the telecommunications industry, it has become common practice across banks, ISPs, insurance firms, and other verticals.

Posted on March 17, 2016 by Nicolas Perez

SQL has been around for a while, and people like it. However, the engines that power SQL have changed over time in order to solve new problems and keep up with demands from consumers.

Posted on March 8, 2016 by Anoop Dawar

In 2015, MapR shipped three significant core releases: 4.0.2 in January, 4.1 in April, and 5.0 along with the GA version of Apache Drill in July. While all this was happening, many of my colleagues in engineering (who’ve demonstrated a whole new level of ingenuity and multitasking) were also working on one of the biggest releases in the history of MapR: the converged data platform release (AKA MapR 5.1).

Posted on March 8, 2016 by Carol McDonald

This post will help you get started using Apache Spark GraphX with Scala on the MapR Sandbox. GraphX is the Apache Spark component for graph-parallel computations, built upon a branch of mathematics called graph theory. It is a distributed graph processing framework that sits on top of the Spark core.

Posted on March 1, 2016 by Nicolas Perez

An important part of any application is the underlying logging system we incorporate into it. Logs are not only for debugging and traceability, but also for business intelligence. A robust logging system within our apps can be used to gain great insight into the business problems we are solving.

Posted on February 22, 2016 by Carol McDonald

Decision trees are widely used for the machine learning tasks of classification and regression. In this blog post, I’ll help you get started using Apache Spark’s MLlib machine learning decision trees for classification.

Posted on February 17, 2016 by Sameer Nori

Spark 1.6 is now in Developer Preview on the MapR Converged Data Platform. In this blog post, I’ll share a few details on what Spark 1.6 brings to the table and what you should care about.

Posted on January 28, 2016 by Joseph Blue

This time, it’s personal. Super Bowl 50 is being played at Levi’s Stadium in Santa Clara – within sight of many of the world’s most innovative technology companies, including MapR.

Posted on January 26, 2016 by Jim Scott

Are you ready to start streaming all the events in your business? What happens to your streaming solution when you outgrow your single data center? What happens when you are at a company that is already running multiple data centers and you need to implement streaming across data centers?

Posted on December 29, 2015 by Sameer Nori

We are excited to announce that Spark 1.5.2 is here and is part of the MapR Converged Data Platform. In this blog post, I’ll share a few details on some of the latest capabilities in Spark. If you’re a data engineer, data scientist or in application development, Spark 1.5.2 has new capabilities that you should take advantage of.

Posted on December 17, 2015 by Will Ochandarena

In this week's Whiteboard Walkthrough, Will Ochandarena, Director of Product Management at MapR, explains how we are able to build the MapR Streams capabilities that differentiate us from similar products in the market.

Posted on December 10, 2015 by Mansi Shah

In this week's Whiteboard Walkthrough, Mansi Shah, Senior Staff Engineer at MapR, talks about MapR Streams, a global publish-subscribe event streaming system for big data. Mansi will discuss its architecture and how it lets you deliver your data globally and reliably.

Posted on December 9, 2015 by Carol McDonald

In this post, we will give a high-level overview of the components of MapR Streams. Then, we will follow the life of a message from a producer to a consumer, with an oil rig use case as an example.

Posted on December 8, 2015 by M.C. Srivas

In this week's Whiteboard Walkthrough, MC Srivas, MapR Co-Founder, walks you through the MapR Converged Data Platform that has been in the making for the last 6 years and is now finally complete with MapR Streams.

Posted on November 23, 2015 by Jim Scott

Apache Spark is awesome. Python is awesome. This post will show you how to use your favorite programming language to process large datasets quickly.

Posted on November 19, 2015 by Paul Curtis

Apache Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with Spark SQL, Scala, Hive, Flink, Kylin and more. Zeppelin enables rapid development of Spark and Hadoop workflows with simple, easy visualizations.

Posted on November 4, 2015 by Mitsutoshi Kiuchi

SQL engines for Hadoop differ in their approach and functionality. My focus for this blog post is to compare and contrast the functions and performance of Apache Spark and Apache Drill and discuss their expected use cases.

Posted on November 2, 2015 by Carol McDonald

This blog is the first in a series that discusses some design patterns from the book MapReduce Design Patterns and shows how these patterns can be implemented in Apache Spark.

Posted on October 27, 2015 by Radek Ostrowski

I first heard of Spark in late 2013 when I became interested in Scala, the language in which Spark is written. Some time later, I did a fun data science project trying to predict survival on the Titanic. This turned out to be a great way to get further introduced to Spark concepts and programming. I highly recommend it for any aspiring Spark developers looking for a place to get started.

Posted on September 30, 2015 by Jim Scott

Spark has a very low entry barrier to get started, which eases the burden of learning a new toolset. It is straightforward to download Spark and configure it in standalone mode on a laptop or server for learning and exploration.

Posted on September 11, 2015 by Hao Zhu

In this blog post, I will explain the resource allocation configurations for Spark on YARN, describe the yarn-client and yarn-cluster modes, and will include examples. Spark can request two resources in YARN: CPU and memory.
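As an illustration of requesting those two resources (the flag names are from the Spark 1.x documentation; the application class, jar, and values are hypothetical examples, not recommendations), a yarn-cluster submission might look like:

```shell
# Submit to YARN in cluster mode, requesting CPU and memory explicitly.
# Class name, jar, and resource values are hypothetical.
spark-submit \
  --master yarn-cluster \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  --driver-memory 2g \
  --class com.example.MyApp \
  my-app.jar
```

In yarn-cluster mode the driver itself runs inside YARN, so its memory also counts against the cluster's resources; in yarn-client mode the driver runs on the submitting machine instead.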

Posted on September 4, 2015 by Carol McDonald

This post will help you get started using Apache Spark Streaming with HBase on the MapR Sandbox. Spark Streaming is an extension of the core Spark API that enables continuous data stream processing.

Posted on August 17, 2015 by Ted Dunning

Ted Dunning, Chief Applications Architect for MapR, talks about some newer streaming algorithms such as t-digest and streaming k-means.

Posted on August 5, 2015 by Joseph Blue

Did Harper Lee write To Kill a Mockingbird? For many years, conspiracy buffs supported the urban legend that Truman Capote, Lee’s close friend with considerably more literary creds, might have ghost-authored the novel. The author’s reticence on that subject (as well as every other subject) only fueled the rumors.

Posted on August 3, 2015 by Carol McDonald

Recommendation systems help narrow your choices to those that best meet your particular needs, and they are among the most popular applications of big data processing. In this post we are going to discuss building a recommendation model from movie ratings. We’ll be using an iterative algorithm and parallel processing with Apache Spark MLlib.

Posted on June 30, 2015 by Carol McDonald

This post will help you get started using the Apache Spark Web UI to understand how your Spark application is executing on a Hadoop cluster.

Posted on June 24, 2015 by Carol McDonald

This post will help you get started using Apache Spark DataFrames with Scala on the MapR Sandbox.

Posted on June 17, 2015 by Anoop Dawar

In this week's Whiteboard Walkthrough, Anoop Dawar, Senior Product Director at MapR, shows you the basics of Apache Spark and how it is different from MapReduce.

Posted on June 5, 2015 by Nitin Bandugula

You already know Hadoop as one of the best, cost-effective platforms for deploying large-scale big data applications. But Hadoop is even more powerful when combined with execution capabilities provided by Apache Spark. Although Spark can be used with a number of big data platforms, with the right Hadoop distribution, you can build big data applications quickly using tools you already know.

Posted on May 27, 2015 by Nick Amato

In this demo we are using Spark and PySpark to process and analyze the data set, calculate aggregate statistics about the user base in a PySpark script, persist all of that back into MapR-DB for use in Spark and Tableau, and finally use MLlib to build logistic regression models.

Posted on May 12, 2015 by Nick Amato

In this post, I’ll give an example of how we can make predictions that enable us to maximize revenue and ensure the best customer experience. We'll do this using the output of the Spark code from our last adventure.

Posted on May 11, 2015 by Nick Amato

In this post, I’ll show you how to build a simple real-time dashboard using Spark on MapR.

Posted on May 6, 2015 by Joseph Blue

Building a good classification model requires leveraging the predictive power from your data and that’s a challenge whether you’re looking at four thousand records or four billion; in machine learning parlance, this step is referred to as “feature extraction”.

Posted on December 16, 2014 by Na Yang

Nearly one year ago, the Hadoop community began to embrace Apache Spark as a powerful batch processing engine. Today, many organizations and projects are augmenting their Hadoop capabilities with Spark. As part of this trend, the Apache Hive community is working to add Spark as an execution engine for Hive. The Hive-on-Spark work is being tracked by HIVE-7292, which is one of the most popular JIRAs in the Hadoop ecosystem. Furthermore, three weeks ago, the Hive-on-Spark team offered the first demo of Hive on Spark.

Posted on December 1, 2014 by Nitin Bandugula

The November release of the Apache open source packages in MapR was made available for customers earlier this month. We are excited to deliver some major upgrades to existing packages.

Here are the highlights:

Posted on October 30, 2014 by Abhinav Chawade

Hi, welcome to MapR Whiteboard Walkthrough sessions. My name is Abhinav and I'm one of the data engineers here at MapR, and the purpose of this video is to go through a comparison of Storm Trident and Spark Streaming. As you may be aware, Storm and Spark are very popular projects within the community. Storm is a stream processor that came out of Twitter, and Spark is a general-purpose in-memory processing framework, both of which offer stream processing solutions.

Posted on September 5, 2014 by Pat Farrel

Combining a search engine with Mahout has created a recommender that is extremely fast and scalable and seamlessly blends results using collaborative filtering data and metadata. In the first post, we described creating a co-occurrence indicator matrix for a recommender. In this follow-up post, we dive deeper into the performance and quality of the recommendations.

Posted on July 18, 2014 by Nitin Bandugula

M.C. Srivas, CTO and Co-Founder of MapR Technologies, spoke recently at Spark Summit 2014 on “Why Spark on Hadoop Matters.” Spark, with an in-memory processing framework, provides a complementary full stack on Hadoop, and this integration is showing tremendous promise for MapR customers.

Posted on June 13, 2014 by Nitin Bandugula

Large clusters that store enterprise big data for the long run, while exposing that data to a variety of workloads at the same time, are turning out to be the preferred deployment option for Hadoop. This model makes it easy for businesses to avoid data silos and progressively build a full suite of big data applications over time.  

Posted on May 6, 2014 by Michele Nemschoff

Apache Spark is currently one of the most active projects in the Hadoop ecosystem, and there’s been plenty of hype about it in the past several months. In the latest webinar from the Data Science Central webinar series, titled “Let Spark Fly: Advantages and Use Cases for Spark on Hadoop,” we cut through the noise to uncover practical advantages for having the full set of Spark technologies at your disposal.

Posted on April 10, 2014 by Tomer Shiran

With over 500 paying customers, my team and I have the opportunity to talk to many organizations that are leveraging Hadoop in production to extract value from big data. One of the most common topics raised by our customers in recent months is Apache Spark. Some customers just want to learn more about the advantages of this technology and the use cases that it addresses, while others are already running it in production with the MapR Distribution.
