Debugging a real-life distributed application can be a pretty daunting task. Most common Google searches don't turn out to be very useful, at least at first. In this blog post, I will give a fairly detailed account of how we managed to accelerate by almost 10x an Apache Kafka/Spark Streaming/Apache Ignite application and turn a development prototype into a useful, stable streaming application that eventually exceeded the performance goals set for the application.
Apache Spark Blog Posts
There has been a lot of research in document image processing over the past 20 years, but not much research has been done in terms of parallel processing. Some of the solutions proposed for parallel processing have been to create threads of execution for each image, or to use GNU Parallel.
The first post discussed creating a machine learning model using Apache Spark’s K-means algorithm to cluster Uber data based on location. This second post will discuss using the saved K-means model with streaming data to do real-time analysis of where and when Uber cars are clustered.
According to Gartner, by 2020, a quarter of a billion connected cars will form a major element of the Internet of Things. Connected vehicles are projected to generate 25GB of data per hour, which can be analyzed to provide real-time monitoring and apps, and will lead to new concepts of mobility and vehicle usage.
Apache Spark can use various cluster managers to execute applications (such as Standalone, YARN, and Apache Mesos). When you install Apache Spark on MapR, you can submit an application in a Stand Alone mode or by using YARN.
This article focuses on YARN and dynamic allocation, a feature that lets Spark add or remove executors dynamically based on the workload.
In this blog post, I’ll help you get started using Apache Spark’s spark.ml Logistic Regression for predicting cancer malignancy. Spark’s spark.ml library goal is to provide a set of APIs on top of DataFrames that help users create and tune machine learning workflows or pipelines.
This post will help you get started using Apache Spark Streaming for consuming and publishing messages with MapR Streams and the Kafka API. Spark Streaming is an extension of the core Spark API that enables continuous data stream processing.
The Apache Spark community is thriving, and they have put a lot of effort into extending Spark. Recently, we have been interested in transforming an XML dataset into something that's easier to query. Our main interest is being able to do data exploration on top of billions of transactions that we get every day. In this blog post, I'll walk you through how to use an Apache Spark package from the community to read any XML file into a DataFrame.
In this week’s Whiteboard Walkthrough, Vinay Bhat, Solution Architect at MapR Technologies, takes you step-by-step through a widespread big data use case: data warehouse offload and building an interactive analytics application using Apache Spark and Apache Drill. Vinay explains how the MapR Converged Data Platform provides unique capabilities to make this process easy and efficient, including support for multi-tenancy.
PySpark is a Spark API that allows you to interact with Spark through the Python shell. If you have a Python programming background, this is an excellent way to get introduced to Spark data types and parallel programming. PySpark is a particularly flexible tool for exploratory big data analysis because it integrates with the rest of the Python data analysis ecosystem, including pandas (DataFrames), NumPy (arrays) and Matplotlib (visualization).
This post is the first in a series where we will review examples of how Joe Blue, a Data Scientist in MapR Professional Services, assisted MapR customers in identifying new data sources and applying machine learning algorithms in order to better understand their customers. The first example in the series is an advertising customer 360°; the next example in the series will be banking and healthcare customer 360° examples.
Logging in Apache Spark is very easy to do, since Spark offers access to a logobject out of the box; only some configuration setups need to be done. In a previous post, we looked at how to do this while identifying some problems that may arise. However, the solution presented might cause some problems when you are ready to collect the logs, since they are distributed across the entire cluster.
Sooner or later, if you eyeball enough data sets, you will encounter some that look like a graph, or are best represented a graph. Whether it's social media, computer networks, or interactions between machines, graph representations are often a straightforward choice for representing relationships among one or more entities.
As a data analyst that primarily used Apache Pig in the past, I eventually needed to program more challenging jobs that required the use of Apache Spark, a more advanced and flexible language. At first, Spark may look a bit intimidating, but this blog post will show that the transition to Spark (especially PySpark) is quite easy.
Random forests are one of the most successful machine learning models for classification. In this blog post, I’ll help you get started using Apache Spark’s spark.ml Random forests for classification of bank loan credit risk.
In this blog post, I’ll describe how to install Apache Drill on the MapR Sandbox for Hadoop, resulting in a "super" sandbox environment that essentially provides the best of both worlds—a fully-functional, single-node MapR/Hadoop/Spark deployment with Apache Drill.
This post will use Apache Spark SQL and DataFrames to query, compare and explore S&P 500, Exxon and Anadarko Petroleum Corporation stock prices.
This post describes step by step on how to deploy Mesos, Marathon, Docker and Spark on a MapR cluster and run various jobs as well as Docker containers using this deployment.
Streaming data is a hot topic these days, and Apache Spark is an excellent framework for streaming. In this blog post, I'll show you how to integrate custom data sources into Spark.
In this post we are going to discuss building a real time solution for credit card fraud detection.
We have experimented with on a 5 node MapR 5.1 cluster running Spark 1.5.2 and will share our experience, difficulties, and solutions on this blog post.
This post will show how to integrate Apache Spark Streaming, MapR-DB, and MapR Streams for fast, event-driven applications.
One of the most useful things to do with machine learning is inform assumptions about customer behaviors. This has a wide variety of applications: everything from helping customers make superior choices (and often, more profitable ones), making them contagiously happy about your business, and building loyalty over time.
In my last post, we explained how we could use SQL to query our data stored within Hadoop. Our engine is capable of reading CSV files from a distributed file system, auto discovering the schema from the files and exposing them as tables through the Hive meta store. All this was done to be able to connect standard SQL clients to our engine and explore our dataset without manually define the schema of our files, avoiding ETL work.
Churn prediction is big business. It minimizes customer defection by predicting which customers are likely to cancel a subscription to a service. Though originally used within the telecommunications industry, it has become common practice across banks, ISPs, insurance firms, and other verticals.
In 2015, MapR shipped three significant core releases : 4.0.2 in January, 4.1 in April, 5.0 and the GA version of Apache Drill in July. While all this was happening, many of my colleagues in engineering (who’ve demonstrated a whole new level of ingenuity and multitasking) were also working on one of the biggest releases in the history of MapR—the converged data platform release (AKA, MapR 5.1).
This post will help you get started using Apache Spark GraphX with Scala on the MapR Sandbox. GraphX is the Apache Spark component for graph-parallel computations, built upon a branch of mathematics called graph theory. It is a distributed graph processing framework that sits on top of the Spark core.
An important part of any application is the underlying log system we incorporate into it. Logs are not only for debugging and traceability, but also for business intelligence. Building a robust logging system within our apps could be use as a great insights of the business problems we are solving.
Decision trees are widely used for the machine learning tasks of classification and regression. In this blog post, I’ll help you get started using Apache Spark’s MLlib machine learning decision trees for classification.
Spark 1.6 is now in Developer Preview on the MapR Converged Data Platform. In this blog post, I’ll share a few details on what Spark 1.6 brings to the table and what you should care about.
Are you ready to start streaming all the events in your business? What happens to your streaming solution when you outgrow your single data center? What happens when you are at a company that is already running multiple data centers and you need to implement streaming across data centers?
We are excited to announce that Spark 1.5.2 is here and is part of the MapR Converged Data Platform. In this blog post, I’ll share a few details on some of the latest capabilities in Spark. If you’re a data engineer, data scientist or in application development, Spark 1.5.2 has new capabilities that you should take advantage of.
In this week's Whiteboard Walkthrough, Will Ochandarena, Director of Product Management at MapR, explains how we are able to build the MapR Streams capabilities that differentiate us from similar products in the market.
In this week's Whiteboard Walkthrough, Mansi Shah, Senior Staff Engineer at MapR, talks about MapR Streams, a global publish-subscribe event streaming system for big data. Mansi will discuss its architecture and how it lets you deliver your data globally and reliably.
In this post, we will give a high-level overview of the components of MapR Streams. Then, we will follow the life of a message from a producer to a consumer, with an oil rig use case as an example.
In this week's Whiteboard Walkthrough, MC Srivas, MapR Co-Founder, walks you through the MapR Converged Data Platform that has been in the making for the last 6 years and is now finally complete with MapR Streams.
Apache Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with Spark SQL, Scala, Hive, Flink, Kylin and more. Zeppelin enables rapid development of Spark and Hadoop workflows with simple, easy visualizations.
SQL engines for Hadoop differ in their approach and functionality. My focus for this blog post is to compare and contrast the functions and performance of Apache Spark and Apache Drill and discuss their expected use cases.
This blog is a first in a series that discusses some design patterns from the book MapReduce design patterns and shows how these patterns can be implemented in Apache Spark(R).
I first heard of Spark in late 2013 when I became interested in Scala, the language in which Spark is written. Some time later, I did a fun data science project trying to predict survival on the Titanic. This turned out to be a great way to get further introduced to Spark concepts and programming. I highly recommend it for any aspiring Spark developers looking for a place to get started.
Spark has a very low entry barrier to get started, which eases the burden of learning a new toolset. It is straightforward to download Spark and configure it in standalone mode on a laptop or server for learning and exploration.
In this blog post, I will explain the resource allocation configurations for Spark on YARN, describe the yarn-client and yarn-cluster modes, and will include examples. Spark can request two resources in YARN: CPU and memory.
Ted Dunning, Chief Applications Architect for MapR, talks about some newer streaming algorithms such as t-digest and streaming k-means.
Did Harper Lee write To Kill a Mockingbird? For many years, conspiracy buffs supported the urban legend that Truman Capote, Lee’s close friend with considerably more literary creds, might have ghost-authored the novel. The author’s reticence on that subject (as well as every other subject) fueled the rumors and it became another urban legend.
Recommendation systems help narrow your choices to those that best meet your particular needs, and they are among the most popular applications of big data processing. In this post we are going to discuss building a recommendation model from movie ratings. We’ll be using an iterative algorithm and parallel processing with Apache Spark MLlib.
This post will help you get started using Apache Spark DataFrames with Scala on the MapR Sandbox.
In this week's Whiteboard Walkthrough, Anoop Dawar, Senior Product Director at MapR, shows you the basics of Apache Spark and how it is different from MapReduce.
You already know Hadoop as one of the best, cost-effective platforms for deploying large-scale big data applications. But Hadoop is even more powerful when combined with execution capabilities provided by Apache Spark. Although Spark can be used with a number of big data platforms, with the right Hadoop distribution, you can build big data applications quickly using tools you already know.
In this demo we are using Spark and PySpark to process and analyze the data set, calculate aggregate statistics about the user base in a PySpark script, persist all of that back into MapR-DB for use in Spark and Tableau, and finally use MLlib to build logistic regression models.
In this post, I’ll show you how to build a simple real-time dashboard using Spark on MapR.
Building a good classification model requires leveraging the predictive power from your data and that’s a challenge whether you’re looking at four thousand records or four billion; in machine learning parlance, this step is referred to as “feature extraction”.
Nearly one year ago the Hadoop community began to embrace Apache Spark as a powerful batch processing engine. Today, many organizations and projects are augmenting their Hadoop capabilities with Spark. As part of this trend, the Apache Hive community is working to add Spark as an execution engine for Hive. The Hive-on-Spark work is being tracked by HIVE-7292 which is one of the most popular JIRAs in the Hadoop ecosystem. Furthermore, three weeks ago, the Hive-on-Spark team offered the first demo of Hive on Spark.
The November release of the Apache open source packages in MapR was made available for customers earlier this month. We are excited to deliver some major upgrades to existing packages.
Here are the highlights:
Hi, welcome to MapR Whiteboard Walkthrough sessions. My name is Abhinav and I'm one of the data engineers here at MapR, and the purpose of this video is to go through the comparison of Storm Trident and Spark Streaming. As you may be aware, Storm and Spark are very popular projects within the community. Storm is a stream processor that came out from Twitter in 2009, and Spark is a general purpose in-memory processing framework, both of which offer stream processing solutions.
M.C. Srivas, CTO and Co-Founder of MapR Technologies, spoke recently at Spark Summit 2014 on “Why Spark on Hadoop Matters.” Spark, with an in-memory processing framework, provides a complimentary full stack on Hadoop, and this integration is showing tremendous promise for MapR customers.
Large clusters that store enterprise big data for the long run, while exposing that data to a variety of workloads at the same time, are turning out to be the preferred deployment option for Hadoop. This model makes it easy for businesses to avoid data silos and progressively build a full suite of big data applications over time.
Apache Spark is currently one of the most active projects in the Hadoop ecosystem, and there’s been plenty of hype about it in the past several months. In the latest webinar from the Data Science Central webinar series, titled “Let Spark Fly: Advantages and Use Cases for Spark on Hadoop,” we cut through the noise to uncover practical advantages for having the full set of Spark technologies at your disposal.
Blog Sign Up
Sign up and get the top posts from each week delivered to your inbox every Friday!