Streaming Blog Posts

Posted on January 11, 2017 by Ian Downard

A lot of people choose MapR as their core platform for processing and storing big data because of its advantages for speed and performance. MapR consistently performs faster than any other big data platform for all kinds of applications, including Hadoop, distributed file I/O, NoSQL data storage, and data streaming. In this post, I’m focusing on the last of these to provide some perspective on how much better/faster/cheaper MapR Streams can be compared to Apache Kafka as a data streaming technology.

Posted on January 6, 2017 by Ranjit Lingaiah

There has been a lot of research in document image processing over the past 20 years, but not much research has been done in terms of parallel processing. Some of the solutions proposed for parallel processing have been to create threads of execution for each image, or to use GNU Parallel.

Posted on December 9, 2016 by Prashant Rathi

At MapR, we have set up multiple clusters for several of our enterprise customers, and we have brought that knowledge and those best practices to MapR Installer. Increasingly, these deployments have not only grown in number, but have also evolved based on the type, purpose, and lifetime of these clusters.

Posted on October 31, 2016 by Ian Downard

One of the challenges when working with streams is the transitory nature of their data. Many applications require data to be persisted far beyond the point at which said data has any practical value to streaming analytics.

Posted on October 26, 2016 by Jorge Geronimo

In this week's Whiteboard Walkthrough, Jorge Geronimo, Solutions Architect at MapR, explains how with a single line of code you can create a replica of a MapR data stream within the same cluster or on another cluster, even in another part of the world. Jorge also describes multi-master replication for streaming data and how MapR Streams' unique capability for geo-distributed replication with preserved offsets offers advantages for working with streaming data.

Posted on October 19, 2016 by Tugdual Grall

Druid is a high-performance, column-oriented, distributed data store. Druid supports streaming data ingestion and offers insights on events immediately after they occur. Druid can ingest data from multiple data sources, including Apache Kafka.

Posted on October 13, 2016 by Tugdual Grall

This article will guide you through the steps to use Apache Flink with MapR Streams. MapR Streams is a distributed messaging system for streaming event data at scale, and it’s integrated into the MapR Converged Data Platform, based on the Apache Kafka API (0.9.0).

Posted on October 6, 2016 by Nick Amato

In this week's whiteboard walkthrough, Nick Amato, Director Technical Marketing at MapR, explains the advantages of a publish-subscribe model for real-time data streams.

Posted on October 5, 2016 by Ted Dunning

In this Whiteboard Walkthrough, MapR’s Chief Application Architect, Ted Dunning, explains the move from state to flow and shows how it works in a financial services example. Ted describes the revolution underway in moving from a traditional system with multiple programs built around a shared database to a new flow-based system that instead uses a shared state queue in the form of a message stream built with technology such as Apache Kafka or MapR Streams. This new architecture lets decisions be made locally and supports a micro-services style approach.

Posted on September 22, 2016 by Raphaël Velfre

A very common use case for the MapR Converged Data Platform is collecting and analyzing data from a variety of sources, including traditional relational databases. Until recently, data engineers would build an ETL pipeline that periodically walks the relational database and loads the data into files on the MapR cluster, then perform batch analytics on that data.

Posted on September 21, 2016 by Stephan Ewen

In this week’s Whiteboard Walkthrough, Stephan Ewen, PMC member of Apache Flink and CTO of data Artisans, describes a valuable capability of Apache Flink stream processing: grouping events together that were observed to occur within a configurable window of time, the event time.
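To make the idea of event-time grouping concrete, here is a minimal plain-Python sketch. It is not Flink code and does not use any Flink API; the function name `tumbling_windows`, the integer timestamps, and the sample events are all assumptions made for illustration. It only shows that the timestamp carried by each event, not its arrival order, decides which window it lands in. A real stream processor also has to decide when a window is complete and how to treat late events, which this sketch ignores.

```python
from collections import defaultdict


def tumbling_windows(events, window_size):
    """Group (event_time, value) pairs into fixed-size windows by event time.

    Hypothetical helper for illustration: returns a dict that maps each
    window's start time to the values whose event time falls inside it.
    """
    windows = defaultdict(list)
    for event_time, value in events:
        # Align the event's own timestamp to the start of its window.
        window_start = (event_time // window_size) * window_size
        windows[window_start].append(value)
    return dict(sorted(windows.items()))


# Events arrive out of order; the timestamp inside each event decides its window.
arrivals = [(12, "c"), (3, "a"), (7, "b"), (10, "d"), (1, "e")]
result = tumbling_windows(arrivals, window_size=5)
print(result)
```

Changing `window_size` corresponds to the "configurable window of time" mentioned above: the same events are simply regrouped under different window boundaries.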

Posted on September 6, 2016 by Carol McDonald

This post will help you get started using Apache Spark Streaming for consuming and publishing messages with MapR Streams and the Kafka API. Spark Streaming is an extension of the core Spark API that enables continuous data stream processing.

Posted on August 29, 2016 by Carol McDonald

Building a robust, responsive, secure data service for healthcare is tricky. For starters, healthcare data lends itself to multiple models: Document representation for patient profile views or updates; Graph representation to query relationships between patients, providers, and medications; Search representation for advanced lookups. This post will describe how stream-first architectures can solve these challenges, and look at how this has been implemented at Liaison Technologies.

Posted on July 21, 2016 by Stephan Ewen

In this week’s Whiteboard Walkthrough, Stephan Ewen, PMC member of Apache Flink and CTO of data Artisans, explains how to use savepoints, a unique feature in Apache Flink stream processing, to let you reprocess data, do bug fixes, deal with upgrades, and do A/B testing.

Posted on June 29, 2016 by Carol McDonald

This post will use Apache Spark SQL and DataFrames to query, compare and explore S&P 500, Exxon and Anadarko Petroleum Corporation stock prices.

Posted on June 23, 2016 by Terry He

In this week's Whiteboard Walkthrough, Terry He, Software Engineer at MapR, walks you through how to tune MapR Streams running an application with Apache Flink for optimal performance.

Posted on May 10, 2016 by Nicolas Perez

Streaming data is a hot topic these days, and Apache Spark is an excellent framework for streaming. In this blog post, I'll show you how to integrate custom data sources into Spark.

Posted on May 3, 2016 by Carol McDonald

In this post we are going to discuss building a real time solution for credit card fraud detection.

Posted on April 22, 2016 by Carol McDonald

This post will show how to integrate Apache Spark Streaming, MapR-DB, and MapR Streams for fast, event-driven applications.

Posted on April 12, 2016 by Kostas Tzoumas

In this post, we focus on a seemingly simple, extremely widespread, but surprisingly difficult (in fact, an unsolved) problem in practice: counting in streams.

Posted on March 10, 2016 by Tugdual Grall

MapR Streams is a new distributed messaging system for streaming event data at scale, and it’s integrated into the MapR converged platform. MapR Streams uses the Apache Kafka API, so if you’re already familiar with Kafka, you’ll find it particularly easy to get started with MapR Streams.

Posted on March 8, 2016 by Anoop Dawar

In 2015, MapR shipped three significant core releases: 4.0.2 in January, 4.1 in April, and 5.0, along with the GA version of Apache Drill, in July. While all this was happening, many of my colleagues in engineering (who’ve demonstrated a whole new level of ingenuity and multitasking) were also working on one of the biggest releases in the history of MapR: the converged data platform release (AKA MapR 5.1).

Posted on March 8, 2016 by Jim Scott

The distributed computation world has seen a massive shift in the last decade. Apache Hadoop showed up on the scene and brought with it new ways to handle distributed computation at scale. It wasn’t the easiest to work with, and the APIs were far from perfect, but they worked.

Posted on March 3, 2016 by Nick Amato

Moving a data analysis platform from a “submit the job and wait” model to a “make things happen in real-time” one isn’t easy. If it were, the world wouldn’t spend so much time talking about it.

Posted on February 9, 2016 by Tugdual Grall

Streaming data is of growing interest to many organizations, and most applications need to use a producer-consumer model to ingest and process data in real time. Many messaging solutions exist on the market today, but few of them have been built to handle the challenges of modern deployments related to IoT, large web-based applications, and related big data projects.

Posted on February 3, 2016 by Fabian Hueske

A very large part of today’s data processing is done on data that is continuously produced, e.g., data from user activity logs, web logs, machines, sensors, and database transactions. Until now, data streaming technology was lacking in several areas...

Posted on February 2, 2016 by Will Ochandarena

Two blogs came out recently that share some very interesting perspectives on the blurring lines between architectures and implementation of different data services, ranging from file systems to databases to publish/subscribe streaming services.

Posted on January 26, 2016 by Jim Scott

Are you ready to start streaming all the events in your business? What happens to your streaming solution when you outgrow your single data center? What happens when you are at a company that is already running multiple data centers and you need to implement streaming across data centers?

Posted on December 29, 2015 by Sameer Nori

We are excited to announce that Spark 1.5.2 is here and is part of the MapR Converged Data Platform. In this blog post, I’ll share a few details on some of the latest capabilities in Spark. If you’re a data engineer, data scientist or in application development, Spark 1.5.2 has new capabilities that you should take advantage of.

Posted on December 17, 2015 by Will Ochandarena

In this week's Whiteboard Walkthrough, Will Ochandarena, Director of Product Management at MapR, explains how we are able to build the MapR Streams capabilities that differentiate us from similar products in the market.

Posted on December 10, 2015 by Mansi Shah

In this week's Whiteboard Walkthrough, Mansi Shah, Senior Staff Engineer at MapR, talks about MapR Streams, a global publish-subscribe event streaming system for big data. Mansi will discuss its architecture and how it lets you deliver your data globally and reliably.

Posted on December 9, 2015 by Carol McDonald

In this post, we will give a high-level overview of the components of MapR Streams. Then, we will follow the life of a message from a producer to a consumer, with an oil rig use case as an example.

Posted on December 8, 2015 by M.C. Srivas

In this week's Whiteboard Walkthrough, M.C. Srivas, MapR Co-Founder, walks you through the MapR Converged Data Platform, which has been in the making for the last six years and is now finally complete with MapR Streams.

Posted on November 23, 2015 by Jim Scott

Apache Spark is awesome. Python is awesome. This post will show you how to use your favorite programming language to process large datasets quickly.

Posted on November 19, 2015 by Paul Curtis

Apache Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with Spark SQL, Scala, Hive, Flink, Kylin and more. Zeppelin enables rapid development of Spark and Hadoop workflows with simple, easy visualizations.

Posted on November 4, 2015 by Mitsutoshi Kiuchi

SQL engines for Hadoop differ in their approach and functionality. My focus for this blog post is to compare and contrast the functions and performance of Apache Spark and Apache Drill and discuss their expected use cases.

Posted on November 2, 2015 by Carol McDonald

This blog post is the first in a series that discusses some design patterns from the book MapReduce Design Patterns and shows how these patterns can be implemented in Apache Spark(R).

Posted on October 27, 2015 by Radek Ostrowski

I first heard of Spark in late 2013 when I became interested in Scala, the language in which Spark is written. Some time later, I did a fun data science project trying to predict survival on the Titanic. This turned out to be a great way to get further introduced to Spark concepts and programming. I highly recommend it for any aspiring Spark developers looking for a place to get started.

Posted on October 7, 2015 by Henry Saputra

Apache Flink is a top-level Apache project that allows unifying distributed stream and batch processing. In the core of Apache Flink is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.

Posted on September 30, 2015 by Jim Scott

Spark has a very low entry barrier to get started, which eases the burden of learning a new toolset. It is straightforward to download Spark and configure it in standalone mode on a laptop or server for learning and exploration.

Posted on September 11, 2015 by Hao Zhu

In this blog post, I will explain the resource allocation configurations for Spark on YARN, describe the yarn-client and yarn-cluster modes, and include examples. Spark can request two resources in YARN: CPU and memory.

Posted on September 4, 2015 by Carol McDonald

This post will help you get started using Apache Spark Streaming with HBase on the MapR Sandbox. Spark Streaming is an extension of the core Spark API that enables continuous data stream processing.

Posted on August 17, 2015 by Ted Dunning

Ted Dunning, Chief Applications Architect for MapR, talks about some newer streaming algorithms such as t-digest and streaming k-means.

Posted on August 5, 2015 by Joseph Blue

Did Harper Lee write To Kill a Mockingbird? For many years, conspiracy buffs supported the urban legend that Truman Capote, Lee’s close friend with considerably more literary creds, might have ghost-authored the novel. The author’s reticence on that subject (as well as every other subject) fueled the rumors and it became another urban legend.

Posted on August 3, 2015 by Carol McDonald

Recommendation systems help narrow your choices to those that best meet your particular needs, and they are among the most popular applications of big data processing. In this post we are going to discuss building a recommendation model from movie ratings. We’ll be using an iterative algorithm and parallel processing with Apache Spark MLlib.

Posted on June 30, 2015 by Carol McDonald

This post will help you get started using the Apache Spark Web UI to understand how your Spark application is executing on a Hadoop cluster.

Posted on June 24, 2015 by Carol McDonald

This post will help you get started using Apache Spark DataFrames with Scala on the MapR Sandbox.

Posted on June 17, 2015 by Anoop Dawar

In this week's Whiteboard Walkthrough, Anoop Dawar, Senior Product Director at MapR, shows you the basics of Apache Spark and how it is different from MapReduce.

Posted on June 5, 2015 by Nitin Bandugula

You already know Hadoop as one of the best, cost-effective platforms for deploying large-scale big data applications. But Hadoop is even more powerful when combined with execution capabilities provided by Apache Spark. Although Spark can be used with a number of big data platforms, with the right Hadoop distribution, you can build big data applications quickly using tools you already know.

Posted on May 27, 2015 by Nick Amato

In this demo we are using Spark and PySpark to process and analyze the data set, calculate aggregate statistics about the user base in a PySpark script, persist all of that back into MapR-DB for use in Spark and Tableau, and finally use MLlib to build logistic regression models.

Posted on May 12, 2015 by Nick Amato

In this post, I’ll give an example of how we can make predictions that enable us to maximize revenue and ensure the best customer experience. We'll do this using the output of the Spark code from our last adventure.

Posted on May 11, 2015 by Nick Amato

In this post, I’ll show you how to build a simple real-time dashboard using Spark on MapR.

Posted on May 6, 2015 by Joseph Blue

Building a good classification model requires leveraging the predictive power from your data and that’s a challenge whether you’re looking at four thousand records or four billion; in machine learning parlance, this step is referred to as “feature extraction”.

Posted on February 26, 2015 by James Casaletto

In this blog post, we introduce the concept of using non-Java programs or streaming for MapReduce jobs. MapReduce’s streaming feature allows programmers to use languages other than Java such as Perl or Python to write their MapReduce programs. You can use streaming for either rapid prototyping using sed/awk, or for full-blown MapReduce deployments. Note that the streaming feature does not include C++ programs – these are supported through a similar feature called pipes.
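As a rough sketch of what a non-Java streaming job can look like, here is a word count written in Python. Streaming programs read records from standard input and write tab-separated key/value lines to standard output, and the reducer receives its input sorted by key. The function names (`map_lines`, `reduce_lines`, `run`) and the `map`/`reduce` command-line argument are assumptions made for this example, not part of any Hadoop API.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts per word. Input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run(role, stdin=sys.stdin, stdout=sys.stdout):
    """Stream records from stdin to stdout through the chosen step."""
    step = map_lines if role == "map" else reduce_lines
    for record in step(stdin):
        stdout.write(record + "\n")


# Hypothetical entry point: run the file with 'map' or 'reduce' as its only argument.
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    run(sys.argv[1])
```

Saved as a script, the same file could be passed to the streaming job's `-mapper` and `-reducer` options with the matching argument. Because it only uses standard input and output, it can also be tried locally by piping a text file through the map step, `sort`, and then the reduce step.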

Posted on January 19, 2015 by Michael Hausenblas

In this series of blog posts on the Internet of Things (IoT), we've initially established why IoT naturally lends itself to big data, reviewed the current IoT landscape and had a look at some IoT use cases (smart cities, smart phones, and smart homes). In this post, we'll discuss requirements for an IoT data processing platform as well as introduce a high-level architecture that is able to meet the requirements.

Posted on December 16, 2014 by Na Yang

Nearly one year ago, the Hadoop community began to embrace Apache Spark as a powerful batch processing engine. Today, many organizations and projects are augmenting their Hadoop capabilities with Spark. As part of this trend, the Apache Hive community is working to add Spark as an execution engine for Hive. The Hive-on-Spark work is being tracked by HIVE-7292, which is one of the most popular JIRAs in the Hadoop ecosystem. Furthermore, three weeks ago, the Hive-on-Spark team offered the first demo of Hive on Spark.

Posted on December 1, 2014 by Nitin Bandugula

The November release of the Apache open source packages in MapR was made available for customers earlier this month. We are excited to deliver some major upgrades to existing packages.

Here are the highlights:

Posted on October 30, 2014 by Abhinav Chawade

Hi, welcome to MapR Whiteboard Walkthrough sessions. My name is Abhinav and I'm one of the data engineers here at MapR, and the purpose of this video is to go through the comparison of Storm Trident and Spark Streaming. As you may be aware, Storm and Spark are very popular projects within the community. Storm is a stream processor that came out from Twitter in 2009, and Spark is a general purpose in-memory processing framework, both of which offer stream processing solutions.

Posted on September 29, 2014 by Nitin Bandugula

The capability to process live data streams enables businesses to make real-time, data-driven decisions. The decisions could be based on simple data aggregation rules or even complex business logic. The engines that support these decision models have to be fast, scalable and reliable and Hadoop, with its rapidly growing ecosystem, is fast emerging as the data platform that supports such real-time stream processing engines.

Posted on September 5, 2014 by Pat Farrel

Combining a search engine with Mahout has created a recommender that is extremely fast and scalable and seamlessly blends results using collaborative filtering data and metadata. In the first post, we described creating a co-occurrence indicator matrix for a recommender. In this follow-up post, we dive deeper into the performance and quality of the recommendations.

Posted on July 18, 2014 by Nitin Bandugula

M.C. Srivas, CTO and Co-Founder of MapR Technologies, spoke recently at Spark Summit 2014 on “Why Spark on Hadoop Matters.” Spark, with an in-memory processing framework, provides a complementary full stack on Hadoop, and this integration is showing tremendous promise for MapR customers.

Posted on June 13, 2014 by Nitin Bandugula

Large clusters that store enterprise big data for the long run, while exposing that data to a variety of workloads at the same time, are turning out to be the preferred deployment option for Hadoop. This model makes it easy for businesses to avoid data silos and progressively build a full suite of big data applications over time.  

Posted on May 6, 2014 by Michele Nemschoff

Apache Spark is currently one of the most active projects in the Hadoop ecosystem, and there’s been plenty of hype about it in the past several months. In the latest webinar from the Data Science Central webinar series, titled “Let Spark Fly: Advantages and Use Cases for Spark on Hadoop,” we cut through the noise to uncover practical advantages for having the full set of Spark technologies at your disposal.

Posted on April 10, 2014 by Tomer Shiran

With over 500 paying customers, my team and I have the opportunity to talk to many organizations that are leveraging Hadoop in production to extract value from big data. One of the most common topics raised by our customers in recent months is Apache Spark. Some customers just want to learn more about the advantages of this technology and the use cases that it addresses, while others are already running it in production with the MapR Distribution.
