A lot of people choose MapR as their core platform for processing and storing big data because of its advantages in speed and performance. MapR consistently performs faster than any other big data platform for all kinds of applications, including Hadoop, distributed file I/O, NoSQL data storage, and data streaming. In this post, I’m focusing on data streaming to provide some perspective on how much better/faster/cheaper MapR Streams can be compared to Apache Kafka as a data streaming technology.
Streaming Blog Posts
There has been a lot of research in document image processing over the past 20 years, but not much research has been done in terms of parallel processing. Some of the solutions proposed for parallel processing have been to create threads of execution for each image, or to use GNU Parallel.
At MapR, we have set up multiple clusters for several of our enterprise customers, and we have brought that knowledge and best practices to MapR Installer. Increasingly, these deployments have not only grown in number, but have also evolved based on the type, purpose, and lifetime for these clusters.
One of the challenges when working with streams is the transitory nature of their data. Many applications require data to be persisted far beyond the point at which said data has any practical value to streaming analytics.
In this week's Whiteboard Walkthrough, Jorge Geronimo, Solutions Architect at MapR, explains how, with a single line of code, you can create a replica of a MapR data stream within the same cluster or on another cluster, even in another part of the world. Jorge also describes multi-master replication for streaming data and how MapR Streams' unique capability for geo-distributed replication with preserved offsets offers advantages for working with streaming data.
Druid is a high-performance, column-oriented, distributed data store. Druid supports streaming data ingestion and offers insights on events immediately after they occur. Druid can ingest data from multiple data sources, including Apache Kafka.
This article will guide you through the steps to use Apache Flink with MapR Streams. MapR Streams is a distributed messaging system for streaming event data at scale. It is integrated into the MapR Converged Data Platform and is based on the Apache Kafka API (0.9.0).
In this week's Whiteboard Walkthrough, Nick Amato, Director Technical Marketing at MapR, explains the advantages of a publish-subscribe model for real-time data streams.
In this Whiteboard Walkthrough, MapR’s Chief Application Architect, Ted Dunning, explains the move from state to flow and shows how it works in a financial services example. Ted describes the revolution underway in moving from a traditional system with multiple programs built around a shared database to a new flow-based system that instead uses a shared state queue in the form of a message stream built with technology such as Apache Kafka or MapR Streams. This new architecture lets decisions be made locally and supports a micro-services style approach.
A very common use case for the MapR Converged Data Platform is collecting and analyzing data from a variety of sources, including traditional relational databases. Until recently, data engineers would build an ETL pipeline that periodically walks the relational database and loads the data into files on the MapR cluster, then perform batch analytics on that data.
In this week’s Whiteboard Walkthrough, Stephan Ewen, PMC member of Apache Flink and CTO of data Artisans, describes a valuable capability of Apache Flink stream processing: grouping events together that were observed to occur within a configurable window of time, the event time.
This post will help you get started using Apache Spark Streaming for consuming and publishing messages with MapR Streams and the Kafka API. Spark Streaming is an extension of the core Spark API that enables continuous data stream processing.
Building a robust, responsive, secure data service for healthcare is tricky. For starters, healthcare data lends itself to multiple models: Document representation for patient profile views or updates; Graph representation to query relationships between patients, providers, and medications; Search representation for advanced lookups. This post will describe how stream-first architectures can solve these challenges, and look at how this has been implemented at Liaison Technologies.
In this week’s Whiteboard Walkthrough, Stephan Ewen, PMC member of Apache Flink and CTO of data Artisans, explains how to use savepoints, a unique feature in Apache Flink stream processing, to let you reprocess data, do bug fixes, deal with upgrades, and do A/B testing.
This post will use Apache Spark SQL and DataFrames to query, compare and explore S&P 500, Exxon and Anadarko Petroleum Corporation stock prices.
In this week's Whiteboard Walkthrough, Terry He, Software Engineer at MapR, walks you through how to tune MapR Streams running an application with Apache Flink for optimal performance.
Streaming data is a hot topic these days, and Apache Spark is an excellent framework for streaming. In this blog post, I'll show you how to integrate custom data sources into Spark.
In this post, we are going to discuss building a real-time solution for credit card fraud detection.
This post will show how to integrate Apache Spark Streaming, MapR-DB, and MapR Streams for fast, event-driven applications.
MapR Streams is a new distributed messaging system for streaming event data at scale, and it’s integrated into the MapR converged platform. MapR Streams uses the Apache Kafka API, so if you’re already familiar with Kafka, you’ll find it particularly easy to get started with MapR Streams.
In 2015, MapR shipped three significant core releases: 4.0.2 in January, 4.1 in April, and 5.0, along with the GA version of Apache Drill, in July. While all this was happening, many of my colleagues in engineering (who’ve demonstrated a whole new level of ingenuity and multitasking) were also working on one of the biggest releases in the history of MapR—the converged data platform release (AKA, MapR 5.1).
The distributed computation world has seen a massive shift in the last decade. Apache Hadoop showed up on the scene and brought with it new ways to handle distributed computation at scale. It wasn’t the easiest to work with, and the APIs were far from perfect, but they worked.
Moving a data analysis platform from a “submit the job and wait” model to a “make things happen in real-time” one isn’t easy. If it were, the world wouldn’t spend so much time talking about it.
Streaming data is of growing interest to many organizations, and most applications need to use a producer-consumer model to ingest and process data in real time. Many messaging solutions exist on the market today, but few of them have been built to handle the challenges of modern deployments related to IoT, large web-based applications, and related big data projects.
A very large part of today’s data processing is done on data that is continuously produced, e.g., data from user activity logs, web logs, machines, sensors, and database transactions. Until now, data streaming technology was lacking in several areas...
Two blogs came out recently that share some very interesting perspectives on the blurring lines between architectures and implementation of different data services, ranging from file systems to databases to publish/subscribe streaming services.
Are you ready to start streaming all the events in your business? What happens to your streaming solution when you outgrow your single data center? What happens when you are at a company that is already running multiple data centers and you need to implement streaming across data centers?
We are excited to announce that Spark 1.5.2 is here and is part of the MapR Converged Data Platform. In this blog post, I’ll share a few details on some of the latest capabilities in Spark. If you’re a data engineer, data scientist or in application development, Spark 1.5.2 has new capabilities that you should take advantage of.
In this week's Whiteboard Walkthrough, Will Ochandarena, Director of Product Management at MapR, explains how we are able to build the MapR Streams capabilities that differentiate us from similar products in the market.
In this week's Whiteboard Walkthrough, Mansi Shah, Senior Staff Engineer at MapR, talks about MapR Streams, a global publish-subscribe event streaming system for big data. Mansi will discuss its architecture and how it lets you deliver your data globally and reliably.
In this post, we will give a high-level overview of the components of MapR Streams. Then, we will follow the life of a message from a producer to a consumer, with an oil rig use case as an example.
In this week's Whiteboard Walkthrough, MC Srivas, MapR Co-Founder, walks you through the MapR Converged Data Platform that has been in the making for the last 6 years and is now finally complete with MapR Streams.
Apache Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with Spark SQL, Scala, Hive, Flink, Kylin and more. Zeppelin enables rapid development of Spark and Hadoop workflows with simple, easy visualizations.
SQL engines for Hadoop differ in their approach and functionality. My focus for this blog post is to compare and contrast the functions and performance of Apache Spark and Apache Drill and discuss their expected use cases.
This blog post is the first in a series that discusses some design patterns from the book MapReduce Design Patterns and shows how these patterns can be implemented in Apache Spark(R).
I first heard of Spark in late 2013 when I became interested in Scala, the language in which Spark is written. Some time later, I did a fun data science project trying to predict survival on the Titanic. This turned out to be a great way to get further introduced to Spark concepts and programming. I highly recommend it for any aspiring Spark developers looking for a place to get started.
Apache Flink is a top-level Apache project that allows unifying distributed stream and batch processing. In the core of Apache Flink is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.
Spark has a very low entry barrier to get started, which eases the burden of learning a new toolset. It is straightforward to download Spark and configure it in standalone mode on a laptop or server for learning and exploration.
In this blog post, I will explain the resource allocation configurations for Spark on YARN, describe the yarn-client and yarn-cluster modes, and include examples. Spark can request two resources in YARN: CPU and memory.
Ted Dunning, Chief Applications Architect for MapR, talks about some newer streaming algorithms such as t-digest and streaming k-means.
Did Harper Lee write To Kill a Mockingbird? For many years, conspiracy buffs supported the urban legend that Truman Capote, Lee’s close friend with considerably more literary creds, might have ghost-authored the novel. The author’s reticence on that subject (as well as every other subject) fueled the rumors and it became another urban legend.
Recommendation systems help narrow your choices to those that best meet your particular needs, and they are among the most popular applications of big data processing. In this post we are going to discuss building a recommendation model from movie ratings. We’ll be using an iterative algorithm and parallel processing with Apache Spark MLlib.
This post will help you get started using Apache Spark DataFrames with Scala on the MapR Sandbox.
In this week's Whiteboard Walkthrough, Anoop Dawar, Senior Product Director at MapR, shows you the basics of Apache Spark and how it is different from MapReduce.
You already know Hadoop as one of the best, cost-effective platforms for deploying large-scale big data applications. But Hadoop is even more powerful when combined with execution capabilities provided by Apache Spark. Although Spark can be used with a number of big data platforms, with the right Hadoop distribution, you can build big data applications quickly using tools you already know.
In this demo we are using Spark and PySpark to process and analyze the data set, calculate aggregate statistics about the user base in a PySpark script, persist all of that back into MapR-DB for use in Spark and Tableau, and finally use MLlib to build logistic regression models.
In this post, I’ll show you how to build a simple real-time dashboard using Spark on MapR.
Building a good classification model requires leveraging the predictive power from your data and that’s a challenge whether you’re looking at four thousand records or four billion; in machine learning parlance, this step is referred to as “feature extraction”.
In this blog post, we introduce the concept of using non-Java programs or streaming for MapReduce jobs. MapReduce’s streaming feature allows programmers to use languages other than Java such as Perl or Python to write their MapReduce programs. You can use streaming for either rapid prototyping using sed/awk, or for full-blown MapReduce deployments. Note that the streaming feature does not include C++ programs – these are supported through a similar feature called pipes.
In this series of blog posts on the Internet of Things (IoT), we've initially established why IoT naturally lends itself to big data, reviewed the current IoT landscape and had a look at some IoT use cases (smart cities, smart phones, and smart homes). In this post, we'll discuss requirements for an IoT data processing platform as well as introduce a high-level architecture that is able to meet the requirements.
Nearly one year ago, the Hadoop community began to embrace Apache Spark as a powerful batch processing engine. Today, many organizations and projects are augmenting their Hadoop capabilities with Spark. As part of this trend, the Apache Hive community is working to add Spark as an execution engine for Hive. The Hive-on-Spark work is being tracked by HIVE-7292, which is one of the most popular JIRAs in the Hadoop ecosystem. Furthermore, three weeks ago, the Hive-on-Spark team offered the first demo of Hive on Spark.
The November release of the Apache open source packages in MapR was made available for customers earlier this month. We are excited to deliver some major upgrades to existing packages.
Here are the highlights:
Hi, welcome to MapR Whiteboard Walkthrough sessions. My name is Abhinav and I'm one of the data engineers here at MapR, and the purpose of this video is to go through the comparison of Storm Trident and Spark Streaming. As you may be aware, Storm and Spark are very popular projects within the community. Storm is a stream processor that came out from Twitter in 2009, and Spark is a general purpose in-memory processing framework, both of which offer stream processing solutions.
The capability to process live data streams enables businesses to make real-time, data-driven decisions. The decisions could be based on simple data aggregation rules or even complex business logic. The engines that support these decision models have to be fast, scalable and reliable and Hadoop, with its rapidly growing ecosystem, is fast emerging as the data platform that supports such real-time stream processing engines.
M.C. Srivas, CTO and Co-Founder of MapR Technologies, spoke recently at Spark Summit 2014 on “Why Spark on Hadoop Matters.” Spark, with an in-memory processing framework, provides a complementary full stack on Hadoop, and this integration is showing tremendous promise for MapR customers.
Large clusters that store enterprise big data for the long run, while exposing that data to a variety of workloads at the same time, are turning out to be the preferred deployment option for Hadoop. This model makes it easy for businesses to avoid data silos and progressively build a full suite of big data applications over time.
Apache Spark is currently one of the most active projects in the Hadoop ecosystem, and there’s been plenty of hype about it in the past several months. In the latest webinar from the Data Science Central webinar series, titled “Let Spark Fly: Advantages and Use Cases for Spark on Hadoop,” we cut through the noise to uncover practical advantages for having the full set of Spark technologies at your disposal.