MapR Platform Blog Posts

Posted on October 24, 2016 by Sameer Nori

Offloading cold or unused data and ETL workloads from a data warehouse to Hadoop/big data platforms is a very common starting point for enterprises beginning their big data journey. Platforms like Hadoop provide an economical way to store data and do bulk processing of large data sets; hence, it’s not surprising that cost is the primary driver for this initial use case.

Posted on October 19, 2016 by Tugdual Grall

Druid is a high-performance, column-oriented, distributed data store. Druid supports streaming data ingestion and offers insights on events immediately after they occur. Druid can ingest data from multiple data sources, including Apache Kafka.

Posted on October 18, 2016 by James Sun

If you’ve been keeping tabs on all the great product enhancements that have been coming out of MapR, you will know that the 5.2 version of the MapR Converged Data Platform went GA this summer. It takes a few cycles to make the platform available on the AWS marketplace, largely due to the testing efforts required.

Posted on October 13, 2016 by Tugdual Grall

This article will guide you through the steps for using Apache Flink with MapR Streams. MapR Streams is a distributed messaging system for streaming event data at scale, and it’s integrated into the MapR Converged Data Platform, based on the Apache Kafka API (0.9.0).

Posted on October 6, 2016 by Nick Amato

In this week's Whiteboard Walkthrough, Nick Amato, Director of Technical Marketing at MapR, explains the advantages of a publish-subscribe model for real-time data streams.

Posted on October 5, 2016 by Ted Dunning

In this Whiteboard Walkthrough, MapR’s Chief Application Architect, Ted Dunning, explains the move from state to flow and shows how it works in a financial services example. Ted describes the revolution underway in moving from a traditional system with multiple programs built around a shared database to a new flow-based system that instead uses a shared state queue in the form of a message stream built with technology such as Apache Kafka or MapR Streams. This new architecture lets decisions be made locally and supports a micro-services style approach.

Posted on September 22, 2016 by Raphaël Velfre

A very common use case for the MapR Converged Data Platform is collecting and analyzing data from a variety of sources, including traditional relational databases. Until recently, data engineers would build an ETL pipeline that periodically walks the relational database and loads the data into files on the MapR cluster, then perform batch analytics on that data.

Posted on September 14, 2016 by Neeraja Rentachintala

Today we are excited to announce the availability of Drill 1.8 on the MapR Converged Data Platform. As part of the Apache Drill community, we continue to deliver iterative releases of Drill, providing significant feature enhancements along with enterprise readiness improvements based on feedback from a variety of customer deployments.

Posted on September 12, 2016 by Vikash Selvin

Elasticsearch and Kibana are widely used in the market today for data analytics; however, security is one aspect that was not initially built in to the product. Since data is the lifeline of any organization today, it becomes essential that Elasticsearch and Kibana be “secured.” In this blog post, we will be looking at one of the ways in which authentication, authorization, and encryption can be implemented for them.

Posted on September 6, 2016 by Carol McDonald

This post will help you get started using Apache Spark Streaming for consuming and publishing messages with MapR Streams and the Kafka API. Spark Streaming is an extension of the core Spark API that enables continuous data stream processing.

Posted on August 31, 2016 by Mitesh Shah

Mitesh Shah, Senior Product Manager for Security and Data Governance at MapR, describes an important concept in big data security: vulnerability management, a key layer in the trust model. He explains what is provided by the MapR Converged Data Platform and what the role is of the customer in maintaining a secure environment.

Posted on August 30, 2016 by Dong Meng

Apache PredictionIO is an open source machine learning server. In this article, we integrate Apache PredictionIO with the MapR Converged Data Platform 5.1 as a backend. Specifically, we use MapR-DB (1.1.1) for event data storage, Elasticsearch for metadata storage, and MapR-FS for model data storage.

Posted on August 29, 2016 by Carol McDonald

Building a robust, responsive, secure data service for healthcare is tricky. For starters, healthcare data lends itself to multiple models: Document representation for patient profile views or updates; Graph representation to query relationships between patients, providers, and medications; Search representation for advanced lookups. This post will describe how stream-first architectures can solve these challenges, and look at how this has been implemented at Liaison Technologies.

Posted on August 17, 2016 by Vinay Bhat

In this week’s Whiteboard Walkthrough, Vinay Bhat, Solution Architect at MapR Technologies, takes you step-by-step through a widespread big data use case: data warehouse offload and building an interactive analytics application using Apache Spark and Apache Drill. Vinay explains how the MapR Converged Data Platform provides unique capabilities to make this process easy and efficient, including support for multi-tenancy.

Posted on August 15, 2016 by Justin Brandenburg

PySpark is a Spark API that allows you to interact with Spark through the Python shell. If you have a Python programming background, this is an excellent way to get introduced to Spark data types and parallel programming. PySpark is a particularly flexible tool for exploratory big data analysis because it integrates with the rest of the Python data analysis ecosystem, including pandas (DataFrames), NumPy (arrays) and Matplotlib (visualization).

Posted on August 4, 2016 by Dong Meng

In the big data enterprise ecosystem, there are always new choices when it comes to analytics and data science. Apache incubates so many projects that people are always confused as to how to go about choosing an appropriate ecosystem project. In the data science pipeline, ad-hoc query is an important aspect, which gives users the ability to run different queries that will lead to exploratory statistics that will help them understand their data.

Posted on July 21, 2016 by Stephan Ewen

In this week’s Whiteboard Walkthrough, Stephan Ewen, PMC member of Apache Flink and CTO of data Artisans, explains how to use savepoints, a unique feature in Apache Flink stream processing, to let you reprocess data, do bug fixes, deal with upgrades, and do A/B testing.

Posted on July 15, 2016 by Ryan Victory

MapR Streams and MapR-DB are both very exciting developments in the MapR Converged Data Platform. In this blog post, I’m going to show you how to get Ruby code to natively interact with MapR-DB and MapR Streams.

Posted on July 14, 2016 by Nick Amato

Sooner or later, if you eyeball enough data sets, you will encounter some that look like a graph, or are best represented as a graph. Whether it's social media, computer networks, or interactions between machines, graph representations are often a straightforward choice for representing relationships among one or more entities.

Posted on July 11, 2016 by Prashant Rathi

In this week’s Whiteboard Walkthrough, Prashant Rathi, Senior Product Manager at MapR, describes the architecture for fine-grained monitoring in the MapR Converged Data Platform, from collection of data from a variety of sources through storage and visualization, as part of the Spyglass Initiative.

Posted on July 8, 2016 by Craig Warman

In this blog post, I’ll describe how to install Apache Drill on the MapR Sandbox for Hadoop, resulting in a "super" sandbox environment that essentially provides the best of both worlds—a fully-functional, single-node MapR/Hadoop/Spark deployment with Apache Drill.

Posted on June 28, 2016 by Martijn Kieboom

This post describes, step by step, how to deploy Mesos, Marathon, Docker, and Spark on a MapR cluster and how to run various jobs as well as Docker containers using this deployment.

Posted on June 27, 2016 by Vince Gonzalez

MapR-FS provides some very useful capabilities for data management and access control. These features can and should be applied to user home directories. A user in a MapR cluster has a lot of capability at their fingertips. They can create files, two styles of NoSQL tables, and pub/sub messaging streams with many thousands of topics.

Posted on June 23, 2016 by Terry He

In this week's Whiteboard Walkthrough, Terry He, Software Engineer at MapR, walks you through how to tune MapR Streams running an application with Apache Flink for optimal performance.

Posted on June 13, 2016 by Ellen Friedman

The power of SQL for business analytics is a given, but the challenge in big data settings is that SQL is normally a static language that assumes pre-defined, fixed and well-known schema. SQL also needs flat data structures. It has been assumed that you need fixed schema for performance.

Posted on June 1, 2016 by Dale Kim

In this week's Whiteboard Walkthrough, Dale Kim, Director of Industry Solutions at MapR, explains the architectural differences between MapR-FS, or the MapR File System, and the Hadoop Distributed File System (HDFS).

Posted on May 27, 2016 by Jimit Shah

The ability to store and retrieve JSON documents using the OJAI standard has introduced a very powerful way to work with data in your MapR cluster.

Posted on May 4, 2016 by Nick Amato

If you’ve had a chance to work with Hadoop or Spark a little, you probably already know that HDFS doesn't support full random read-writes or many other capabilities typically required in a production-ready file system.

Posted on May 3, 2016 by Carol McDonald

In this post we are going to discuss building a real time solution for credit card fraud detection.

Posted on April 29, 2016 by Mathieu Dumoulin

There are many options for monitoring the performance and health of a MapR cluster. In this post, I will present the lesser-known method for monitoring the CLDB using the Java Management Extensions (JMX).

Posted on April 27, 2016 by Mathieu Dumoulin

We have experimented on a 5-node MapR 5.1 cluster running Spark 1.5.2 and will share our experience, difficulties, and solutions in this blog post.

Posted on April 22, 2016 by Carol McDonald

This post will show how to integrate Apache Spark Streaming, MapR-DB, and MapR Streams for fast, event-driven applications.

Posted on April 21, 2016 by Leon Clayton

In this article, we will explore what it means to have a converged data platform for building and delivering business applications. The sample application creates blog articles for a personal website.

Posted on April 6, 2016 by Neeraja Rentachintala

Today we are very excited to announce the release of Apache Drill 1.6 on the MapR Converged Data Platform. Drill has been on the path of rapid iterative releases for one and a half years now, gathering amazing traction with customers and OSS community users on the way.

Posted on March 16, 2016 by Mitesh Shah

In this week's Whiteboard Walkthrough, Mitesh Shah, Product Management at MapR, describes how you can track administrative operations and data accesses in your MapR Converged Data Platform in a seamless and efficient way with the built-in auditing feature.

Posted on March 10, 2016 by Tugdual Grall

MapR Streams is a new distributed messaging system for streaming event data at scale, and it’s integrated into the MapR converged platform. MapR Streams uses the Apache Kafka API, so if you’re already familiar with Kafka, you’ll find it particularly easy to get started with MapR Streams.

Posted on March 9, 2016 by Mitesh Shah

In this week's Whiteboard Walkthrough, Mitesh Shah, Product Management at MapR, describes how you can make sure you aren’t opening more access permissions to your sensitive data in Hadoop than you intended, using File Access Control Expressions in MapR.

Posted on March 8, 2016 by Anoop Dawar

In 2015, MapR shipped three significant core releases: 4.0.2 in January, 4.1 in April, and 5.0 (together with the GA version of Apache Drill) in July. While all this was happening, many of my colleagues in engineering (who’ve demonstrated a whole new level of ingenuity and multitasking) were also working on one of the biggest releases in the history of MapR: the converged data platform release (AKA MapR 5.1).

Posted on February 25, 2016 by Harish Thakkallapally

In this article, I am going to show you how to use the impersonation feature to create and access a file in MapR-FS. In this example, we will run a Java program as the “mapr” superuser that will run operations on behalf of the “user01” user.

Posted on February 17, 2016 by Sameer Nori

Spark 1.6 is now in Developer Preview on the MapR Converged Data Platform. In this blog post, I’ll share a few details on what Spark 1.6 brings to the table and what you should care about.

Posted on February 8, 2016 by Nelson Estrada

This is the second post in a series about the MapR Command Line Interface. The first post gave you an idea of how to use the command line to review your cluster nodes, what services were running, and how to manage them.

Posted on February 2, 2016 by Will Ochandarena

Two blogs came out recently that share some very interesting perspectives on the blurring lines between architectures and implementation of different data services, ranging from file systems to databases to publish/subscribe streaming services.

Posted on January 26, 2016 by Jim Scott

Are you ready to start streaming all the events in your business? What happens to your streaming solution when you outgrow your single data center? What happens when you are at a company that is already running multiple data centers and you need to implement streaming across data centers?

Posted on January 25, 2016 by Ranjit Lingaiah

In the wide column data model of MapR-DB, all rows are stored by a row key, column family, column qualifier, value, and timestamps. In the current version, the row key is the only field that is indexed, which fits the common pattern of queries based on the row key.
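The consequence of indexing only the row key can be sketched with a toy in-memory table. This is an illustration under stated assumptions, not the MapR-DB API: the class `WideColumnTable` and its methods `put`, `get`, and `scan_by_value` are hypothetical names invented for this sketch.

```python
import time
from collections import defaultdict


class WideColumnTable:
    """Toy wide-column table (hypothetical; not the MapR-DB API)."""

    def __init__(self):
        # row key -> {(column family, column qualifier): [(timestamp, value), ...]}
        self._rows = defaultdict(dict)

    def put(self, row_key, family, qualifier, value, timestamp=None):
        """Store one timestamped version of a cell."""
        ts = time.time() if timestamp is None else timestamp
        cell = self._rows[row_key].setdefault((family, qualifier), [])
        cell.append((ts, value))

    def get(self, row_key, family, qualifier):
        """Lookup by row key (the indexed field); returns the newest value or None."""
        versions = self._rows.get(row_key, {}).get((family, qualifier))
        if not versions:
            return None
        return max(versions, key=lambda version: version[0])[1]

    def scan_by_value(self, family, qualifier, value):
        """Values are not indexed, so every row has to be visited."""
        return sorted(
            row_key
            for row_key in self._rows
            if self.get(row_key, family, qualifier) == value
        )


if __name__ == "__main__":
    table = WideColumnTable()
    table.put("user1", "info", "city", "Paris", timestamp=1)
    table.put("user1", "info", "city", "Lyon", timestamp=2)
    table.put("user2", "info", "city", "Lyon", timestamp=1)
    print(table.get("user1", "info", "city"))
    print(table.scan_by_value("info", "city", "Lyon"))
```

A query by row key touches a single entry, while a query on any other field degrades to a full scan, which is why access patterns keyed on the row key fit this model best.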

Posted on January 7, 2016 by Dong Meng

XGBoost is a library that is designed for boosted (tree) algorithms. It has become a popular machine learning framework among data science practitioners, especially on Kaggle, which is a platform for data prediction competitions where researchers post their data and statisticians and data miners compete to produce the best models.

Posted on December 17, 2015 by Will Ochandarena

In this week's Whiteboard Walkthrough, Will Ochandarena, Director of Product Management at MapR, explains how we are able to build the MapR Streams capabilities that differentiate us from similar products in the market.

Posted on December 10, 2015 by Mansi Shah

In this week's Whiteboard Walkthrough, Mansi Shah, Senior Staff Engineer at MapR, talks about MapR Streams, a global publish-subscribe event streaming system for big data. Mansi will discuss its architecture and how it lets you deliver your data globally and reliably.

Posted on December 9, 2015 by Carol McDonald

In this post, we will give a high-level overview of the components of MapR Streams. Then, we will follow the life of a message from a producer to a consumer, with an oil rig use case as an example.

Posted on December 8, 2015 by M.C. Srivas

In this week's Whiteboard Walkthrough, MC Srivas, MapR Co-Founder, walks you through the MapR Converged Data Platform that has been in the making for the last 6 years and is now finally complete with MapR Streams.

Posted on November 11, 2015 by Bruce Penn

In this week's Whiteboard Walkthrough, Bruce Penn, Sales Engineer at MapR, explains how NFS access works on MapR-FS vs. HDFS, and helps you decide what you would use in production.

Posted on November 10, 2015 by Nick Amato

This blog describes how to get an instance of the MapR-DB Document Database Developer Preview image running on Amazon AWS using one of the pre-configured AMI images supplied by MapR. With this AMI, you can start writing JSON-based applications on MapR-DB using the open source Open JSON Application Interface, or OJAI.

Posted on November 3, 2015 by Nick Amato

Handling large JSON-based data sets in Hadoop or Spark can be a project unto itself. Endless hours toiling away into obscurity with complicated transformations, extractions, handling the nuances of database connectors, and flattening ‘till the cows come home is the name of the game.

Posted on October 23, 2015 by Aditya Kishore

In this week's Whiteboard Walkthrough, Aditya Kishore, engineer on the MapR-DB team, explains how to use the OJAI API to insert, search, and update the document database.

Posted on October 16, 2015 by Martijn Kieboom

To start with, the audit log files generated by MapR Auditing include the type of action performed on the filesystem or MapR-DB table, the date and time of the action, and the specific details of the file or table involved in the activity.

Posted on October 14, 2015 by Bharat Baddepudi

In this week's Whiteboard Walkthrough, Bharat Baddepudi, engineer on the MapR-DB team, explains how documents in MapR-DB are inserted and updated.

Posted on October 13, 2015 by Martijn Kieboom

With MapR version 5.0 being released recently, MapR customers got yet another powerful feature at no additional licensing costs: Auditing! In this two-part blog post, I’ll describe various use cases for auditing as well as instructions for how to deploy these cases in your MapR environment.

Posted on September 29, 2015 by Bharat Baddepudi

MapR developed OJAI (the Open JSON Application Interface), which provides native integration of JSON-like document processing in Hadoop-style scale-out clusters.

Posted on September 29, 2015 by M.C. Srivas

In this week's Whiteboard Walkthrough, MC Srivas, MapR CTO and Co-Founder, explains the innovation and vision behind MapR-DB and how project Kudu stacks up to the MapR Data Platform.

Posted on September 25, 2015 by Anurag Choudhary

In this week's Whiteboard Walkthrough, Anurag Choudhary, Engineer on the MapR-DB team, explains how horizontal scaling in MapR-DB works and how hot spotting is automatically avoided.

Posted on September 1, 2015 by Mitra Kaseebhotla

Here at MapR, developer productivity is critical to us. In order to keep our pace of innovation high and give customers more choice and flexibility in Apache Hadoop and other open source projects we ship with the MapR Distribution for Hadoop, we apply DevOps methodologies as widely as we can. One critical piece of this is ensuring we can rapidly test our builds to ensure quality in the codebase. Automation is key here, which is what allows us to integrate all the latest innovations across multiple releases from the community in our Hadoop distribution.

Posted on July 22, 2015 by Abizer Adenwala

As a follow-up to my previous post on MapR-DB, I want to describe how to index MapR-DB table data in near real-time into Elasticsearch on Amazon Web Services (AWS) Elastic Compute Cloud (EC2).

Posted on July 21, 2015 by Abizer Adenwala

One of the common challenges of deploying a search engine is keeping the search indexes synchronized with the source data.

Posted on July 17, 2015 by Smidth Panchamia

Native languages like C/C++ provide tighter control over the memory and performance characteristics of an application. A well-written C++ program that has intimate knowledge of the memory access patterns and the architecture of the machine will run several times faster than Java. For these reasons, a lot of enterprise customers with massive scalability and performance requirements tend to use C/C++ rather than Java in their server applications.

Posted on July 15, 2015 by Ted Dunning

In this week's Whiteboard Walkthrough, Ted Dunning, Chief Application Architect at MapR, talks about the architectural differences between HDFS and MapR-FS that boil down to three numbers.

Posted on July 6, 2015 by Jim Scott

In part one of this series, Drilling into Healthy Choices, we explored using Drill to create Parquet tables as well as configuring Drill to read data formats that are not very standard. In part two of this series, we are going to utilize this same database to think beyond traditional database design.

Posted on June 26, 2015 by Carol McDonald

Apache HBase is a database that runs on a Hadoop cluster. HBase is not a traditional RDBMS, as it relaxes the ACID (Atomicity, Consistency, Isolation, and Durability) properties of traditional RDBMS systems in order to achieve much greater scalability. Data stored in HBase also does not need to fit into a rigid schema like with an RDBMS, making it ideal for storing unstructured or semi-structured data.

Posted on May 11, 2015 by Nick Amato

In this post, I’ll show you how to build a simple real-time dashboard using Spark on MapR.

Posted on April 27, 2015 by Ted Dunning

There is some real math behind the idea that you need 3x replication in Hadoop. The basic idea is that when a disk goes bad, you lose an entire stripe of storage.

Posted on February 10, 2015 by James Casaletto

In this post, we will discuss how to use the MapR Control System (MCS) to monitor MRv1 jobs. We will also see how to manage and display jobs, history, and logs using the command line interface.

Posted on January 5, 2015 by Ted Dunning

A number of people have been claiming lately that interactive responses to queries constitute real-time processing. For instance, Mike Olson has been quoted saying that interactive queries are what is needed for real-time processing. I like to start with something more like the Wikipedia definition of real-time computing instead. Wikipedia defines real time as a response before a deadline. A relaxed form of this is stream processing, where response is per record ASAP, but with no clear latency deadline.

Posted on December 17, 2014 by Abizer Adenwala

In this week's Whiteboard Walkthrough, Abizer Adenwala, Technical Support Engineer at MapR, walks you through what a storage pool is, why disks are striped, why a disk would be marked as failed, what happens when a disk is marked failed, what to watch out for before reformatting or re-adding a disk, and the best path to recover from disk failure.

Posted on December 10, 2014 by James Casaletto

In this week's Whiteboard Walkthrough, James Casaletto walks you through how to configure the network for the MapR Hadoop Sandbox. Whether you use VirtualBox, VMware Fusion, VMware Player, or pretty much any hypervisor on your laptop to support your MapR Sandbox, you'll need to configure the network. There are essentially three different settings that you can use to configure the network for your Sandbox. One is NAT, one is host-only, and one is bridged.

Posted on September 29, 2014 by Nitin Bandugula

The capability to process live data streams enables businesses to make real-time, data-driven decisions. The decisions could be based on simple data aggregation rules or even complex business logic. The engines that support these decision models have to be fast, scalable, and reliable, and Hadoop, with its rapidly growing ecosystem, is fast emerging as the data platform that supports such real-time stream processing engines.

Posted on September 12, 2014 by Jim Scott

Why do this? There are many use cases for time series data, and they usually require handling a decent data ingest rate. Rates of more than 10,000 points per second are common and rates of 1 million points per second are not quite as common, but not outrageously high either.

Posted on September 8, 2014 by Kyle Porter

The MapR Distribution including Apache™ Hadoop® employs drivers from Simba Technologies to connect to client ODBC and JDBC applications, allowing you to access data on MapR from tools like Tableau with ODBC or SQuirreL with JDBC. This post will walk you through the steps to set up and connect your Apache Hive instance to both an ODBC and a JDBC application running on your laptop or other client machine. Although you may already have your own Hive cluster set up, this post focuses on the MapR Sandbox for Hadoop virtual machine (VM).

Posted on August 11, 2014 by Bruce Penn

A core-differentiating component of the MapR Distribution including Apache™ Hadoop® is the MapR File System, also known as MapR-FS. MapR-FS was architected from its very inception to enable truly enterprise-grade Hadoop by providing significantly better performance, reliability, efficiency, maintainability, and ease of use compared to the default Hadoop Distributed File System (HDFS).

Posted on June 18, 2014 by Karen Whipple

NFS is the Network File System. It has been part of Linux and the broader Unix ecosystem for decades and has long been used both in enterprise environments to share files and in customized environments like high performance computing.

Posted on June 2, 2014 by Ellen Friedman

The second publication in the O’Reilly Practical Machine Learning series, subtitled A New Look at Anomaly Detection by Ted Dunning and me, is being released this week. In the previous book, which focused on practical approaches to recommendation, we started with the idea that everyone thinks “I want a pony”. Here in the second book, what we want is to find the outlier, the zebra in a herd of ponies, the fish swimming against the school of fish, the rare event.

Posted on February 12, 2014 by Anoop Dawar

MapR has always been in the business of making Hadoop easier, and we believe we've made another big step forward with the MapR Sandbox for Hadoop. The MapR Sandbox gives developers and administrators the fastest and easiest way to get up to speed on Hadoop.

Posted on May 13, 2013 by Jim Fiori


Hadoop provides a compelling distributed platform for processing massive amounts of data in parallel using the Map/Reduce framework and the Hadoop distributed file system. A Java API allows developers to express the processing in terms of a map phase and a reduce phase, where both phases use key/value pairs or key/value list pairs as input/output.
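The flow of key/value pairs through the two phases can be sketched with a word count that runs locally in plain Python. This is not the Hadoop Java API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are assumptions made for illustration, and the grouping step that Hadoop performs between the phases is simulated in memory.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit one (word, 1) key/value pair per word in the input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, producing key/value-list pairs for the reducers."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the value list for one key into a single output pair."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, a framework is free to run many of them in parallel on different nodes; only the grouping step needs to bring data for the same key together.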

Posted on March 14, 2013 by Jim Fiori


Running MapReduce jobs on ingested data is traditionally batch-oriented: the data must be first transferred to a local file system accessible to the Hadoop cluster, then copied into HDFS with Flume or the “hadoop fs” command. Only once the transfers are complete can MapReduce be run on the ingested files.

Posted on February 22, 2013 by Jim Fiori


Profiling Java applications can be accomplished with many tools, such as the built-in HPROF JVM native agent library for profiling heap and CPU usage. In the world of Hadoop and MapReduce, there are a number of properties you can set to enable profiling of your mapper and reducer code.

With MapR’s enterprise-grade distribution of Hadoop, there are 3 unique features that make this task of profiling MapReduce code easier. They are:
