Apache Hadoop Blog Posts

Posted on October 24, 2016 by Sameer Nori

Offloading cold or unused data and ETL workloads from a data warehouse to Hadoop/big data platforms is a very common starting point for enterprises beginning their big data journey. Platforms like Hadoop provide an economical way to store data and do bulk processing of large data sets; hence, it’s not surprising that cost is the primary driver for this initial use case.

Posted on October 18, 2016 by James Sun

If you’ve been keeping tabs on all the great product enhancements that have been coming out of MapR, you will know that the 5.2 version of the MapR Converged Data Platform went GA this summer. It takes a few cycles to make the platform available on the AWS marketplace, largely due to the testing efforts required.

Posted on October 5, 2016 by Ted Dunning

In this Whiteboard Walkthrough, MapR’s Chief Application Architect, Ted Dunning, explains the move from state to flow and shows how it works in a financial services example. Ted describes the revolution underway in moving from a traditional system with multiple programs built around a shared database to a new flow-based system that instead uses a shared state queue in the form of a message stream built with technology such as Apache Kafka or MapR Streams. This new architecture lets decisions be made locally and supports a micro-services style approach.

Posted on September 14, 2016 by Neeraja Rentachintala

Today we are excited to announce the availability of Drill 1.8 on the MapR Converged Data Platform. As part of the Apache Drill community, we continue to deliver iterative releases of Drill, providing significant feature enhancements along with enterprise readiness improvements based on feedback from a variety of customer deployments.

Posted on September 12, 2016 by Vikash Selvin

Elasticsearch and Kibana are widely used in the market today for data analytics; however, security is one aspect that was not initially built into the products. Since data is the lifeline of any organization today, it becomes essential that Elasticsearch and Kibana be “secured.” In this blog post, we will be looking at one of the ways in which authentication, authorization, and encryption can be implemented for them.

Posted on August 31, 2016 by Mitesh Shah

Mitesh Shah, Senior Product Manager for Security and Data Governance at MapR, describes an important concept in big data security: vulnerability management, a key layer in the trust model. He explains what the MapR Converged Data Platform provides and what role the customer plays in maintaining a secure environment.

Posted on August 30, 2016 by Dong Meng

Apache PredictionIO is an open source machine learning server. In this article, we integrate Apache PredictionIO with the MapR Converged Data Platform 5.1 as a backend. Specifically, we use MapR-DB (1.1.1) for event data storage, Elasticsearch for metadata storage, and MapR-FS for model data storage.

Posted on July 15, 2016 by Ryan Victory

MapR Streams and MapR-DB are both very exciting developments in the MapR Converged Data Platform. In this blog post, I’m going to show you how to get Ruby code to natively interact with MapR-DB and MapR Streams.

Posted on July 13, 2016 by Philippe Cuzey

As a data analyst who primarily used Apache Pig in the past, I eventually needed to program more challenging jobs that required the use of Apache Spark, a more advanced and flexible framework. At first, Spark may look a bit intimidating, but this blog post will show that the transition to Spark (especially PySpark) is quite easy.

Posted on July 8, 2016 by Craig Warman

In this blog post, I’ll describe how to install Apache Drill on the MapR Sandbox for Hadoop, resulting in a "super" sandbox environment that essentially provides the best of both worlds—a fully-functional, single-node MapR/Hadoop/Spark deployment with Apache Drill.

Posted on June 28, 2016 by Martijn Kieboom

This post describes, step by step, how to deploy Mesos, Marathon, Docker, and Spark on a MapR cluster and how to run various jobs as well as Docker containers using this deployment.

Posted on June 27, 2016 by Vince Gonzalez

MapR-FS provides some very useful capabilities for data management and access control. These features can and should be applied to user home directories. A user in a MapR cluster has a lot of capability at their fingertips. They can create files, two styles of NoSQL tables, and pub/sub messaging streams with many thousands of topics.

Posted on June 1, 2016 by Dale Kim

In this week's Whiteboard Walkthrough, Dale Kim, Director of Industry Solutions at MapR, explains the architectural differences between MapR-FS (the MapR File System) and the Hadoop Distributed File System (HDFS).

Posted on May 4, 2016 by Nick Amato

If you’ve had a chance to work with Hadoop or Spark a little, you probably already know that HDFS doesn't support full random read-writes or many other capabilities typically required in a production-ready file system.

Posted on April 29, 2016 by Mathieu Dumoulin

There are many options for monitoring the performance and health of a MapR cluster. In this post, I will present a lesser-known method for monitoring the CLDB using the Java Management Extensions (JMX).

Posted on March 16, 2016 by Mitesh Shah

In this week's Whiteboard Walkthrough, Mitesh Shah, Product Management at MapR, describes how you can track administrative operations and data accesses in your MapR Converged Data Platform in a seamless and efficient way with the built-in auditing feature.

Posted on March 8, 2016 by Anoop Dawar

In 2015, MapR shipped three significant core releases: 4.0.2 in January, 4.1 in April, and 5.0 (along with the GA version of Apache Drill) in July. While all this was happening, many of my colleagues in engineering (who’ve demonstrated a whole new level of ingenuity and multitasking) were also working on one of the biggest releases in the history of MapR: the converged data platform release (AKA, MapR 5.1).

Posted on March 8, 2016 by Jim Scott

The distributed computation world has seen a massive shift in the last decade. Apache Hadoop showed up on the scene and brought with it new ways to handle distributed computation at scale. It wasn’t the easiest to work with, and the APIs were far from perfect, but they worked.

Posted on February 25, 2016 by Harish Thakkallapally

In this article, I am going to show you how to use the impersonation feature to create and access a file in MapR-FS. In this example, we will run a Java program as the “mapr” superuser that will run operations on behalf of the “user01” user.

Posted on February 8, 2016 by Nelson Estrada

This is the second post in a series about the MapR Command Line Interface. The first post gave you an idea of how to use the command line to review your cluster nodes, what services were running, and how to manage them.

Posted on November 19, 2015 by Paul Curtis

Apache Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with Spark SQL, Scala, Hive, Flink, Kylin and more. Zeppelin enables rapid development of Spark and Hadoop workflows with simple, easy visualizations.

Posted on November 5, 2015 by Mitra Kaseebhotla

In this blog post, we’ll go into detail about our solution for providing YARN clusters on shared infrastructure which have the same security and performance.

Posted on October 23, 2015 by Aditya Kishore

In this week's Whiteboard Walkthrough, Aditya Kishore, engineer on the MapR-DB team, explains how to use the OJAI API to insert, search, and update the document database.

Posted on October 13, 2015 by Martijn Kieboom

With MapR version 5.0 being released recently, MapR customers got yet another powerful feature at no additional licensing cost: Auditing! In this two-part blog post, I’ll describe various use cases for auditing as well as instructions for how to deploy these cases in your MapR environment.

Posted on October 1, 2015 by Joseph Blue

It’s difficult to describe what a real breach looks like, but you will know it when you see it. To identify a potential breach, we assess the amount of activity of accounts later experiencing fraud at each merchant and then visualize the results.

Posted on September 11, 2015 by Hao Zhu

In this blog post, I will explain the resource allocation configurations for Spark on YARN, describe the yarn-client and yarn-cluster modes, and include examples. Spark can request two resources in YARN: CPU and memory.

Posted on September 1, 2015 by Mitra Kaseebhotla

Here at MapR, developer productivity is critical to us. In order to keep our pace of innovation high and give customers more choice and flexibility in Apache Hadoop and other open source projects we ship with the MapR Distribution for Hadoop, we apply DevOps methodologies as widely as we can. One critical piece of this is ensuring we can rapidly test our builds to ensure quality in the codebase. Automation is key here, which is what allows us to integrate all the latest innovations across multiple releases from the community in our Hadoop distribution.

Posted on July 30, 2015 by Abizer Adenwala

In this post, I’ll show you what happens behind the scenes, from the time a user fires any job from a Client/Edge node or Cluster node, until the time the job actually gets submitted to the JobTracker for execution.

Posted on July 24, 2015 by Hao Zhu

In this blog post, I will discuss best practices for YARN resource management. The fundamental idea of MRv2 (YARN) is to split up the two major functionalities, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM).

Posted on July 8, 2015 by Andy Lerner

Teradata Connector for Hadoop (TDCH) is a key component of Teradata’s Unified Data Architecture for moving data between Teradata and Hadoop. TDCH invokes a MapReduce job on the Hadoop cluster to push/pull data to/from Teradata databases, with each mapper moving a portion of the data, in parallel across all nodes, for very fast transfers.

Posted on June 26, 2015 by Carol McDonald

Apache HBase is a database that runs on a Hadoop cluster. HBase is not a traditional RDBMS, as it relaxes the ACID (Atomicity, Consistency, Isolation, and Durability) properties of traditional RDBMS systems in order to achieve much greater scalability. Data stored in HBase also does not need to fit into a rigid schema like with an RDBMS, making it ideal for storing unstructured or semi-structured data.

Posted on April 27, 2015 by Ted Dunning

There is some real math behind the idea that you need 3x replication in Hadoop. The basic idea is that when a disk goes bad, you lose an entire stripe of storage.

Posted on February 11, 2015 by Jim Scott

In this week's Whiteboard Walkthrough, Jim Scott, Director of Enterprise Strategy and Architecture at MapR, explains the differences between Apache Mesos and YARN, and why one may or may not be better in global resource management than the other.

Posted on January 6, 2015 by Carol McDonald

SQL will become one of the most prolific use cases in the Hadoop ecosystem, according to Forrester Research. Apache Drill is an open source SQL query engine for big data exploration. REST services and clients have emerged as popular technologies on the Internet. Apache HBase is a hugely popular Hadoop NoSQL database. In this blog post, I will discuss combining all of these technologies: SQL, Hadoop, Drill, REST with JSON, NoSQL, and HBase, by showing how to use the Drill REST API to query HBase and Hive. I will also share a simple jQuery client that uses the Drill REST API, with JSON as the data exchange, to provide a basic user interface.

Posted on December 10, 2014 by James Casaletto

In this week's Whiteboard Walkthrough, James Casaletto walks you through how to configure the network for the MapR Hadoop Sandbox. Whether you use VirtualBox, VMware Fusion, VMware Player, or pretty much any hypervisor on your laptop to support your MapR Sandbox, you'll need to configure the network. There are essentially three different settings that you can use to configure the network for your Sandbox. One is NAT, one is host-only, and one is bridged.

Posted on December 9, 2014 by Neeraja Rentachintala

With Hadoop becoming more prominent in customer environments, one of the frequent questions we hear from users is what the storage format should be to persist data in Hadoop. The data format selection is a critical decision, especially as Hadoop evolves from being about cheap storage to a pervasive query and analytics platform. In this blog, I want to briefly describe self-describing data formats, how they are gaining a lot of interest as a new management paradigm to consumerize Hadoop data in organizations, and the work we have been doing as part of the Parquet community to evolve Parquet as a fully self-describing format.

Posted on December 1, 2014 by Nitin Bandugula

The community version of Apache Hadoop 2.6 was released recently. Some of the cool new things that are part of the Hadoop 2.6 release include changes to YARN to support rolling upgrades, where the ResourceManager and the NodeManager will now preserve state information. Further highlights include label-based scheduling for YARN (with code contributions from an existing MapR feature) along with an alpha feature for running YARN applications natively in Docker containers.

Posted on November 21, 2014 by Jim Bates

One of the challenges with Hadoop is getting value out of it without having to learn all the new skillsets that you need to truly harness Hadoop’s power. The reality of using the MapR Distribution including Hadoop is… you don’t have to know Hadoop to use Hadoop! I recently came up against this again and thought I would throw it out there and hopefully make someone’s journey to their first Hadoop job a no-brainer.

Posted on October 29, 2014 by Venkata Sowrirajan

Scala is one of the most recent and widely used functional programming languages. It is used in some of the Hadoop ecosystem components like Apache Spark, Apache Kafka, etc. So it would be really useful for someone to develop applications using Scala that use Hadoop and the ecosystem projects. In this post, I am going to show you how to run a Scala application that accesses Hadoop data.

Posted on September 29, 2014 by Nitin Bandugula

The capability to process live data streams enables businesses to make real-time, data-driven decisions. The decisions could be based on simple data aggregation rules or even complex business logic. The engines that support these decision models have to be fast, scalable, and reliable, and Hadoop, with its rapidly growing ecosystem, is fast emerging as the data platform that supports such real-time stream processing engines.

Posted on September 16, 2014 by Nitin Bandugula

The September release of the Apache open source packages in MapR is now available for customers. The September updates to the Apache Open Source packages in the MapR Distribution are part of the MapR 4.0.1 major release. Details about the MapR 4.0.1 release can be found here.

Here are the top highlights of this month’s release:

Posted on September 12, 2014 by Jim Scott

Why do this? There are many use cases for time series data, and they usually require handling a decent data ingest rate. Rates of more than 10,000 points per second are common and rates of 1 million points per second are not quite as common, but not outrageously high either.

Posted on July 7, 2014 by Michael Hausenblas

I have a background in Open Data and will say that this area has a lot in common with Open Source software. One of the core tenets is that of freedom. No single organization, independent of its size or accumulated brain power, can ever anticipate all things that are possible, be that with the data or be it with regard to code.

Posted on June 23, 2014 by Karen Whipple

Snapshots come up in most technical discussions of enterprise-grade applications. Hadoop applications are no exception. Jack Norris talks with Industry Analyst Donnie Berkholz about snapshots to help cut to the core of what the issues are in the context of Hadoop on the What Are The Facts video series.

Posted on June 18, 2014 by Karen Whipple

NFS is the Network File System. It's been part of Linux and the broader Unix ecosystem for decades and has long been used both in enterprise environments to share files and in customized environments like high performance computing.

Posted on May 14, 2014 by Patrick Toole

With our recent announcement of HP Vertica’s deployment onto MapR, we have already been flooded with questions about the integration.

Use Cases

Posted on May 7, 2014 by Jon Posnik

SQL-on-Hadoop just got easier this morning. Working together with the HP Vertica team, we are excited to announce general availability of the HP Vertica Analytics Platform running on the MapR Distribution for Apache Hadoop.

Posted on April 29, 2014 by Jacques Nadeau

MapR recently hosted the first Apache Drill hackathon, with nearly forty people in attendance who helped push Drill toward its first beta release. It was great to see people from companies such as Visa, Cisco, LinkedIn and Hortonworks come together to harden and enhance the Apache Drill project. 

The hackathon participants worked on many different aspects of Apache Drill. Over the next few weeks, these features will be incorporated into mainline. Here’s a preview of what we worked on, coming soon to a master near you:

Posted on April 20, 2014 by Neeraja Rentachintala

Following the alpha milestone release in November 2013, the open source incubator project Apache Drill is well on its way towards its next big milestone, the 1.0 beta release.

Posted on April 11, 2014 by Anoop Dawar

On the heels of the recent Spark stack inclusion announcement, here is some more fresh powder (for non-skiers, that’s fresh snow on a mountain).

MapR Distribution of Apache Hadoop: 4.0.0 Beta

Posted on April 10, 2014 by Tomer Shiran

With over 500 paying customers, my team and I have the opportunity to talk to many organizations that are leveraging Hadoop in production to extract value from big data. One of the most common topics raised by our customers in recent months is Apache Spark. Some customers just want to learn more about the advantages of this technology and the use cases that it addresses, while others are already running it in production with the MapR Distribution.

Posted on April 8, 2014 by Michele Nemschoff

We are excited to announce the availability of the installation guide for RHadoop on MapR, titled “RHadoop and MapR: Accessing Enterprise-Grade Hadoop from R”. This highly detailed installation guide explains how to install and use RHadoop with MapR and R on RedHat Enterprise Linux.

Posted on April 4, 2014 by Karen Whipple

Amazon Elastic MapReduce (Amazon EMR) makes it easy to provision and manage Hadoop in the AWS Cloud. The latest webinar from the Amazon Web Services Partner webinar series, titled “Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS,” showed examples of how to use Amazon EMR with the MapR Distribution for Apache Hadoop, and outlined the advantages of using the cloud to increase flexibility and accelerate projects while lowering costs.

Posted on February 26, 2014 by Anoop Dawar

As many of you may recall, YARN was first released in October 2013 and gathered a lot of buzz in the later part of the year. On February 20th, the community voted to release the next version of YARN with Hadoop 2.3.0. We are delighted at the progress that has been made with YARN in particular. Here are the Hadoop 2.3.0 release notes from the Apache Hadoop website.

Posted on February 11, 2014 by Neeraja Rentachintala

Today we are very excited to announce early access of the new HP Vertica Analytics Platform on MapR at the O’Reilly Strata Conference: Making Data Work. This solution tightly integrates HP Vertica’s high-performance analytic platform directly on the MapR Enterprise-Grade Distribution for Hadoop with no “connectors” required. We wanted to provide some additional details on this integration and why this is important for customers.

Posted on February 11, 2014 by Anoop Dawar

It gives me immense pleasure to write this blog on behalf of all of us here at MapR to announce the release of Hadoop 2.x, including YARN, on MapR. Much has been written about Hadoop 2.x and YARN and how it promises to expand Hadoop beyond MapReduce. I will give a quick summary before highlighting some of the unique benefits of Hadoop 2.x and YARN in the MapR Distribution for Hadoop.


Posted on February 10, 2014 by Ali Hussain

We, at Flux7 Labs, a solutions company, help customers maximize performance/$. To help our customers make the right decisions we constantly research and evaluate the solutions available to customers and thereby build and strengthen our internal knowledge. As part of this research process, we evaluated the most common Hadoop distributions on various metrics. The distributions we tested were from Intel, Cloudera, Hortonworks, and MapR. This testing was done independently on all the distributions.

Posted on December 15, 2013 by Steve Wooledge

The advancement in SQL development for Hadoop is making it possible for SQL professionals to apply their skills and SQL tools of choice to Big Data projects. Based on their use case, SQL pros can choose from Apache projects Hive and Drill, Apache-licensed Impala, and proprietary technologies such as Hadapt, HAWQ, and Splice Machine. Hive, the most mature of these technologies, is widely used and best suited for long running queries. The other technologies are very new to the marketplace.

Posted on September 18, 2013 by Ted Dunning

A person I know recently asked for advice about an interesting but very common scenario. He wanted to join a large reference file that didn’t get updated very often with a somewhat smaller file that was updated daily, but he wasn’t quite sure how to go about it in Apache Hadoop.

Posted on July 19, 2013 by Karen Whipple

Mike Gualtieri, Principal Analyst with Forrester Research, joined us for a webinar titled Productionizing Hadoop: Seven Architectural Best Practices. Following the webinar, Mike answered a number of questions from participants, including a question about productionizing Hadoop.

Q: How do you move from the test phase to productionizing Hadoop? How do you put it into practice?

Posted on May 13, 2013 by Jim Fiori


Hadoop provides a compelling distributed platform for processing massive amounts of data in parallel using the Map/Reduce framework and the Hadoop Distributed File System. A Java API allows developers to express the processing in terms of a map phase and a reduce phase, where both phases use key/value pairs or key/value list pairs as input/output.
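
As a rough illustration of the map and reduce phases described above, here is a minimal single-process sketch. It is written in Python rather than against the Java API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and are not part of Hadoop; a real job would distribute each phase across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: turn one input line into (word, 1) key/value pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, producing key/value-list pairs for the reducers."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key, values):
    """Reduce: collapse one key's value list into a single output pair."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an in-memory input."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

The intermediate grouping step is what turns the mappers' key/value pairs into the key/value list pairs that the reduce phase consumes.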

Posted on March 11, 2013 by Michael Hausenblas

Gartner Analyst Merv Adrian weighed in on “Hadoop Open Source, ‘Purity,’ and Market Realities” in his blog over the weekend. He writes:

“The fact is, there is an entire industry building products atop Apache open source code – and that is the point of having Apache license its projects and provide the other services it does for the open source community…

Posted on January 10, 2013 by Andy Lerner

RHadoop and MapR technical brief now available

If you are a data analyst or statistician familiar with the R programming language and you want to use Hadoop to run MapReduce jobs or access HBase tables, Revolution Analytics has created RHadoop to make your life easier. You can use all of your existing R programs and add MapReduce and HBase functionality. You get all the statistical analysis capabilities of your R environment with the enterprise-grade, massively scalable, distributed compute provided by MapR's Hadoop distribution.

Posted on May 25, 2012 by Keys Botzum

Eclipse is a popular development tool and we thought it would be helpful to share some tips on using Eclipse with MapR to write MapReduce programs. The following notes describe how to enable Eclipse as your development IDE for Hadoop MapReduce jobs using an existing MapR cluster.
