Get Real with Hadoop: Read-Write File System

In this blog series, we’re showcasing the top 10 reasons customers are turning to MapR in order to create new insights and optimize their data-driven strategies. Here’s reason #9: MapR provides a read-write file system for real-time Hadoop.

When the Apache Hadoop project began, it was intended for web crawling and indexing the internet. That narrow use case required nothing more than a write-once, read-many (append-only) paradigm, which became the foundation of HDFS. Web crawling, however, is only a small slice of the use cases that Apache Hadoop serves today.

The MapR Data Platform, which is the foundation of the MapR Distribution including Apache Hadoop, delivers a true file system that is POSIX-compliant with full random read-write capability. Instead of setting up Linux with EXT4 and then installing HDFS on top of that, you set up Linux with MapR-FS. Because this architecture has fewer layers, it delivers significant speed benefits.

Comparing append-only with random read-write is like comparing a typewriter with a word processor. If you want to go back and change something with a typewriter, you need to start the page over and retype everything. With a word processor, you simply make the change at the appropriate place in the document.
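The difference can be sketched with ordinary file I/O. This is a minimal illustration on a local temporary directory, not MapR or HDFS code; the function names `patch_in_place`, `patch_by_rewrite`, and `demo_patch` are made up for this example. The first function changes a few bytes where they sit, as a random read-write file system allows. The second emulates the write-once approach by producing a complete new copy of the file.

```python
import os
import tempfile


def patch_in_place(path, offset, new_bytes):
    """Random write: seek to the offset and overwrite only those bytes."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(new_bytes)
    return len(new_bytes)  # bytes actually written


def patch_by_rewrite(path, offset, new_bytes):
    """Write-once emulation: build a full new copy, then swap it in."""
    with open(path, "rb") as f:
        old = f.read()
    updated = old[:offset] + new_bytes + old[offset + len(new_bytes):]
    tmp_path = path + ".new"
    with open(tmp_path, "wb") as f:
        f.write(updated)
    os.replace(tmp_path, path)
    return len(updated)  # the whole file had to be written again


def demo_patch():
    """Apply the same edit both ways and report (content, bytes written)."""
    results = {}
    with tempfile.TemporaryDirectory() as d:
        for name, patch in (("in_place", patch_in_place),
                            ("rewrite", patch_by_rewrite)):
            path = os.path.join(d, name + ".txt")
            with open(path, "wb") as f:
                f.write(b"The quick brown fox")
            written = patch(path, 4, b"slick")
            with open(path, "rb") as f:
                results[name] = (f.read(), written)
    return results
```

Both approaches end with the same content, but the in-place edit writes only the changed bytes, while the rewrite touches the whole file. That gap grows with file size.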

Let’s take a look at the different parts of the MapR Distribution that benefit from a read-write capable file system.

NFS
HDFS NFS support has to write data temporarily to the local file system before it lands in HDFS. There are two major problems with this. First, the data can potentially be copied out of order. Second, space must be reserved in the local file system so that NFS has enough room to land data before it is copied into HDFS.

MapR NFS support, on the other hand, is true NFS. It is accessed like any other storage device. Any application you have that can read and write to an NFS mount can read and write to MapR-FS. You don’t need to reserve local storage for it to work.
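As a rough sketch of what "any application that can read and write to an NFS mount" means in practice, the code below uses nothing but standard file calls. The mount location is an assumption: `mount_root` stands for wherever the NFS export is mounted on your machine, and the helper names and the `logs/app.log` path are hypothetical. When no mount is given, the demo falls back to a temporary local directory so it can run anywhere.

```python
import os
import tempfile


def append_event(mount_root, rel_path, line):
    """Append one line to a file under the mount with ordinary file I/O."""
    path = os.path.join(mount_root, rel_path)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
    return path


def read_events(mount_root, rel_path):
    """Read the file back as a list of lines, again with plain file I/O."""
    with open(os.path.join(mount_root, rel_path), encoding="utf-8") as f:
        return f.read().splitlines()


def demo_nfs_style_io(mount_root=None):
    # Assumption: on a real cluster, pass the directory where the NFS
    # export is mounted. Without it, a temporary directory stands in.
    with tempfile.TemporaryDirectory() as fallback:
        root = mount_root or fallback
        append_event(root, "logs/app.log", "service started")
        append_event(root, "logs/app.log", "service ready")
        return read_events(root, "logs/app.log")
```

Nothing in the code is specific to Hadoop: no client library, no staging directory, only paths under a mount point.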

In addition to MapR NFS, MapR also supports the HDFS API, giving you even more options for integrating the MapR Distribution in your environment.

NameNode
The NameNode in Apache Hadoop is a single point of failure and a choke point for the platform. It limits the cluster to around 50-100 million total files in the system.

MapR doesn’t have a NameNode. The MapR distributed metadata architecture enables a single MapR cluster to support one trillion files and database tables. This is directly enabled by a random read-write file system. The MapR no-NameNode architecture means fewer hassles and less administrative overhead. Friends don’t let friends run NameNodes.

Real-time Hadoop
Apache HBase had to implement concepts like tombstones and compactions in order to run on HDFS. They are workarounds for a write-once, read-many file system. Automatic compactions and region splits can make the platform unstable under heavy production loads, and the usual recommendation is to disable them in a production environment.
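To see why an append-only store needs these workarounds, consider a toy key-value log. This is a deliberately simplified illustration and not how HBase is implemented; the class `AppendOnlyStore` and its methods are invented for this sketch. Because existing records can never be modified, a delete is recorded as a marker (a tombstone), and stale records pile up until a compaction rewrites the log.

```python
TOMBSTONE = object()  # marker meaning "this key was deleted"


class AppendOnlyStore:
    """Toy key-value store that can only append records to its log."""

    def __init__(self):
        self.log = []  # (key, value) records, oldest first

    def put(self, key, value):
        # An update cannot change the old record; it appends a newer one.
        self.log.append((key, value))

    def delete(self, key):
        # A delete cannot remove anything; it appends a tombstone.
        self.log.append((key, TOMBSTONE))

    def get(self, key):
        # The newest record for a key wins, so scan from the end.
        for k, v in reversed(self.log):
            if k == key:
                return None if v is TOMBSTONE else v
        return None

    def compact(self):
        # Rewrite the log, keeping only the latest live value per key.
        latest = {}
        for k, v in self.log:
            latest[k] = v
        self.log = [(k, v) for k, v in latest.items() if v is not TOMBSTONE]
```

For example, after `put("a", 1)`, `put("a", 2)`, `put("b", 3)`, and `delete("b")`, the log holds four records even though only one live value remains. Calling `compact()` shrinks it to that single record, at the cost of rewriting the data.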

MapR-DB implements the same API as HBase, but because it is implemented on a random read-write-capable file system, it doesn’t need tombstones or compactions. This enables high performance (an average of 2-7x faster than standard Apache HBase) and consistent low latency for your operational applications using MapR-DB.

In summary, the MapR Data Platform provides support for more standards, superior performance, and greater scalability while reducing the amount of hardware and administration required for Hadoop. Take MapR for a test drive yourself and see what a difference it makes.


And get the complete top 10 list here.