The MapR Distribution for Apache Hadoop is the easiest, most dependable, and fastest Hadoop distribution on the planet. It is the only Hadoop distribution that allows direct data input and output via MapR Direct Access NFS™ with realtime analytics, and the first to provide true High Availability (HA) at all levels. MapR introduces logical volumes to Hadoop. A volume is a way to group data and apply policy across an entire data set. MapR provides hardware status and control with the MapR Control System, a comprehensive UI including a Heatmap™ that displays the health of the entire cluster at a glance.
In this section, you can learn about MapR's unique features and how they provide the highest performing, lowest cost Hadoop available.
To learn more about MapR, including information about MapR partners, see the following sections:
MapR Provides Complete Hadoop Compatibility
MapR is a complete Hadoop distribution.
MapR provides the following packages:
- Apache Hadoop 0.20.2
- Cascading 2.0
- Flume 1.2.0
- HBase 0.92.1
- HCatalog 0.4.0
- Hive 0.9.0
- Mahout 0.6 and 0.7
- Oozie 3.1.0
- Pig 0.10.0
- Sqoop 1.4.1
- Whirr 0.7.0
For more information, see the Version 2.0 Release Notes.
Intuitive, Powerful Cluster Management with the MapR Control System
The MapR Control System webapp provides powerful hardware insight down to the node level, as well as complete control of users, volumes, quotas, mirroring, and snapshots. Filterable alarms and notifications provide immediate warnings about hardware failures or other conditions that require attention, allowing a cluster administrator to detect and resolve problems quickly.
MapR lets you control data access and placement, so that multiple concurrent Hadoop jobs can safely share the cluster.
Provisioning resources is simple. You can create a volume for a project or department in a few clicks. MapR integrates with NIS and LDAP, making it easy to manage users and groups. The MapR Control System provides a flexible web-based user interface for cluster administration. From the MapR Control System, you can assign user or group quotas, limit the amount of data a user or group can write, or limit a volume's size.
Setting recovery time objective (RTO) and recovery point objective (RPO) points for a data set is a simple matter of scheduling snapshots and mirrors on a volume through the MapR Control System. You can set read and write permissions on volumes directly via NFS or using hadoop fs commands, and volumes provide administrative delegation through Access Control Lists (ACLs). Through the MapR Control System you can control who can mount, unmount, snapshot, or mirror a volume.
Because MapR is a complete Hadoop distribution, you can run your Hadoop jobs the way you always have.
Unrestricted Writes to the Cluster with MapR Direct Access NFS
The MapR NFS service lets you access data on a licensed MapR cluster via the NFS protocol. You can mount the cluster through NFS on a variety of clients.
Clusters with the M3 license can run MapR NFS on one node, enabling you to mount your cluster as a standard POSIX-compliant filesystem. Once your cluster is mounted on NFS, you can use standard shell scripting to read and write live data in the cluster.
You can run multiple NFS server nodes by upgrading to the M5 license level. You can use virtual IP addresses (VIPs) to provide transparent NFS failover with multiple NFS servers. You can also have each node in your cluster self-mount to NFS to make all of your cluster's data available from every node. These NFS self-mounts enable you to run standard shell scripts to work with the cluster's Hadoop data directly.
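Because an NFS-mounted cluster behaves like an ordinary filesystem, any tool that does plain file I/O can work with cluster data. The following is a minimal Python sketch of that idea, not a MapR API. The helper names append_event and count_lines are hypothetical, and the mount point you pass in is an assumption that depends on how your cluster is mounted.

```python
from pathlib import Path


def append_event(cluster_root, relative_path, line):
    """Append one line to a file under the mount point using ordinary file I/O.

    cluster_root is wherever the cluster is mounted on this client
    (an assumption; the actual path depends on your mount configuration).
    """
    target = Path(cluster_root) / relative_path
    # Create intermediate directories just as you would on a local disk.
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("a", encoding="utf-8") as handle:
        handle.write(line.rstrip("\n") + "\n")
    return target


def count_lines(cluster_root, relative_path):
    """Count the lines in a file under the mount point."""
    target = Path(cluster_root) / relative_path
    with target.open("r", encoding="utf-8") as handle:
        return sum(1 for _ in handle)
```

For example, if the cluster were mounted at a hypothetical path such as /mapr/my.cluster.com, calling append_event("/mapr/my.cluster.com", "projects/demo/events.log", "job started") would write directly into the cluster with no Hadoop-specific client code.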
Data Protection, Availability, and Performance with Volume Management
With volumes, you can control access to data, set replication factor, and place specific data sets on specific racks or nodes for performance or data protection. Volumes control data access to specific users or groups with Linux-style permissions that integrate with existing LDAP and NIS directories. Use volume quotas to prevent data overruns from consuming excessive storage capacity.
One of the most powerful aspects of the volume concept is the ways in which a volume provides data protection:
- To enable point-in-time recovery and easy backups, volumes have manual and policy-based snapshot capability.
- For true business continuity, you can manually or automatically mirror volumes and synchronize them between clusters or datacenters to enable easy disaster recovery.
- You can set volume read/write permission and delegate administrative functions to control data access.
You can export volumes with MapR Direct Access NFS with HA, allowing data read and write operations directly to Hadoop without the need for temporary storage or log collection. Multiple NFS nodes provide the same view of the cluster regardless of where the client connects.
Realtime Hadoop Analytics: Intuitive and Powerful Performance Metrics
New in the 2.0 release, the MapR Job Metrics service provides in-depth access to the performance statistics of your cluster and the jobs that run on it. With MapR Job Metrics, you can examine trends in resource use, diagnose unusual node behavior, or examine how changes in your job configuration affect the job's execution.
The MapR Node Metrics service, also new in the 2.0 release, provides detailed information on the activity and resource usage of specific nodes within your cluster.
Critical MapR services collect information on cluster resource utilization and activity that you can write directly to a file or integrate into the Ganglia third-party tool.
Expand Your Capabilities with Third-Party Solutions
MapR has partnered with Datameer, which provides a self-service Business Intelligence platform that runs best on the MapR Distribution for Apache Hadoop. Your download of MapR includes a 30-day trial version of Datameer Analytics Solution (DAS), which provides spreadsheet-style analytics, ETL and data visualization capabilities.
For More Information
- Read about Provisioning Applications
- Learn about Direct Access NFS
- Check out Datameer
Reliability, Fault-Tolerance, and Data Recovery with MapR
With clusters growing to thousands of nodes, hardware failures are inevitable even with the most reliable machines in place. The MapR Distribution for Hadoop has been designed from the ground up to seamlessly tolerate hardware failure.
MapR is the first Hadoop distribution to provide true high availability (HA) and failover at all levels, including a MapR Distributed HA NameNode™. If a disk or node in the cluster fails, MapR automatically restarts any affected processes on another node without requiring administrative intervention. The HA JobTracker ensures that any tasks interrupted by a node or disk failure are restarted on another TaskTracker node. In the event of any failure, the job's completed task state is preserved and no tasks are lost. For additional data reliability, every bit of data on the wire is compressed and CRC-checked.
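To illustrate the general idea of compressing data and protecting it with a CRC, here is a small Python sketch built on the standard zlib module. It is not MapR's wire format, which this page does not describe. The frame layout (a 4-byte CRC followed by the compressed body) and the function names pack and unpack are assumptions made for the example.

```python
import zlib


def pack(payload: bytes) -> bytes:
    """Compress a payload and prefix it with a CRC-32 of the compressed bytes.

    The 4-byte big-endian prefix is an illustrative layout, not a real protocol.
    """
    compressed = zlib.compress(payload)
    checksum = zlib.crc32(compressed)
    return checksum.to_bytes(4, "big") + compressed


def unpack(frame: bytes) -> bytes:
    """Verify the CRC-32 prefix, then decompress the body."""
    expected = int.from_bytes(frame[:4], "big")
    body = frame[4:]
    if zlib.crc32(body) != expected:
        # Corruption in transit is detected before any data is used.
        raise ValueError("CRC mismatch: frame was corrupted")
    return zlib.decompress(body)
```

A receiver that calls unpack either gets back exactly the bytes that were packed or gets an error, so corrupted data is never silently accepted.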
For more information:
- Take a look at the Heatmap
- Learn about Volumes, Snapshots, and Mirroring
- Explore Data Protection scenarios
- Read about Job Metrics and Node Metrics
High-Performance Hadoop Clusters with MapR Direct Shuffle
The MapR distribution for Hadoop achieves up to three times the performance of any other Hadoop distribution, and can reduce your equipment costs by half.
MapR Direct Shuffle uses the Distributed NameNode to drastically improve Reduce-phase performance. Unlike Hadoop distributions that use the local filesystem for shuffle and HTTP to transport shuffle data, MapR shuffle data is readable directly from anywhere on the network. MapR stores data with Lockless Storage Services™, a sharded system that eliminates contention and overhead from data transport and retrieval. Automatic, transparent client-side compression reduces network overhead and reduces footprint on disk, while direct block device I/O provides throughput at hardware speed with no additional overhead. As an additional performance boost, with MapR Realtime Hadoop, you can read files while they are still being written.
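Reading a file while it is still being written can be sketched with ordinary file I/O: a reader remembers how far it has read and picks up only the bytes appended since. The Python below is a minimal sketch of that pattern, assuming the file is reachable as a normal path (for example through an NFS mount). The function name read_appended is hypothetical, and how quickly new data becomes visible on a real cluster may depend on the client configuration.

```python
def read_appended(path, offset):
    """Return the bytes written to path since offset, plus the new offset.

    Call repeatedly with the returned offset to follow a growing file.
    """
    with open(path, "rb") as handle:
        handle.seek(offset)
        data = handle.read()
    return data, offset + len(data)
```

A monitoring script could call read_appended in a loop with a short sleep between calls, processing each batch of new bytes while the writer keeps the file open.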
MapR gives you ways to tune the performance of your cluster. Using mirrors, you can load-balance reads on highly-accessed data to alleviate bottlenecks and improve read bandwidth to multiple users. You can run MapR Direct Access NFS on many nodes – all nodes in the cluster, if desired – and load-balance reads and writes across the entire cluster. Volume topology helps you further tune performance by allowing you to place resource-intensive Hadoop jobs and high-activity data on the fastest machines in the cluster.
For more information:
- Read about Tuning Your MapR Install
- Read about Provisioning for Performance
Get Started
Now that you know a bit about how the features of MapR Distribution for Apache Hadoop work, take a quick tour to see for yourself how they can work for you:
- Quick Start - Test Drive MapR on a Virtual Machine - Try out a single-node cluster that's ready to roll, right out of the box!
- Installation Guide - Learn how to set up a production cluster, large or small
- Development Guide - Read more about what you can do with a MapR cluster
- Administration Guide - Learn how to configure and tune a MapR cluster for performance