Apache Hadoop for the MapR Converged Data Platform includes the latest innovations from the Hadoop 2.X and open source communities such as Apache HBase™, Apache Storm™, Apache Pig, Apache Hive™, Apache Mahout™, YARN, Apache Sqoop™, Apache Flume™, and more. MapR not only provides advanced high availability (HA) and data protection features such as resilience to multiple node failures, snapshots, and mirroring for disaster recovery (DR), but also enables seamless Hadoop access and data management through industry-standard interfaces such as NFS and ODBC.
Apache Hadoop is a software library that enables distributed processing of large data sets across a cluster of servers. MapR delivers on the promise of Hadoop with a proven, enterprise-grade platform that supports a broad set of mission critical and real-time production uses.
Apache Hadoop for the MapR Converged Data Platform combines open source Hadoop software with enterprise-grade MapR Platform Services to make Hadoop more powerful and secure. MapR is the production choice for Hadoop, providing the industry’s only converged data platform that combines the processing power of Hadoop and Spark with global event streaming, real-time database capabilities, and enterprise storage.
Open Choice, Open Source
Project choices. MapR supports a broad set of Hadoop projects, including the entire Apache Spark™ stack, YARN, Apache Drill, Impala, and more. MapR helps customers select the right tool for their specific requirements.
Monthly certified updates. MapR certifies project updates monthly, giving you access to the latest projects on Hadoop.
Backward compatibility. MapR lets you upgrade specific projects without needing to upgrade core Hadoop packages. Additionally, MapR lets you upgrade Hadoop and run your existing applications as is without rewriting them.
Operational Analytics with In-Hadoop NoSQL
MapR-DB, the high-performance integrated NoSQL database, lets you run analytics on live data without copying it, and deploy multiple use cases and workloads in a single, operationally efficient cluster.
High-Throughput Event Streaming
MapR Streams enables you to deliver event data streams globally and reliably for real-time processing. With MapR Streams, you can connect data producers and consumers in a high-performance publish/subscribe model.
Self-Service SQL Analytics on Hadoop
Apache Drill on MapR lets you immediately query complex datasets such as deeply nested data, NoSQL data, and data with rapidly evolving schemas, without requiring schema preparation. ANSI SQL support lets you use your existing business intelligence tools.
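As a rough illustration of querying nested data without schema preparation, the sketch below builds a request for Drill's REST query endpoint using only the Python standard library. It assumes a Drill web server reachable at its default port 8047; the host name, the file path `/data/orders.json`, and the nested field names are hypothetical placeholders, not values from this document.

```python
import json
import urllib.request

# Hypothetical query: reads a nested field straight from a JSON file.
# The path and the field names (customer.address.city) are illustrative only.
NESTED_QUERY = (
    "SELECT t.customer.address.city AS city "
    "FROM dfs.`/data/orders.json` t LIMIT 10"
)


def build_drill_query_request(sql, base_url="http://localhost:8047"):
    """Build (but do not send) a POST request for Drill's /query.json endpoint."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/query.json",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_drill_query(sql, base_url="http://localhost:8047"):
    """Send the query and return the parsed JSON response (needs a live Drill server)."""
    request = build_drill_query_request(sql, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `run_drill_query(NESTED_QUERY)` against a running Drill instance would submit the query; existing BI tools would typically connect through ODBC/JDBC instead, so this is only one possible access path.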
Standard Hadoop tool support. MapR supports all Hadoop APIs and Hadoop data processing tools to access Hadoop data. You can easily move data from the MapR Platform into other distributions, and vice versa.
Standards-based file access. Unlike other distributions, MapR provides true Network File System (NFS) capabilities. MapR Direct Access NFS™ lets you access Hadoop like a standard file system, to copy data into and out of Hadoop easily at high rates, or to access Hadoop data using common command line tools and desktop applications. The optional add-on MapR POSIX Client provides authenticated NFS access from remote nodes, along with over-the-wire compression and parallel access to boost throughput.
Industry standards. MapR fully supports additional industry-standard APIs, including ODBC/JDBC, LDAP, Kerberos, HBase, HDFS, NFS, and more.
Third-party tool ecosystem. The entire ecosystem of third-party tools (BI, ETL, etc.) built for use on Hadoop works on MapR. Examples of certified tools are available at the MapR App Gallery (mapr.com/appgallery).
Portable applications. Hadoop applications built on MapR run on any other Hadoop distribution, and vice versa, with no code changes or recompilation.
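To make the standards-based file access described above concrete, here is a minimal sketch of what "access Hadoop like a standard file system" can look like from a client: once the cluster is NFS-mounted, ordinary file calls are enough. The mount point `/mapr/my.cluster.com` mentioned in the comments and the helper names are assumptions for illustration; the demo at the bottom uses a temporary directory as a stand-in so that it runs anywhere.

```python
import os
import shutil
import tempfile


def copy_into_cluster(local_path, mount_root, dest_rel):
    """Copy a local file to a path under an NFS-mounted cluster directory.

    mount_root is wherever the cluster is mounted on this machine,
    for example /mapr/my.cluster.com (a hypothetical mount point).
    """
    dest = os.path.join(mount_root, dest_rel)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.copyfile(local_path, dest)
    return dest


def count_lines(path):
    """Read a file on the mount with plain file I/O and count its lines."""
    with open(path, "r", encoding="utf-8") as handle:
        return sum(1 for _ in handle)


if __name__ == "__main__":
    # A temporary directory stands in for the real NFS mount in this demo.
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "events.csv")
        with open(source, "w", encoding="utf-8") as handle:
            handle.write("id,value\n1,10\n2,20\n")
        fake_mount = os.path.join(tmp, "mount")
        os.makedirs(fake_mount)
        copied = copy_into_cluster(source, fake_mount, "landing/events.csv")
        print(copied, count_lines(copied))
```

Nothing in the sketch is Hadoop-specific, which is the point: the same code works against a local disk or an NFS mount, so common command line tools and desktop applications behave the same way.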
Kerberos and LDAP integration. MapR supports authentication services via Kerberos and/or LDAP.
Access control. Data is secured using standard Unix file permissions and advanced role-based access control expressions (ACEs).
Native authentication. As a simpler alternative to Kerberos, MapR also offers a standards-based authentication system that leverages Linux Pluggable Authentication Modules (PAM) to provide the widest registry support.
Comprehensive auditing. MapR auditing logs help to analyze user behavior as well as to meet regulatory compliance requirements. MapR uses the JSON format to log accesses at the administrative, authentication, database, and file levels.
Performant wire-level encryption. MapR encrypts data sent between nodes and applications to ensure data privacy, using Intel AES-NI capabilities where available.
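Because the audit logs mentioned above are written as JSON, they can be analyzed with ordinary tooling. The sketch below tallies operations per user, assuming one JSON object per line. That layout and the field names `uid` and `operation` are assumptions made for illustration; check the actual audit log schema before relying on them.

```python
import json
from collections import Counter


def count_operations_by_user(lines):
    """Tally (user, operation) pairs from JSON audit records, one per line.

    The field names "uid" and "operation" are assumed for this sketch.
    Blank lines and lines that are not JSON objects are skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore malformed lines rather than abort the scan
        if not isinstance(record, dict):
            continue
        user = record.get("uid", "unknown")
        operation = record.get("operation", "unknown")
        counts[(user, operation)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '{"uid": "alice", "operation": "READ"}',
        '{"uid": "alice", "operation": "READ"}',
        '{"uid": "bob", "operation": "DELETE"}',
    ]
    for (user, operation), total in sorted(count_operations_by_user(sample).items()):
        print(user, operation, total)
```

The same function accepts an open file object, so a real log could be passed directly with `count_operations_by_user(open(path))`.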
MapR supports multi-tenancy beyond the capabilities in YARN via advanced resource management and control to let distinct user groups, data sets, and applications coexist in isolation in the same cluster.
Security. MapR authentication and authorization controls provide another level of user and data isolation.
Volumes. MapR supports the logical grouping of files and directories on which policies (permissions, replication factors, quotas, etc.) can be set.
ExpressLane. MapR avoids starvation of small jobs by letting them run even when the cluster is busy with large jobs.
Job placement control. Going beyond YARN, MapR provides label-based job placement, which lets you specify which nodes can run a given job.
Data placement control. Configure the cluster topology to define on which nodes specific data is placed for performance, security, and optimal utilization purposes.
Performance and Scalability
Customers can reduce their data center footprint with the MapR performance advantage by deploying as few as one-third as many servers as other distributions require. Faster file access and a faster, optimized shuffle for MapReduce let customers get more work out of their hardware investment. A MapR cluster can scale to thousands of nodes and can store trillions of files.
MapR officially set the MinuteSort record by sorting 1.5 TB of data in under a minute on Google Compute Engine. A MapR customer has since exceeded that record by sorting 1.65 TB, using one-seventh as many servers as the highest non-MapR record.
High Availability (HA)
MapR HA eliminates single points of failure to tolerate multiple node failures and ensure no unplanned downtime, no data loss, and no work loss. MapR HA requires no special configuration and is enabled automatically.
YARN HA and JobTracker HA. Jobs are tracked so that they run to completion despite node failures.
No-NameNode architecture. Cluster filename metadata is distributed to ensure the cluster data is always available and accessible.
NFS HA. Continuous NFS access is ensured to avoid disruptions to standard file system access.
Management and Monitoring
MapR Control System. The MapR Control System (MCS) is a browser-based interface for managing, administering, and monitoring your Hadoop cluster. It lets you immediately view the status of your cluster via heatmaps and drill into specific issues to investigate problems. Alarms proactively notify you if potential problems arise.
Rolling upgrades. To minimize planned downtime, MapR allows a node-by-node Hadoop upgrade on a live cluster. With MapR backward compatibility, existing applications can still run on an upgraded Hadoop cluster with no modifications.
Disaster Recovery (DR)
Customers can maintain continuity despite a site-wide disaster, and can also quickly recover damaged or accidentally deleted files.
Mirrors. Mirrors are consistent copies of a cluster replicated to a remote site, either on-premises or in the cloud. Scheduled mirroring incrementally updates the mirror by sending only block-level differentials from the source cluster, which shortens the recovery point objective (RPO). Promotable Mirrors enable fast and easy switchover of replicas to active production use, which shortens the recovery time objective (RTO). Mirrors can also be used for load balancing, as well as for wide geographic distribution to reduce network latency for distant end users.
Consistent snapshots. Snapshots capture the exact state of the cluster at the time they are taken, enabling point-in-time recovery of files that were corrupted or deleted due to application or user error. Snapshots are also useful for running machine learning algorithms on a static view of data, as well as for auditing data sets.
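The idea of sending only block-level differentials to a mirror can be illustrated with a small, generic sketch. This is not MapR's mirroring implementation: comparing per-block SHA-256 digests is an assumption chosen for simplicity, and the tiny block size exists only to keep the demo readable.

```python
import hashlib

DEMO_BLOCK_SIZE = 4  # unrealistically small, for illustration only


def block_digests(data, block_size):
    """Return one SHA-256 digest per fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(source, mirror, block_size):
    """List indexes of source blocks that are missing or different on the mirror."""
    source_digests = block_digests(source, block_size)
    mirror_digests = block_digests(mirror, block_size)
    return [
        index
        for index, digest in enumerate(source_digests)
        if index >= len(mirror_digests) or mirror_digests[index] != digest
    ]


def update_mirror(source, mirror, block_size):
    """Bring the mirror in line with the source by copying only changed blocks."""
    changed = changed_blocks(source, mirror, block_size)
    updated = bytearray(mirror[:len(source)])  # drop data past the source's end
    if len(updated) < len(source):
        updated.extend(b"\0" * (len(source) - len(updated)))  # room for new blocks
    for index in changed:
        start = index * block_size
        updated[start:start + block_size] = source[start:start + block_size]
    return bytes(updated), changed


if __name__ == "__main__":
    new_mirror, sent = update_mirror(b"aaaabbbbcccc", b"aaaaXXXXcccc", DEMO_BLOCK_SIZE)
    print(new_mirror == b"aaaabbbbcccc", sent)
```

The fewer blocks that change between scheduled runs, the less data crosses the network, which is why incremental updates make frequent mirroring (and so a shorter RPO) practical.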
MapR Breakthrough Innovations
- Performance-optimized architecture for faster data processing and analytics
- Architecture designed specifically for high availability across all cluster operations
- Automatic disaster recovery through mirroring to synchronize data across clusters
- Direct Access NFS™ for real-time data access to Hadoop data
- Distributed metadata to support trillions of files in a single cluster
- Comprehensive security controls to protect sensitive data
- Consistent snapshots for accurate point-in-time recovery
- MapR Heatmap™ for instant cluster insights
- MapR volumes for easier policy management around security, placement, retention, and quotas
- Integrated NoSQL and event streaming for advanced real-time capabilities