MapR packages a broad set of Apache open source ecosystem projects that enable big data applications. The goal is to provide you with an open platform that lets you choose the right tool for the job. MapR tests and integrates open source ecosystem projects such as Hive™, Pig™, Apache™ HBase™ and Mahout, among others. The MapR Converged Data Platform and the open source projects are tied together through an advanced management console to monitor and manage the system.

MapR also makes available Developer Previews for new features and technologies that are still under development.

MapR Converged Data Platform

Core Hadoop

Apache Hadoop was born out of a need to process an avalanche of big data. The web was generating more and more information on a daily basis, and it was becoming very difficult to index over one billion pages of content. Hadoop has moved far beyond its beginnings in web indexing and is now used in many industries for a wide range of tasks that share a common theme: high variety, volume, and velocity of data, both structured and unstructured.

YARN (Yet Another Resource Negotiator) is a core component of Hadoop that manages access to all resources in a cluster. Before YARN, jobs were forced to go through the MapReduce framework, which is designed for long-running batch operations. Now, YARN brokers access to cluster compute resources on behalf of multiple applications, using selectable criteria such as fairness or capacity, allowing for a more general-purpose experience.
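
To make the difference between fairness and capacity criteria concrete, here is a toy sketch in Python. It is not YARN's scheduler code: the function names, the application names, and the numbers are all illustrative assumptions, and real schedulers handle many more concerns (queues, preemption, elasticity, locality).

```python
def fair_share(total, demands):
    """Max-min fair split of `total` resource units among applications.

    Each round divides what is left equally among applications that still
    want more, never giving an application more than it asked for.
    """
    alloc = {app: 0.0 for app in demands}
    remaining = float(total)
    active = {app for app, d in demands.items() if d > 0}
    while active and remaining > 1e-9:
        share = remaining / len(active)
        for app in sorted(active):
            give = min(share, demands[app] - alloc[app])
            alloc[app] += give
            remaining -= give
        # Drop applications whose demand is now satisfied.
        active = {app for app in active if demands[app] - alloc[app] > 1e-9}
    return alloc


def capacity_share(total, guaranteed_pct, demands):
    """Give each application at most its guaranteed percentage of the cluster.

    Simplification: unused capacity is NOT lent to other applications here.
    """
    return {app: min(demands[app], total * guaranteed_pct[app] / 100)
            for app in demands}


# Hypothetical demands, in arbitrary resource units, on a 100-unit cluster.
demands = {"a": 20, "b": 50, "c": 80}
fair = fair_share(100, demands)            # roughly a=20, b=40, c=40
capped = capacity_share(100, {"a": 50, "b": 30, "c": 20}, demands)
```

Under the fair policy the small application is fully served and the rest is split evenly; under the capacity policy each application is bounded by its configured guarantee.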

Batch

Apache MapReduce is a powerful framework for processing large, distributed sets of structured or unstructured data on a Hadoop cluster. The key feature of MapReduce is its ability to perform processing across an entire cluster of nodes, with each node processing its local data. This feature makes MapReduce orders of magnitude faster than legacy methods of processing big data, which often consisted of a single node accessing and processing data located in remote SAN or NAS devices.
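
The map, shuffle, and reduce steps can be sketched in a few lines of plain Python. This is a single-process illustration of the programming model only, not the Hadoop API; the function names and the sample input are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(split):
    """Emit (word, 1) for every word in one input split."""
    for line in split:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(splits):
    pairs = []
    # On a real cluster each split would be mapped on the node that stores it;
    # here the splits are simply processed one after another.
    for split in splits:
        pairs.extend(map_phase(split))
    return reduce_phase(shuffle(pairs))


# Two hypothetical input splits, standing in for data held on two nodes.
splits = [["big data big"], ["data tools"]]
counts = word_count(splits)
print(counts)  # {'big': 2, 'data': 2, 'tools': 1}
```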

Apache Hive is an open source Hadoop application for data warehousing. It offers a simple way to apply structure to large amounts of unstructured data, and then perform batch SQL-like queries on that data.

Apache Pig: Many users who are new to Hadoop find that the MapReduce framework has a steep learning curve. Apache Pig helps these users by offering a simpler alternative for transforming and analyzing large data sets. Users write scripts in a high-level language called Pig Latin, which Pig translates into MapReduce jobs that run on a Hadoop cluster.

Apache Spark is a general-purpose execution engine for Hadoop that runs jobs expressed as directed acyclic graphs (DAGs) of operations, allowing users to analyze large data sets with very high performance. One common use case for Spark is executing MapReduce-style job graphs to achieve high-performance batch processing in Hadoop.

Interactive SQL

Apache Drill is a distributed system for interactive analysis of large-scale datasets. Drill is similar to Google’s Dremel, with the additional flexibility needed to support a broader range of query languages, data sources and data formats, including nested, self-describing data.

Impala is an open source, interactive SQL engine for Hadoop. With Impala, you can use business intelligence (BI) tools to run ad-hoc queries directly on the data in a cluster, stored either in unstructured flat files in the file system, or in structured HBase tables. Compared to Hive, which is optimized for long-running batch queries at scale, Impala is optimized for interactive queries on smaller data sets where users expect responses in seconds.

NoSQL & Search

Apache HBase is a distributed, column-oriented NoSQL database that runs on a Hadoop cluster. Clients can access HBase data through a native Java API or through a Thrift or REST gateway, making it accessible from any language.

Apache Solr: Solr is a full-text search and indexing engine that enables large-scale search, navigation, and analytics on textual data. It can run within the MapR Converged Data Platform to provide information retrieval capabilities that are familiar to all Internet users.

Graph

GraphX is a graph library that runs on top of Apache Spark. Developers can use the languages and tools they already use with Spark to implement new types of algorithms that require modeling the relationships between objects.

Machine Learning

Apache Mahout is a powerful, scalable, machine-learning library that runs on top of Hadoop MapReduce. Machine learning is a discipline of artificial intelligence that enables systems to learn based on data alone, continuously improving performance as more data is processed. Machine learning is the basis for many technologies that are part of our everyday lives.

MLlib/MLBase: MLlib is a machine learning library that runs on top of Apache Spark.

Streaming

Spark Streaming: When Hadoop first emerged, it provided a platform to store petabytes of data and to perform batch queries on that data to gather insights. This model works well for many use cases, such as analyzing vast amounts of customer data for interesting patterns. However, not all data can wait for a batch query to be performed. Spark Streaming addresses this need by processing live data streams in small batches as the data arrives.

Data Tools

HttpFS is one of several tools available to interact with the MapR distributed file system. Some differentiating features of HttpFS include programmatic access, version independence, and remote access.
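
As an illustration of programmatic, remote access, the sketch below builds the kind of REST URL that HttpFS serves through its WebHDFS-compatible interface. The host name, user, and path are hypothetical, port 14000 is assumed to be the default HttpFS port, and the helper function is made up for this example; check your cluster's configuration before relying on any of these values.

```python
from urllib.parse import quote, urlencode


def httpfs_url(host, path, op, user, port=14000, **params):
    """Build a WebHDFS-style URL for an HttpFS gateway (hypothetical helper)."""
    query = urlencode({"op": op, "user.name": user, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical gateway host, directory, and user.
url = httpfs_url("httpfs.example.com", "/user/alice/logs", "LISTSTATUS", "alice")
print(url)
# http://httpfs.example.com:14000/webhdfs/v1/user/alice/logs?op=LISTSTATUS&user.name=alice
```

Because the interface is plain HTTP, any HTTP client can issue the request from outside the cluster.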

Apache Sqoop: Hadoop users often want to perform analysis of data across multiple sources and formats, and a common source is a relational database or data warehouse. Sqoop allows users to efficiently move structured data from these sources into Hadoop for analysis and correlation with other data types, such as semi-structured and unstructured data stored in the distributed file system. Once analysis has been completed, Sqoop can be used to push any resulting structured data back into a database or data warehouse so it is available for operational use.

Apache Flume is a distributed and reliable system for efficiently collecting, aggregating, and moving large amounts of log or event data from many sources to a centralized data store such as the MapR Converged Data Platform.

Coordination

Apache Oozie is a valuable tool for Hadoop users to automate commonly performed tasks in order to save time and prevent user error. With Oozie, users can describe workflows to be performed on a Hadoop cluster, schedule those workflows to execute under a specified condition, and even combine multiple workflows and schedules together into a package to manage their full lifecycle.

ZooKeeper: In any distributed cluster, it is important that all nodes be able to share configuration and state data in a reliable way. Hadoop relies on ZooKeeper to keep each of its distributed processes, including MapReduce and HBase, consistent across the cluster. ZooKeeper nodes store a shared hierarchical name space of data registers in RAM, allowing clients to access it with high throughput and low latency. Hadoop clusters should be provisioned with an odd number of ZooKeeper nodes, typically either 3 or 5, to provide high availability and maintain a quorum.
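
The advice to use an odd number of nodes follows from majority-quorum arithmetic, which the short Python sketch below works through. The function names are illustrative only.

```python
def quorum_size(nodes):
    """Smallest strict majority of an ensemble with `nodes` members."""
    return nodes // 2 + 1


def tolerated_failures(nodes):
    """How many members can fail while a majority is still reachable."""
    return nodes - quorum_size(nodes)


for n in (3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
# 3 2 1
# 4 3 1
# 5 3 2
```

A fourth node raises the quorum size without allowing any additional failures, which is why ensembles of 3 or 5 are the usual choices.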

Apache Myriad is an open source Hadoop project that lets YARN applications run side by side with Apache Mesos frameworks. It does this by registering YARN as a Mesos framework, requesting Mesos resources on which to launch YARN applications. This allows YARN applications to run on top of a Mesos cluster without any modification.

GUI, Configuration, Monitoring

Hue (Hadoop User Experience) offers a web GUI to Hadoop users to simplify the process of creating, maintaining, and running many types of Hadoop jobs. Hue is made up of several applications that interact with Hadoop components, and has an open SDK to allow new applications to be created.

Administrator

When applications go from idea to reality, MapR provides the only production-ready platform for Hadoop, Spark and related technologies.

Enterprise Architect

The design of the patented MapR Converged Data Platform speaks directly to Enterprise Architects who know best that architecture matters.

Developer

MapR provides developers the widest variety of popular open source projects for developing data applications.

