MapR packages a broad set of Apache open source ecosystem projects that enable batch, interactive, and real-time applications. The distribution is complete, pre-tested, pre-integrated, and hardened. It includes Hive, Pig, Apache HBase™, Oozie™, Sqoop™, Flume, and Mahout, along with a large body of engineering work that moves Hadoop forward. An advanced management console ties the data platform and the projects together so that the entire system can be monitored and managed.


The diagram above illustrates all the components of the MapR Distribution for Apache Hadoop.

The sections below describe each project in the distribution.

   
CORE HADOOP

Core Hadoop

Apache Hadoop™ was born out of a need to process an avalanche of Big Data. The web was generating more and more information every day, and it was becoming very difficult to index over one billion pages of content. Hadoop has moved far beyond its beginnings in web indexing and is now used in many industries for a wide variety of tasks that share a common theme: high variety, volume, and velocity of data, both structured and unstructured.


YARN


YARN (Yet Another Resource Negotiator) is a core component of Hadoop, managing access to all resources in a cluster. Before YARN, jobs were forced to go through the MapReduce framework, which is designed for long-running batch operations. Now, YARN brokers access to cluster compute resources on behalf of multiple applications, using selectable criteria such as fairness or capacity, allowing for a more general-purpose experience.
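To make the idea of a fairness criterion concrete, here is a minimal sketch of max-min fair sharing in plain Python. It is not YARN's scheduler code or API: the function name, the application names, and the memory figures are all hypothetical, and a real scheduler also accounts for queues, priorities, and multiple resource types.

```python
def max_min_fair_share(capacity, demands):
    """Split `capacity` across apps so no app gets more than it asked for
    and leftover capacity is shared evenly among apps that still want more.

    Illustrative only: this is a textbook max-min fairness loop, not the
    implementation used by any YARN scheduler.
    """
    allocation = {app: 0.0 for app in demands}
    remaining = float(capacity)
    unsatisfied = {app: float(d) for app, d in demands.items() if d > 0}

    while unsatisfied and remaining > 1e-9:
        share = remaining / len(unsatisfied)  # equal split of what is left
        for app in list(unsatisfied):
            grant = min(share, unsatisfied[app])
            allocation[app] += grant
            remaining -= grant
            unsatisfied[app] -= grant
            if unsatisfied[app] <= 1e-9:      # demand fully met
                del unsatisfied[app]
    return allocation


# Hypothetical cluster with 100 GB of memory and three competing applications.
shares = max_min_fair_share(100, {"etl": 20, "sql": 50, "ml": 80})
for app, gb in shares.items():
    print(f"{app}: {gb:.1f} GB")
```

The small application receives everything it asked for, and the two larger ones split the remainder evenly. A capacity-based criterion would instead reserve a fixed fraction of the cluster for each group of applications.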


BATCH

MapReduce

Apache MapReduce is a powerful framework for processing large, distributed sets of structured or unstructured data on a Hadoop cluster. The key feature of MapReduce is its ability to perform processing across an entire cluster of nodes, with each node processing its local data. This feature makes MapReduce orders of magnitude faster than legacy methods of processing big data, which often consisted of a single node accessing and processing data located in remote SAN or NAS devices.
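The processing model can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce steps, not Hadoop's Java API: the function names and the sample data are assumptions, and each inner list merely stands in for the data held locally by one node.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in one node's local lines."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(mapped_outputs):
    """Group the emitted values by key across every node's map output."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical data: each inner list represents lines stored on one node.
node_local_data = [
    ["big data big cluster"],
    ["data moves to the cluster"],
    ["the cluster processes local data"],
]

# Each "node" maps only its own data; only the small (word, 1) pairs move.
mapped = [map_phase(lines) for lines in node_local_data]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

In a real cluster the map calls run in parallel on the nodes that hold the data, and the shuffle step moves intermediate pairs over the network to the reducers.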



Hive

Apache Hive is an open source Hadoop application for data warehousing. It offers a simple way to apply structure to large amounts of unstructured data, and then perform batch SQL-like queries on that data.
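The pattern of applying a table structure to raw text and then querying it with SQL can be illustrated with Python's built-in sqlite3 module. SQLite stands in for Hive here purely for illustration; the table name, column names, and log lines are hypothetical, and Hive's own query language and storage model differ in detail.

```python
import sqlite3

# Hypothetical raw log lines with no declared structure.
raw_lines = [
    "2014-03-01 alice /home 200",
    "2014-03-01 bob /cart 500",
    "2014-03-02 alice /cart 200",
    "2014-03-02 carol /home 200",
]

conn = sqlite3.connect(":memory:")

# Apply a structure: name the fields and give them types.
conn.execute(
    "CREATE TABLE page_views (day TEXT, visitor TEXT, path TEXT, status INTEGER)"
)
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?, ?)",
    [line.split() for line in raw_lines],
)

# Run SQL queries against the now-structured data.
views_per_path = dict(
    conn.execute("SELECT path, COUNT(*) FROM page_views GROUP BY path")
)
error_count = conn.execute(
    "SELECT COUNT(*) FROM page_views WHERE status >= 500"
).fetchone()[0]

print(views_per_path, error_count)
conn.close()
```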



Tez


Tez is a generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases.
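As a rough illustration of what executing a DAG of tasks means, the sketch below runs a small task graph in dependency order using Python's standard graphlib module (Python 3.9 or later). It is not the Tez API: `run_dag`, the task names, and the sample data are all hypothetical.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, dependencies):
    """Run every task after all of its upstream tasks have finished.

    `dependencies` maps a task name to the list of task names it consumes;
    each task receives its upstream results as positional arguments.
    """
    results = {}
    for name in TopologicalSorter(dependencies).static_order():
        upstream = [results[dep] for dep in dependencies.get(name, [])]
        results[name] = tasks[name](*upstream)
    return results


def total_by_user(orders):
    """Sum order amounts per user id."""
    totals = {}
    for user, amount in orders:
        totals[user] = totals.get(user, 0) + amount
    return totals


# Hypothetical tasks: two independent loads, an aggregation, and a join.
tasks = {
    "load_orders": lambda: [("a", 10), ("b", 5), ("a", 7)],
    "load_users": lambda: {"a": "Ann", "b": "Bob"},
    "total_by_user": total_by_user,
    "join": lambda totals, users: {users[u]: t for u, t in totals.items()},
}
dependencies = {
    "load_orders": [],
    "load_users": [],
    "total_by_user": ["load_orders"],
    "join": ["total_by_user", "load_users"],
}

results = run_dag(tasks, dependencies)
print(results["join"])
```

Because the two load tasks do not depend on each other, an engine is free to run them in parallel; this sketch runs them one after another for simplicity.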


Pig

Many users who are new to Hadoop find that the MapReduce framework has a steep learning curve. Apache Pig helps these users by offering a simpler alternative for transforming and analyzing large data sets. Users write scripts in a high-level language called Pig Latin, which Pig translates into MapReduce jobs that run on a Hadoop cluster.



Cascading

Cascading is a data processing API and processing query planner used for defining, sharing, and executing data processing workflows. On a distributed computing cluster using the Apache Hadoop platform, Cascading adds an abstraction layer over the Hadoop API, greatly simplifying Hadoop application development, job creation, and job scheduling.



Spark

Apache Spark is a general-purpose graph execution engine for Hadoop that allows users to analyze large data sets with very high performance. One common use case for Spark is executing MapReduce-style graphs, achieving high performance batch processing in Hadoop.


INTERACTIVE SQL

Impala


Impala is an open source, interactive SQL engine for Hadoop. With Impala, you can use business intelligence (BI) tools to run ad-hoc queries directly on the data in a cluster, stored either in unstructured flat files in the file system, or in structured HBase tables. Compared to Hive, which is optimized for long-running batch queries at scale, Impala is optimized for interactive queries on smaller data sets where users expect responses in seconds.



Drill

Apache Drill is a distributed system for interactive analysis of large-scale datasets. Drill is similar to Google’s Dremel, with the additional flexibility needed to support a broader range of query languages, data sources and data formats, including nested, self-describing data.


NOSQL & SEARCH

HBase

Apache HBase is a non-relational, column-oriented database that runs on a Hadoop cluster. Clients can access HBase data through a native Java API or through a Thrift or REST gateway, which makes it accessible from any language.



Search


Integrating Search provides a single platform to perform predictive analytics, full search and discovery, as well as advanced database operations. The MapR Distribution for Hadoop now includes LucidWorks Search.


GRAPH

GraphX


GraphX is a graph library that runs on top of Apache Spark. Developers can use the languages and tools they already know from Spark to implement new types of algorithms that require modeling the relationships between objects.


MACHINE LEARNING

Mahout

Apache Mahout is a powerful, scalable machine-learning library that runs on top of Hadoop MapReduce. Machine learning is a discipline of artificial intelligence that enables systems to learn based on data alone, continuously improving performance as more data is processed. Machine learning is the basis for many technologies that are part of our everyday lives.



MLlib/MLBase


MLlib is a machine learning library that runs on top of Apache Spark. Machine learning is a discipline of artificial intelligence that enables systems to learn based on data alone, continuously improving performance as more data is processed. Machine learning is the basis for many technologies that are part of our everyday lives.


STREAMING

Spark Streaming

When Hadoop first emerged, it provided a platform to store petabytes of data, and perform batch queries on that data to gather insights. This model works well for many use cases, like analyzing vast amounts of customer data for interesting patterns. However, not all data can wait for a batch query to be performed.


DATA TOOLS

HttpFS


HttpFS is one of several tools available to interact with the MapR distributed file system. Some differentiating features of HttpFS include programmatic access, version independence, and remote access.



Sqoop

Hadoop users often want to perform analysis of data across multiple sources and formats, and a common source is a relational database or data warehouse. Sqoop allows users to efficiently move structured data from these sources into Hadoop for analysis and correlation with other data types, such as semi-structured and unstructured data stored in the distributed file system. Once analysis has been completed, Sqoop can be used to push any resulting structured data back into a database or data warehouse so it is available for operational use.



Flume

Apache Flume is a distributed and reliable system for efficiently collecting, aggregating, and moving large amounts of log or event data from many sources to a centralized data store such as the MapR Data Platform.


COORDINATION

Oozie

Apache Oozie is a valuable tool for Hadoop users who want to automate commonly performed tasks, saving time and reducing user error. With Oozie, users can describe workflows to be performed on a Hadoop cluster, schedule those workflows to execute under a specified condition, and even combine multiple workflows and schedules into a package to manage their full lifecycle.



ZooKeeper

In any distributed cluster, it is important that all nodes be able to share configuration and state data in a reliable way. Hadoop relies on ZooKeeper to keep each of its distributed processes, including MapReduce and HBase, consistent across the cluster. ZooKeeper nodes store a shared hierarchical namespace of data registers in RAM, which lets clients access that data with high throughput and low latency. Hadoop clusters should be provisioned with an odd number of ZooKeeper nodes, typically 3 or 5, so that a majority quorum survives the loss of one or two nodes and the service stays highly available.
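The preference for an odd number of nodes follows from majority-quorum arithmetic, which the short sketch below works through. The function names are hypothetical helpers for this illustration, not part of any ZooKeeper API.

```python
def quorum_size(ensemble_size):
    """Smallest number of nodes that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many nodes can fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


# Compare odd and even ensemble sizes side by side.
for nodes in range(3, 7):
    print(
        f"{nodes} nodes: quorum = {quorum_size(nodes)}, "
        f"tolerates {tolerated_failures(nodes)} failure(s)"
    )
```

Going from 3 to 4 nodes, or from 5 to 6, raises the quorum size without increasing the number of failures the ensemble can tolerate, so the extra even-numbered node adds cost without adding availability.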


GUI, CONFIGURATION, MONITORING

Hue

Hue (Hadoop User Experience) offers a web GUI to Hadoop users to simplify the process of creating, maintaining, and running many types of Hadoop jobs. Hue is made up of several applications that interact with Hadoop components, and has an open SDK to allow new applications to be created.



Whirr

Apache Whirr is a set of libraries for running cloud services. Whirr provides a cloud-neutral way to run services so you don't have to worry about the idiosyncrasies of each provider. It also includes a common service API, although the details of provisioning are particular to each service. Whirr features smart defaults for services so that you can get a properly configured system running quickly, while still being able to override settings as needed. Whirr can also be used as a command line tool for deploying clusters.

