Apache Hadoop was born out of a need to process an avalanche of big data. The web was generating more and more information on a daily basis, and it was becoming very difficult to index over one billion pages of content. Hadoop has moved far beyond its beginnings in web indexing and is now used in many industries for a large variety of tasks that all share the common theme of high volume, velocity, and variety of data—both structured and unstructured.
YARN (Yet Another Resource Negotiator) is a core component of Hadoop that manages access to all resources in a cluster. Before YARN, jobs were forced to go through the MapReduce framework, which is designed for long-running batch operations. Now, YARN brokers access to cluster compute resources on behalf of multiple applications, using selectable criteria such as fairness or capacity, allowing for a more general-purpose experience.
Apache Hadoop MapReduce is a powerful framework for processing large, distributed sets of structured or unstructured data on a Hadoop cluster. The key feature of MapReduce is its ability to perform processing across an entire cluster of nodes, with each node processing its local data. This feature makes MapReduce orders of magnitude faster than legacy methods of processing big data, which often consisted of a single node accessing and processing data located in remote SAN or NAS devices.
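The classic illustration of the model is word count. The sketch below simulates the three phases of a MapReduce job — map, shuffle, and reduce — in plain Python, with no cluster required; on a real cluster, the map and reduce functions run in parallel on the nodes that hold the data.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the input split."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts for one word."""
    return (key, sum(values))

documents = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts["the"] == 3, counts["fox"] == 2, counts["dog"] == 1
```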
Apache Hive is an open source Hadoop application for data warehousing. It offers a simple way to apply structure to large amounts of unstructured data, and then perform batch queries on that data using a SQL-like language called HiveQL.
Apache Pig: Many users who are new to Hadoop find that the MapReduce framework has a steep learning curve. Apache Pig helps these users by offering a simpler alternative for transforming and analyzing large data sets. Users write scripts in a high-level language called Pig Latin, which Pig translates into MapReduce jobs that run on a Hadoop cluster.
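To see the kind of work Pig saves, consider a hypothetical Pig Latin script (shown in the comments below, with illustrative file and relation names) that groups log records and counts rows per key — the same computation is written out directly in Python for comparison:

```python
# A hypothetical Pig Latin script (names are illustrative):
#   logs    = LOAD 'visits.log' AS (user, url);
#   by_user = GROUP logs BY user;
#   counts  = FOREACH by_user GENERATE group, COUNT(logs);
#
# The equivalent computation expressed directly in Python:
from collections import Counter

logs = [("alice", "/home"), ("bob", "/cart"), ("alice", "/checkout")]
counts = Counter(user for user, url in logs)
# counts == {"alice": 2, "bob": 1}
```

Pig compiles the three-line script above into one or more MapReduce jobs, so the user never writes mappers or reducers by hand.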
Apache Spark is a general-purpose execution engine for Hadoop that allows users to analyze large data sets with very high performance. Spark models a computation as a directed acyclic graph (DAG) of operations, which lets it run MapReduce-style workloads while achieving high-performance batch processing in Hadoop.
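Part of that performance comes from building the whole operator pipeline lazily and only executing it when a result is demanded. The pure-Python sketch below mimics the idea with chained generators — a conceptual analogy, not the Spark API; in real PySpark these would be RDD transformations such as `map` and `filter` followed by an action:

```python
def spark_like_pipeline(numbers):
    # "Transformations" build a lazy pipeline; nothing executes yet.
    squared = (n * n for n in numbers)          # like rdd.map(lambda n: n * n)
    evens = (n for n in squared if n % 2 == 0)  # like rdd.filter(...)
    # The "action" (sum) finally pulls data through the whole
    # pipeline in a single pass.
    return sum(evens)

result = spark_like_pipeline(range(1, 6))  # squares 1 4 9 16 25 -> evens 4 16
# result == 20
```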
Apache Drill is a distributed system for interactive analysis of large-scale datasets. Drill is similar to Google’s Dremel, with the additional flexibility needed to support a broader range of query languages, data sources and data formats, including nested, self-describing data.
NoSQL & Search
Apache HBase is a NoSQL column-oriented database that runs on a Hadoop cluster. Clients can access HBase data through either a native Java API, or through a Thrift or REST gateway, making it accessible by any language.
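Logically, HBase stores data as a sparse, sorted map: row key → column family → column qualifier → value. The Python sketch below models that structure with nested dicts purely for illustration; real applications use the Java API or a Thrift/REST gateway, and the row keys and families here are hypothetical:

```python
# HBase's logical data model as nested dicts:
#   table[row_key][column_family][qualifier] = value
table = {
    "user#1001": {
        "info": {"name": "alice", "city": "berlin"},
        "activity": {"last_login": "2016-01-15"},
    },
    "user#1002": {
        "info": {"name": "bob"},  # sparse: absent qualifiers cost nothing
    },
}

def get_cell(table, row, family, qualifier):
    """Point lookup by row key, the access pattern HBase is built for."""
    return table.get(row, {}).get(family, {}).get(qualifier)

# get_cell(table, "user#1001", "info", "name") == "alice"
```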
GraphX is a graph library that runs on top of Apache Spark. Developers can use the languages and tools they already use with Spark to implement new types of algorithms that require modeling the relationships between objects.
Apache Mahout is a powerful, scalable, machine-learning library that runs on top of Hadoop MapReduce. Machine learning is a discipline of artificial intelligence that enables systems to learn based on data alone, continuously improving performance as more data is processed. Machine learning is the basis for many technologies that are part of our everyday lives.
MLlib/MLBase: MLlib is a scalable machine learning library that runs on top of Apache Spark.
Spark Streaming: When Hadoop first emerged, it provided a platform to store petabytes of data, and perform batch queries on that data to gather insights. This model works well for many use cases, like analyzing vast amounts of customer data for interesting patterns. However, not all data can wait for a batch query to be performed. Spark Streaming addresses this by processing live data streams as a continuous series of small batches, delivering results in near real time.
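The core idea — cut an unbounded stream into small batches and run each through the ordinary batch engine — can be shown with a toy micro-batching loop in plain Python (a conceptual sketch, not the Spark Streaming API):

```python
def micro_batches(events, batch_size):
    """Slice an event stream into fixed-size micro-batches."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# Each batch is processed with ordinary batch logic as it arrives:
stream = [5, 3, 8, 1, 9, 2, 7]
batch_totals = [sum(b) for b in micro_batches(stream, 3)]
# batch_totals == [16, 12, 7]
```

Real Spark Streaming cuts batches by time interval rather than count, but the processing model is the same.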
HttpFS is one of several tools available to interact with the MapR distributed file system. It exposes the file system over a REST HTTP API, so clients can read and write files programmatically, from remote hosts, and independently of the Hadoop client version installed locally.
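HttpFS speaks the WebHDFS REST protocol over plain HTTP (port 14000 by default), so any HTTP client can use it. The helper below just builds a WebHDFS-style request URL; the host name and user are hypothetical, and actually issuing the request would require a running gateway:

```python
from urllib.parse import urlencode

def httpfs_url(host, path, op, user, port=14000):
    """Build a WebHDFS-style URL as accepted by HttpFS, e.g. op=OPEN to
    read a file or op=LISTSTATUS to list a directory."""
    query = urlencode({"op": op, "user.name": user})
    return "http://{}:{}/webhdfs/v1{}?{}".format(host, port, path, query)

url = httpfs_url("gateway.example.com", "/data/logs/app.log", "OPEN", "mapr")
# url == "http://gateway.example.com:14000/webhdfs/v1/data/logs/app.log?op=OPEN&user.name=mapr"
```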
Apache Sqoop: Hadoop users often want to perform analysis of data across multiple sources and formats, and a common source is a relational database or data warehouse. Sqoop allows users to efficiently move structured data from these sources into Hadoop for analysis and correlation with other data types, such as semi-structured and unstructured data stored in the distributed file system. Once analysis has been completed, Sqoop can be used to push any resulting structured data back into a database or data warehouse so it is available for operational use.
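In practice this round trip is a pair of commands like the following sketch, where the database URL, credentials, table names, and paths are all hypothetical:

```shell
# Pull a relational table into the distributed file system for analysis:
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username analyst -P \
  --table orders \
  --target-dir /data/sales/orders

# After processing, push the results back out for operational use:
sqoop export \
  --connect jdbc:mysql://db.example.com/sales \
  --username analyst -P \
  --table order_summaries \
  --export-dir /data/sales/summaries
```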
Apache Flume is a distributed and reliable system for efficiently collecting, aggregating, and moving large amounts of log or event data from many sources to a centralized data store like the MapR Data Platform.
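A Flume agent is wired together in a properties file that names its sources, channels, and sinks. The minimal example below (agent and component names are arbitrary) reads lines from a netcat source, buffers them in an in-memory channel, and writes them to a logger sink:

```
# flume-conf.properties: one agent "a1" with one source, channel, and sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

# Wire the pieces together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Production pipelines swap in durable components (e.g., a file channel and an HDFS sink) using the same wiring pattern.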
Apache Oozie is a valuable tool for Hadoop users to automate commonly performed tasks in order to save time and prevent user error. With Oozie, users can describe workflows to be performed on a Hadoop cluster, schedule those workflows to execute under a specified condition, and even combine multiple workflows and schedules together into a package to manage their full lifecycle.
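A workflow is declared in XML as a graph of actions with explicit success and failure transitions; a coordinator can then run it on a schedule. A minimal sketch of a workflow definition — the workflow name and the `${jobTracker}`/`${nameNode}` placeholders are illustrative:

```xml
<workflow-app name="daily-etl" xmlns="uri:oozie:workflow:0.4">
  <start to="run-mapreduce"/>
  <action name="run-mapreduce">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- job configuration (mapper, reducer, input/output paths) goes here -->
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>ETL failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
```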
ZooKeeper: In any distributed cluster, it is important that all nodes be able to share configuration and state data in a reliable way. Hadoop relies on ZooKeeper to keep each of its distributed processes, including MapReduce and HBase, consistent across the cluster. ZooKeeper nodes store a shared hierarchical name space of data registers in RAM, allowing clients to access it with high throughput and low latency. Hadoop clusters should be provisioned with an odd number of ZooKeeper nodes, typically either 3 or 5, to provide high availability and maintain a quorum.
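The odd-number recommendation follows from majority quorums: the ensemble stays available only while a strict majority of its nodes can communicate, so adding a fourth node to a three-node ensemble raises the quorum size without improving fault tolerance. A quick check of the arithmetic:

```python
def quorum_size(ensemble):
    """Smallest strict majority of an ensemble of the given size."""
    return ensemble // 2 + 1

def tolerated_failures(ensemble):
    """How many nodes can fail while a majority (quorum) survives."""
    return ensemble - quorum_size(ensemble)

for n in (3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
# 3 nodes -> quorum 2, tolerates 1 failure
# 4 nodes -> quorum 3, still tolerates only 1 (no gain over 3 nodes)
# 5 nodes -> quorum 3, tolerates 2 failures
```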
Apache Myriad is an open source Hadoop project that lets YARN applications run side by side with Apache Mesos frameworks. It does this by registering YARN as a Mesos framework, requesting Mesos resources on which to launch YARN applications. This allows YARN applications to run on top of a Mesos cluster without any modification.
GUI, Configuration, Monitoring
Hue (Hadoop User Experience) offers a web GUI to Hadoop users to simplify the process of creating, maintaining, and running many types of Hadoop jobs. Hue is made up of several applications that interact with Hadoop components, and has an open SDK to allow new applications to be created.