Apache Hadoop™ was born out of a need to process an avalanche of Big Data. The web was generating more and more information on a daily basis, and it was becoming very difficult to index over one billion pages of content. In order to cope, Google invented a new style of data processing known as MapReduce. A year after Google published a white paper describing the MapReduce framework, Doug Cutting and Mike Cafarella, inspired by the white paper, created Hadoop to apply these concepts to an open-source software framework to support distribution for the Nutch search engine project. Given the original case, Hadoop was designed with a simple write-once storage infrastructure.
Hadoop has moved far beyond its beginnings in web indexing and is now used in many industries for a huge variety of tasks that all share the common theme of lots of variety, volume and velocity of data – both structured and unstructured. It is now widely used across industries, including finance, media and entertainment, government, healthcare, information services, retail, and other industries with Big Data requirements but the limitations of the original storage infrastructure remain.
Hadoop is increasingly becoming the go-to framework for large-scale, data-intensive deployments. Hadoop is built to process large amounts of data from terabytes to petabytes and beyond. With this much data, it’s unlikely that it would fit on a single computer's hard drive, much less in memory. The beauty of Hadoop is that it is designed to efficiently process huge amounts of data by connecting many commodity computers together to work in parallel. Using the MapReduce model, Hadoop can take a query over a dataset, divide it, and run it in parallel over multiple nodes. Distributing the computation solves the problem of having data that’s too large to fit onto a single machine.
Apache Hadoop includes a Distributed File System (HDFS), which breaks up input data and stores data on the compute nodes. This makes it possible for data to be processed in parallel using all of the machines in the cluster. The Apache Hadoop Distributed File System is written in Java and runs on different operating systems.
Hadoop was designed from the beginning to accommodate multiple file system implementations and there are a number available. HDFS and the S3 file system are probably the most widely used, but many others are available, including the MapR file system.
How is Hadoop Different from Past Techniques?
Hadoop can handle data in a very fluid way. Hadoop is more than just a faster, cheaper database and analytics tool. Unlike databases, Hadoop doesn’t insist that you structure your data. Data may be unstructured and schemaless. Users can dump their data into the framework without needing to reformat it. By contrast, relational databases require that data be structured and schemas be defined before storing the data.
Hadoop has a simplified programming model. Hadoop’s simplified programming model allows users to quickly write and test distributed systems. Performing computation on large volumes of data has been done before, usually in a distributed setting but writing distributed systems is notoriously hard. By trading away some programming flexibility, Hadoop makes it much easier to write distributed programs.
Hadoop is easy to administer. Alternative high performance computing (HPC) systems allow programs to run on large collections of computers, but they typically require rigid program configuration and generally require that data be stored on a separate storage area network (SAN) system. Schedulers on HPC clusters require careful administration and since program execution is sensitive to node failure, administration of a Hadoop cluster is much easier.
Hadoop invisibly handles job control issues such as node failure. If a node fails, Hadoop makes sure the computations are run on other nodes and that data stored on that node are recovered from other nodes.
Hadoop is agile. Relational databases are good at storing and processing data sets with predefined and rigid data models. For unstructured data, relational databases lack the agility and scalability that is needed. Apache Hadoop makes it possible to cheaply process and analyze huge amounts of both structured and unstructured data together, and to process data without defining all structure ahead of time.
Why use Apache Hadoop?
It’s cost effective. Apache Hadoop controls costs by storing data more affordably per terabyte than other platforms. Instead of thousands to tens of thousands per terabyte Hadoop delivers compute and storage for hundreds of dollars per terabyte.
It’s fault-tolerant. Fault tolerance is one of the most important advantages of using Hadoop. Even if individual nodes experience high rates of failure when running jobs on a large cluster, data is replicated across a cluster so that it can be recovered easily in the face of disk, node or rack failures.
It’s flexible. The flexible way that data is stored in Apache Hadoop is one of its biggest assets – enabling businesses to generate value from data that was previously considered too expensive to be stored and processed in traditional databases. With Hadoop, you can use all types of data, both structured and unstructured, to extract more meaningful business insights from more of your data.
It’s scalable. Hadoop is a highly scalable storage platform, because it can store and distribute very large data sets across clusters of hundreds of inexpensive servers operating in parallel. The problem with traditional relational database management systems (RDBMS) is that they can’t scale to process massive volumes of data.
Why MapR is the best distribution for Apache Hadoop
MapR’s CTO and Co-founder, M.C. Srivas, was previously at Google and re-architected the MapR platform to address the limitations of Hadoop. With MapR, there are no Java dependencies or reliance on the Linux file system. MapR provides a dynamic read-write data layer that brings unprecedented dependability, ease-of-use, and world-record speed to Hadoop, NoSQL, database and streaming applications in one unified Big Data platform.
MapR is a complete Distribution for Apache Hadoop that combines over a dozen different open source packages from the Hadoop ecosystem along with enterprise-grade features that provide unique capabilities for management, data protection, and business continuity. These include: Apache Hive, Apache Pig, Cascading, Apache HCatalog, Apache HBase™, Apache Oozie, Apache Flume, Apache Sqoop, Apache Mahout, and Apache Whirr. The source code for all of these packages, including MapR’s fixes and enhancements, is publicly available on Github at www.github.com/mapr. In addition, MapR has released the binaries, source code and documentation in a public Maven repository making it easier for developers to develop, build and deploy their Hadoop-based applications.
Among the many distinct advantages of the MapR platform is that data can be ingested as a real-time stream; analysis can be performed directly on the data, and automated responses can be executed. By applying "polyglot persistence", we give you the ability to leverage multiple data storages, depending on your use cases.