CITO Research White Paper: Five Questions to Ask Before Choosing a Hadoop Distribution

Companies everywhere are excited about harnessing big data and putting it to work. Adopting a Hadoop distribution is a critical decision that has far-reaching ramifications for your organization. CITO Research recognizes this in its white paper, “Five Questions to Ask Before Choosing a Hadoop Distribution.”

CITO Research has outlined five key questions to ask before adopting a Hadoop distribution, so that you don’t end up with buyer’s remorse:

1. What does it take to make Hadoop enterprise-ready? Disaster recovery and security are two areas that are critical for making Hadoop part of your enterprise data architecture. Because a Hadoop cluster needs disaster recovery along with replication and high availability, CITO Research recommends that you consider a commercial Hadoop distribution that provides these capabilities as well as snapshots. MapR is one such distribution: it supports disaster recovery, high availability, and consistent point-in-time snapshots, thanks to the built-in advantages of its POSIX-compliant file system with full random read-write capability.

Security is another vital area to consider when looking at the various Hadoop distributions. All distributions have their own implementations of security, but some have distinct advantages. For example, Apache Hadoop offers Kerberos authentication, but many organizations want to integrate with another implementation via Linux Pluggable Authentication Modules (PAM). The MapR Distribution offers a choice of Kerberos integration or a native authentication mechanism that integrates with other security standards so you can enable security-aware applications. And MapR takes a platform-level approach to security to ensure that there are no alternate, exposed access paths to data.

2. Does the distribution offer scalability, reliability, and performance? Hadoop distributions that rely on the Hadoop Distributed File System (HDFS) require NameNodes and depend on a write-once, append-only file system, which limits the scalability, reliability, and performance of big data applications. MapR provides a no-NameNode architecture, which improves reliability and scalability. CITO praises MapR because its architecture ensures that there’s no single point of failure in the platform and no added complexity of managing NameNodes: the metadata that NameNodes would otherwise hold is completely and automatically distributed across every data node in the cluster.

3. Is the distribution efficient when it comes to TCO and ease of administration? MapR requires far less hardware than other Apache Hadoop distributions, resulting in a lower TCO. We’ve developed a TCO calculator for Hadoop to take you through the reasons.

4. How flexible and open is the distribution? CITO Research cites several aspects including Hadoop API support, support for other enterprise standards such as NFS, and the flexibility to run multiple versions of Apache Hadoop ecosystem projects (e.g., Hive 0.12, Hive 0.13, or YARN and MRv1). MapR provides all of the above, giving customers the flexibility to run Apache Hadoop and other open source or commercial engines (e.g., HP Vertica) natively in the MapR Distribution. This gives customers more options and the ability to use the right tool for the job to solve their business problem.

5. What additional workforce expertise does a company need to run Hadoop? CITO praises MapR for “greatly simplifying Hadoop, improving scalability and performance in the process and freeing administrators from many low-level details.”
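The contrast drawn under questions 1 and 2, between a write-once, append-only file system and one with full random read-write capability, can be illustrated with ordinary file I/O. The following is a minimal sketch only: it runs against a local temporary directory, not against HDFS or MapR, and the function names and the record layout are hypothetical, chosen for illustration.

```python
import os
import tempfile
from typing import Tuple


def overwrite_in_place(path: str, offset: int, data: bytes) -> None:
    """Random write: seek to an offset and overwrite bytes where they sit."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)


def append_update(path: str, data: bytes) -> None:
    """Append-only update: the old bytes stay; the new state is added at the end."""
    with open(path, "ab") as f:
        f.write(data)


def demo() -> Tuple[bytes, bytes]:
    """Apply the same logical update (PENDING -> DONE) in both styles."""
    initial = b"status=PENDING;"
    with tempfile.TemporaryDirectory() as tmp:
        random_path = os.path.join(tmp, "random.dat")
        append_path = os.path.join(tmp, "append.dat")
        for path in (random_path, append_path):
            with open(path, "wb") as f:
                f.write(initial)

        # Random read-write: patch the 7-byte value field after "status=".
        overwrite_in_place(random_path, len(b"status="), b"DONE   ")

        # Append-only: the earlier record cannot be changed, so a new one is added.
        append_update(append_path, b"status=DONE;")

        with open(random_path, "rb") as f:
            in_place = f.read()
        with open(append_path, "rb") as f:
            appended = f.read()
    return in_place, appended


if __name__ == "__main__":
    in_place, appended = demo()
    print(in_place)
    print(appended)
```

With in-place writes the file keeps its original size and holds only the current state; with append-only writes the file grows and readers must work out which record is the latest.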

The full CITO Research White Paper “Five Questions to Ask Before Choosing a Hadoop Distribution” can be accessed here.

If you have any questions or comments, please post them in the comments section below.