Despite their common Hadoop roots, not all distributions are the same. When putting together an RFP for Hadoop, pay particular attention to performance, scalability, reliability, manageability, and data access.
Performance and Scalability
You want your Hadoop deployment to run fast. Faster performance not only completes jobs sooner, but also delivers more value from your hardware, which lowers your total cost of ownership.
Scalability ensures that your big data can continue to grow without outpacing your system capacity. Consider the different aspects in which your data will grow, including overall data volume, number of files, and number of database (HBase) records.
RFP questions you should ask prospective Hadoop vendors about performance and scalability include:
- What features and innovations do you provide that deliver performance and scalability advantages?
- What public references demonstrate your distribution’s performance and scalability advantages?
- How have your customers quantitatively benefited from your performance and scalability advantages?
- What specific configuration is required to scale the number of files to 100 million, 500 million, 1 billion, and beyond?
- What hardware configuration (disk density, network bandwidth, etc.) provides the best performance characteristics with your distribution?
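To put the file-count question in perspective, here is a rough back-of-the-envelope sketch. It assumes an architecture in which all file system metadata is held in the memory of a single service (as with the Apache HDFS NameNode) and uses the commonly quoted rule of thumb of about 150 bytes of heap per namespace object. Both the constant and the function name `estimate_metadata_heap_gib` are illustrative assumptions, not vendor figures; ask each vendor for the actual numbers for its distribution.

```python
# Assumption: ~150 bytes of heap per namespace object (file or block).
# This is a widely quoted rule of thumb, not a measured or vendor-supplied value.
BYTES_PER_OBJECT = 150


def estimate_metadata_heap_gib(num_files, blocks_per_file=1.0,
                               bytes_per_object=BYTES_PER_OBJECT):
    """Estimate the heap (in GiB) needed to hold metadata for num_files files.

    Each file is counted as one file object plus blocks_per_file block objects.
    """
    objects = num_files * (1 + blocks_per_file)
    return objects * bytes_per_object / 2**30


if __name__ == "__main__":
    # The file counts named in the RFP question above.
    for n in (100_000_000, 500_000_000, 1_000_000_000):
        print(f"{n:>13,} files -> ~{estimate_metadata_heap_gib(n):.0f} GiB of metadata heap")
```

An estimate like this is only a starting point for the conversation. Distributions that distribute metadata across nodes may scale differently, which is exactly what the question is meant to uncover.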
Reliability
You should expect Hadoop to be subject to the same reliability expectations as every other enterprise software system. Understand the differences between each vendor’s enterprise-grade high availability, data protection, and disaster recovery capabilities:
- High availability (HA) refers to the ability to service users even when confronted with node failures or network partitions.
- Data protection capabilities let you restore specific data elements upon accidental loss or corruption.
- Disaster recovery is about maintaining system continuity through the use of a remote replica despite a site-wide failure in the primary data center.
Questions to ask prospective Hadoop vendors about reliability include:
- How do you ensure HA for all services in your distribution? Describe how failures are handled and what the impact on applications and end users is for each. Specifically, please address NameNode, HBase Master, HBase RegionServer, and NFS Gateway capabilities.
- What other provisions for high availability are included in your distribution?
- How does your distribution enable exact point-in-time recovery of files or directories in the cluster from accidental deletions or corruption due to user or application error?
- How does your distribution support disaster recovery strategies?
- Describe how it handles incremental and differential copies.
- Can individual datasets be copied on independent schedules?
- How can you configure recovery point objectives (RPO) and recovery time objectives (RTO)?
- What are the administrative steps required to recover from a full, unplanned cluster restart due to power loss or other catastrophic condition? How much time is required for the cluster to become fully available for reads and writes after such a scenario?
- How does your solution handle misreporting drives or corruption at the hardware or underlying filesystem level?
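When comparing answers about copy schedules and recovery point objectives, it can help to work through the arithmetic. The sketch below assumes a simple model of scheduled remote copies: a copy starts every `interval`, captures the data as of its start time, and takes `copy_duration` to reach the remote site. Under that assumption, the remote replica can lag the primary by up to one interval plus one copy duration. The function names are hypothetical, and real products may behave differently.

```python
from datetime import timedelta


def worst_case_rpo(interval, copy_duration):
    """Worst-case data-loss window for scheduled remote copies.

    Assumes each copy captures the state at its start time and becomes usable
    only once the transfer finishes. Just before the next copy completes, the
    newest usable replica is therefore interval + copy_duration old.
    """
    return interval + copy_duration


def meets_rpo(interval, copy_duration, target_rpo):
    """Return True if the schedule satisfies the target RPO under this model."""
    return worst_case_rpo(interval, copy_duration) <= target_rpo


if __name__ == "__main__":
    hourly = timedelta(hours=1)
    transfer = timedelta(minutes=10)
    print("Worst-case RPO:", worst_case_rpo(hourly, transfer))
    print("Meets a 1-hour RPO target:", meets_rpo(hourly, transfer, timedelta(hours=1)))
```

In this model an hourly schedule does not by itself guarantee a one-hour RPO, because transfer time counts too. Incremental copies matter here because they shorten the transfer.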
Manageability
Managing big data can be daunting, so you should seek a distribution for Hadoop that lowers the administrative burden. Manageability, including multi-tenancy, is key to success.
Questions to ask prospective Hadoop vendors about manageability include:
- What capabilities unique to your distribution does your Hadoop administration functionality provide?
- What cluster monitoring tools can be integrated with your distribution?
- What other features of your distribution facilitate the management of a deployed cluster?
- How does your management software monitor hardware failures such as disk failures?
- How can you keep distinct user data in a shared cluster while maintaining security and isolation?
Data Access and Ingestion
Hadoop is often used to capture data from many different data sources and systems, so interoperability and security are critical aspects of any Hadoop deployment. Real-time data access is important when deploying Hadoop in an existing enterprise.
Questions to ask your prospective Hadoop vendor about data access include:
- What features or innovations do you provide that promote or facilitate interoperability with other data sources and systems?
- How is data secured in your distribution to ensure users can only access data for which they are authorized?
- What are some production customer proof points that demonstrate your real-time capabilities?
- Can applications directly write data files into the Hadoop file system without staging data for ingestion via a load process?
- How do you support interoperability, especially via industry standard interfaces such as random read-write NFS, REST, etc.?
- How does your solution provide for analytics on continuously ingested real-time data? How can ingestion latency be reduced with your distribution?
- What are the steps to integrate Hadoop with operational NoSQL databases?
Selecting your Hadoop vendor based on these criteria helps ensure that your deployment is about business value and competitive advantage, not about spending excessive time building and maintaining infrastructure.