Set the Bar High for Enterprise Data Hub Requirements

There has been a lot of discussion lately about enterprise data hubs and data lakes. The backdrop driving this discussion is that organizations are struggling with rapidly growing data from multiple sources. Machine-generated log files, sensor data and social media are a few of these fast-growing data sources.
Organizations today run many analytic platforms, each focused on specific data and applications. Enterprises recognize that there is more value in combining these silos of data to open up new use cases. What is required is a central platform that can serve as a collection point for a broad and varied set of data and can accommodate a wide set of use cases: an enterprise data hub.

Hadoop as an enterprise data hub was discussed in depth by Mike Ferguson in his May 2013 paper “Offloading and Accelerating Data Warehouse ETL Processing Using Hadoop.” Mike was a principal and co-founder of Codd and Date Europe Limited (the consultancy of relational model pioneers E.F. Codd and C.J. Date) and was also a Chief Architect at Teradata. In his paper, Mike delineates the requirements for an enterprise data hub and explains why the MapR Distribution for Hadoop is best suited to serve this purpose. The platform capabilities he discusses include full data protection, business continuity and availability features, which form the foundation for cleansing, transforming and integrating structured and multi-structured data from multiple sources.

Mike notes that “MapR's data protection and disaster recovery capabilities make MapR Hadoop distributions suitable for long-term storage of Big Data and data warehouse archived data, which can then be selectively re-processed in specific analyses.”

MapR invested several years of engineering effort to re-architect a data platform for Hadoop so it could support such enterprise-grade capabilities. Other distributions claim enterprise functionality without the underlying platform to support it, and such false expectations set users up for failure at enterprise scale. The facts are:

• Only MapR provides automated stateful failover, disaster recovery through snapshots and mirrors, and full data protection against user and application errors.

• MapR eliminates downtime for HBase applications through instant recovery, and provides consistently low latency with no compactions and no Java garbage collection pauses.

• Even with multiple hardware or software outages and errors, applications continue running without any administrator action required.

• MapR’s distributed, no-NameNode HA architecture provides fast recovery. On a large cluster, MapR can recover from a 1,000-node outage within three minutes. The same recovery would take over 24 hours on other Hadoop distributions.
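To make the snapshot and mirror capabilities above concrete, here is an illustrative sketch using MapR's `maprcli` command-line tool. The volume, snapshot and cluster names are placeholders, and exact options may vary by MapR version; this is a sketch, not a definitive runbook.

```shell
# Take a point-in-time snapshot of a volume, protecting against
# user and application errors (names are illustrative placeholders).
maprcli volume snapshot create -volume projects.analytics \
    -snapshotname nightly-snapshot

# Create a mirror volume for disaster recovery, sourced from the
# primary cluster, then start a mirror synchronization.
maprcli volume create -name projects.analytics.mirror -type mirror \
    -source projects.analytics@primary-cluster
maprcli volume mirror start -name projects.analytics.mirror
```

Snapshots give read-only point-in-time views for recovering from accidental deletes or corrupting jobs, while mirrors replicate a volume to another cluster for business continuity.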

The MapR Distribution for Hadoop is best suited to meet the requirements of an enterprise data hub. MapR’s full data protection, business continuity and disaster recovery features make MapR the best choice for companies that are moving toward an enterprise data hub solution to maintain their competitive advantage.

