Building an Enterprise Data Hub: Choosing the foundational software

Don’t ask your data warehouses to handle tasks they shouldn’t handle. They were never intended to be all-purpose processors of structured, unstructured, and semi-structured information. Dealing with data flung at them constantly from an ever-growing variety of devices and platforms was never in the job description. They’re meant for high-value analysis and strategic extract-load-transform (ELT) processing, but big data is making things hard for them.

So where does this leave you, with your big data stockpile growing by the minute and your mandate to produce profit from all that information? You just might need an enterprise data hub (EDH), which serves as a central platform to collect all kinds of data from multiple sources, process it quickly, and make it available throughout the enterprise. This means the EDH can handle multiple workloads that complement your data warehouse. It can ingest fast, incoming streams of data, process and transform them, and then potentially load the outputs into your data warehouse. Or you can use your EDH as an external transformation engine to handle some of the complex and overly resource-intensive ELT workloads currently running in your data warehouse. Or you can run large-scale, historical analytics on your less frequently used, long-tail data. And then you can let your data warehouses focus on the important jobs they do best. Many organizations are pursuing this strategy of “data warehouse optimization,” which starts with an EDH.
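To make the offloading idea concrete, here is a minimal sketch in plain Python of the pattern described above: raw records arrive, the hub does the heavy cleaning and aggregation, and only the compact result is loaded into the warehouse. It is an illustration only. The function names, the record fields, and the dictionary standing in for the warehouse are all hypothetical, and no Hadoop or MapR API is used.

```python
from collections import defaultdict


def transform(events):
    """Hub-side step: drop malformed records and total amounts per customer."""
    totals = defaultdict(float)
    for event in events:
        # Records missing either field are skipped rather than passed downstream.
        if "customer" not in event or "amount" not in event:
            continue
        totals[event["customer"]] += float(event["amount"])
    return dict(totals)


def load_to_warehouse(warehouse, totals):
    """Warehouse-side step: merge the small aggregated output into existing totals."""
    for customer, total in totals.items():
        warehouse[customer] = warehouse.get(customer, 0.0) + total


# Hypothetical incoming data; the third record is malformed on purpose.
incoming = [
    {"customer": "acme", "amount": "10.5"},
    {"customer": "acme", "amount": 4.5},
    {"device": "sensor-7"},
    {"customer": "globex", "amount": 3},
]

warehouse = {}  # stand-in for a warehouse table (assumption)
load_to_warehouse(warehouse, transform(incoming))
print(warehouse)
```

The point of the split is that the warehouse only ever sees the aggregated rows, so the resource-intensive work stays on the hub.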

In this post, we’ll look at the foundational software that powers your hub. To learn about the underlying hardware that lets your system run efficiently at scale, read Cisco's blog on building an EDH. To aggregate all your data sources into your EDH, read Informatica’s blog on data integration.

Building the Foundation

Considering the range of data types you might load into an EDH, you need a foundational platform that can accommodate a broad variety of data. Not only that, since the rapidly growing data volumes can quickly overwhelm an existing system, you want a deployment that can scale linearly by adding more servers. And since an EDH is intended to complement your business-critical data warehouse, as well as the rest of your enterprise architecture, you need enterprise-grade features like high availability, disaster recovery, and security to help you meet stringent service-level agreements. These are just a few factors that will influence your choice of software to drive an EDH.

Apache™ Hadoop® as the Foundational Software

Apache Hadoop is an increasingly popular choice when you’re looking for a cost-effective, powerful way to handle big data and enable an EDH. It gives enterprises the ability to store and efficiently use massive amounts of structured, unstructured, and semi-structured data across the entire organization. Bringing all of this data together allows for deeper dives into the information and more accurate analysis.

With all the data processing and analysis tools available as part of the large Hadoop ecosystem, you have many choices of technologies to deploy with Hadoop in your EDH. You can choose from technologies for streaming data, machine learning, real-time querying, and even SQL-on-Hadoop. Even veteran software companies have adapted their proven products to work well with Hadoop. As more and more developers and IT professionals get ramped up on Hadoop, it can potentially become as commonplace as data warehouses and relational databases in today’s enterprise architectures.

But let’s not forget to address the enterprise-readiness of your Hadoop solution. Since you can’t expect to complement a business-critical data warehouse with a suboptimal EDH deployment, you need to adhere to your enterprise-grade requirements. This obviously starts with high uptime, business continuity strategies, security measures, and consistently high performance. And what about the less obvious but equally critical requirements?

An important concept with EDHs is multi-tenancy, in which disparate user groups, data sets, jobs, and applications coexist in a single cluster while remaining isolated from each other. Since an EDH aggregates data from numerous sources, it necessarily becomes a hub for disparate user groups and data sets. The alternative to multi-tenancy is to build a separate cluster for each user group, but in practice, organizations have found this to be an extremely high-maintenance operation.
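As a rough illustration of what isolation within a single cluster means, the sketch below models a shared cluster in plain Python, with a per-tenant compute quota and per-tenant data. Everything here is hypothetical: the `SharedCluster` class, its methods, and the tenant names are invented for this example and do not correspond to any Hadoop or MapR interface.

```python
class SharedCluster:
    """Toy model of one cluster shared by several isolated tenants."""

    def __init__(self, total_slots):
        self.total_slots = total_slots
        self.quotas = {}    # tenant -> maximum job slots
        self.in_use = {}    # tenant -> job slots currently taken
        self.datasets = {}  # tenant -> that tenant's private data

    def add_tenant(self, name, slot_quota):
        # Quotas may not add up to more than the cluster actually has.
        if sum(self.quotas.values()) + slot_quota > self.total_slots:
            raise ValueError("quota exceeds cluster capacity")
        self.quotas[name] = slot_quota
        self.in_use[name] = 0
        self.datasets[name] = {}

    def submit_job(self, tenant, slots):
        # A tenant can exhaust its own quota but never another tenant's.
        if self.in_use[tenant] + slots > self.quotas[tenant]:
            return False
        self.in_use[tenant] += slots
        return True

    def write(self, tenant, key, value):
        self.datasets[tenant][key] = value

    def read(self, tenant, owner, key):
        # Data isolation: tenants may only read their own data sets.
        if tenant != owner:
            raise PermissionError(f"{tenant} cannot read data owned by {owner}")
        return self.datasets[owner][key]


cluster = SharedCluster(total_slots=10)
cluster.add_tenant("marketing", slot_quota=6)
cluster.add_tenant("finance", slot_quota=4)

cluster.write("finance", "q3_forecast", [1.2, 1.4])
print(cluster.submit_job("marketing", 6))  # fits marketing's quota
print(cluster.submit_job("marketing", 1))  # rejected: quota already used
print(cluster.submit_job("finance", 4))    # unaffected by marketing's load
```

Both tenants share the same ten slots of hardware, yet a busy marketing team cannot starve finance of compute or read its data, which is the property that makes one shared cluster a workable alternative to many separate ones.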

The MapR Distribution including Apache Hadoop for Your EDH

MapR offers a distribution of Hadoop that is optimized for enterprise use, with full data protection, business continuity, and high availability, as well as the ability to process and integrate structured and multi-structured data from multiple sources into an analysis-ready form. Just as important, MapR supports multi-tenancy through several features, including logical volumes, integrated security, data placement control, and job placement control. Many MapR customers have deployed EDHs with great success, and these features let customers deploy a valuable complement to their formerly overloaded enterprise architectures.

If you haven’t already done so, watch this latest Gartner video and get key considerations for successfully building out an enterprise data hub.

