The MapR Difference

Summary

The buzz around big data is louder than ever. Companies everywhere are reimagining the way they conduct business, placing data, instead of intuition, at the heart of decision-making. The value proposition of becoming a data-driven business is clear, with expectations of increased revenues, bigger margins, and new business opportunities. But not all data platforms are created equal.

The MapR Converged Data Platform combines the flexibility and innovation of the open-source Hadoop ecosystem with an enterprise-grade data platform services layer. The patented MapR data platform services layer exposes a true, POSIX-compliant, read/write, distributed file system that can support existing applications in addition to newer big data applications built for Hadoop. The data layer accesses data directly in secondary storage instead of going through the overhead of Java Virtual Machines, making the MapR Converged Data Platform the only open-API big data system that is robust, versatile, and fast enough for mission-critical business processes.

If you think about your business's IT infrastructure as a building, the data storage layer is like the foundation. If you build a strong foundation, you can construct a bigger, better building on top of it. But if there are cracks in your foundation or if you cut corners, you'll be limited in the type and size of building you can create. In the end, your building is only as good as its foundation. So how is MapR different?

Legacy Enterprise Data Architectures

Consider legacy IT systems, defined by distinct data ingest, cleanse, ETL, staging, analytics, and reporting phases. Once data comes into your organization, it has to go through multiple systems before it can actually be used to make business decisions. That means that there is a long period of time between when you get the data and when it can impact your business. What's worse, the data gets landed and stored redundantly at each phase of the pipeline, meaning that you're storing the same data multiple times. On top of that, each of these phases requires its own distinct set of servers and storage to process the data. This is the reality of the traditional enterprise data warehouse ecosystem. It's a fragmented process, resulting in huge amounts of IT overhead to manage multiple systems, and the data might even be out-of-date by the time it's ready to be used by the final reporting and analytics processes.

So, you decide to modernize your IT infrastructure to make your business more agile, but which big data platform is really best for you? Sure, everyone promises that their platform gives you high performance, high scalability, high availability, and easy integration with existing applications. But let’s take a closer look at what that means.

Apache Hadoop Data Architectures Built on HDFS

Hadoop is the gold standard platform for big data applications. But the traditional Hadoop big data platform is built on the open-source Hadoop Distributed File System (HDFS) data layer. HDFS is a Java-based system for storing data in a distributed way. It was originally designed to support a web crawling application that required only a write-once-read-many file system and where some amount of data loss could be tolerated. Getting existing data into HDFS requires effort, and HDFS does not natively support updates to data once it's written. HDFS uses a simple unified table of metadata that is stored in memory on a single node to keep track of where each file block is written. This means that there is a single point of failure for the entire HDFS cluster, and the amount of data you can store across the whole cluster is ultimately limited by the size of memory on that single node. The result is that an HDFS-based Hadoop cluster cannot scale to accommodate large numbers of files, and worse, HDFS does not have strong guarantees about data integrity.
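To make the memory ceiling concrete, the back-of-the-envelope sketch below estimates how many files a single in-memory metadata node could track. The figure of roughly 150 bytes of heap per file or block object is a commonly quoted rule of thumb and an assumption here, not a number from this paper; the function name, parameters, and the 64 GiB heap are likewise illustrative.

```python
# Rough estimate of how many files one in-memory metadata node can track.
# ASSUMPTION: ~150 bytes of heap per namespace object (file entry or block);
# this is a commonly quoted rule of thumb, not a measured value.

BYTES_PER_OBJECT = 150


def max_files(heap_bytes: int, blocks_per_file: int = 1) -> int:
    """Return the approximate number of files that fit in the given heap.

    Each file costs one object for the file entry plus one object per block.
    """
    if heap_bytes < 0 or blocks_per_file < 0:
        raise ValueError("heap_bytes and blocks_per_file must be non-negative")
    objects_per_file = 1 + blocks_per_file
    return heap_bytes // (BYTES_PER_OBJECT * objects_per_file)


if __name__ == "__main__":
    heap = 64 * 1024**3  # hypothetical 64 GiB heap on the metadata node
    print(f"single-block files: {max_files(heap):,}")
    print(f"ten-block files:    {max_files(heap, blocks_per_file=10):,}")
```

Under these assumptions the ceiling depends on the metadata node's memory and on the number of files and blocks, not on how much disk the cluster has, so adding data nodes does not raise it.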

So, while HDFS works great under the covers for some simple and small reporting applications, it really wasn’t designed to hold up under the pressure of running multiple modern enterprise applications at once.

If you build your big data platform on top of HDFS, you quickly realize that you need distinct systems to ingest data, to handle transactional workloads, and to run analytics and reporting. This is all because HDFS was designed to support a very limited workload that didn’t require a robust, high-performance file system underneath. Trying to force HDFS to support the diverse needs of modern enterprises quickly breaks down. If you rely on HDFS for your big data platform, you need separate redundant servers to provide availability. You also need separate clusters to handle different types of workloads (e.g. operational and analytical) as well as a separate cluster to handle the ingestion of streaming data. You also soon discover that there is a fundamental limit to the amount of data you can store in HDFS and that HDFS is prone to data corruption. What’s worse, from a high level, your IT infrastructure now resembles the fragmented system you were trying to replace. You are stuck with unnecessary IT overhead and the same problem of having to wait for data to move through multiple phases before it’s available to you to inform your business decisions.

The MapR Converged Data Platform: Simple, Fast, and Powerful

Unlike the Hadoop systems sold by other vendors, the MapR Converged Data Platform is the only big data platform built on the MapR patented data platform services layer rather than on HDFS. The MapR data layer, called MapR Platform Services, provides a common set of data services for high availability, disaster recovery, security, and multi-tenancy. Moreover, it exposes converged file system (MapR-FS), database (MapR-DB), and event streaming (MapR Streams) services and interfaces for application developers. MapR-FS is a true POSIX-compliant, distributed file system. It is written in C rather than Java, so it avoids the overhead of the Java Virtual Machine. The metadata for MapR-FS is stored in a distributed way alongside the data itself; this means that MapR-FS supports fast distributed transactions without having to consult a centralized metadata repository, as HDFS does. The MapR distributed transactions ensure data integrity, meaning no data loss or corruption. It also means the file system works the way a file system should: you can read, write, and update any of the data stored in MapR-FS, and the size of data you can store on your MapR cluster is practically unlimited.
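As a minimal sketch of what read/write semantics mean for an existing application, the code below overwrites a few bytes in the middle of a file using only standard file calls. It runs against a temporary local directory as a stand-in; pointing the path at a mounted cluster file system is an assumption about deployment, and update_in_place is a hypothetical helper, not a MapR API.

```python
import os
import tempfile


def update_in_place(path: str, offset: int, data: bytes) -> None:
    """Overwrite len(data) bytes at offset without rewriting the whole file."""
    with open(path, "r+b") as f:  # read/write mode, existing content is kept
        f.seek(offset)
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the change to storage


if __name__ == "__main__":
    # Stand-in directory; a real deployment would use its own mount point.
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "record.dat")
        with open(path, "wb") as f:
            f.write(b"status=PENDING;id=42")
        # Replace the 7-byte status value that starts at byte offset 7.
        update_in_place(path, offset=7, data=b"SHIPPED")
        with open(path, "rb") as f:
            print(f.read())
```

On a write-once file system, the same change would require the application to write out a new copy of the file, which is why unmodified existing applications generally expect the update-in-place behavior shown here.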

As a result, MapR is the only big data platform that can support all your applications without needing separate clusters or needing to store the data redundantly; everything runs in a single cluster. There is no practical limit to the amount of data you can store in MapR, so your system will scale with your business. MapR support for high availability and data integrity is built into the design of the data layer, ensuring business continuity. Perhaps most importantly for today's business requirements, MapR gives you true real-time access to your data as it first enters your organization, meaning that your business processes have immediate access to all of your data and can make decisions on the fly as business is happening. So when considering which big data platform to use in your organization, remember that the foundation matters. The MapR Converged Data Platform was designed differently, to meet the needs of the modern enterprise in a simple, unified, versatile platform.

About the Author

Crystal Valentine is a big data scientist and practitioner. She is currently a tenure-track professor in the Department of Computer Science at Amherst College where she teaches courses on Big Data, Principles of Database Design, and Computational Biology, and is a consultant for MapR. She has had several academic publications in the areas of algorithms, high-performance computing, and computational biology and holds a patent for Extreme Virtual Memory. She is also a consultant for equity investors focused on Tech Sector companies and is the founder and managing partner of Frost Consulting LLC. Previously she spent four years as a consultant at Ab Initio Software working with Fortune 500 companies to design and implement high-throughput, mission-critical enterprise computing applications, and before that, collaborated on research at MIT Lincoln Laboratory on high-performance matrix computations. Dr. Valentine graduated Magna Cum Laude and Phi Beta Kappa with a B.A. in Computer Science from Amherst College and received her doctorate in Computer Science from Brown University. She was a Fulbright Scholar to Italy.

