From IT to IoT–the Genesis of MapR Streams

Over the last 5 years of shipping product we’ve watched our customers get enormous value out of storing and processing big data. The use cases range widely, from performing predictive maintenance on oil rigs to building fraud and risk models on financial transactions. When we stepped back and looked at what these use cases have in common, one thing jumped out–nearly all “big” data is generated one event at a time. There are many examples of event-based data sources, from IT sources like web logs and application metrics to IoT (Internet of Things) sources like smart devices, biometrics, and sensors.

When data is generated one event at a time, companies can get even more value by collecting and processing it in real time. That’s why we built MapR Streams. With MapR Streams, we’re building global, IoT-scale publish-subscribe event streaming directly into our platform–alongside our distributed file system (MapR-FS) and NoSQL database (MapR-DB)–creating the industry’s first Converged Data Platform.

Why is converging all of these services into a single platform important? Let’s look at two of our customers, comScore and Liaison Technologies, who are particularly eager to build breakthrough applications using a mix of real-time and batch analytics, database, and streaming technologies:

  • comScore, a provider of digital media analytics and digital marketing intelligence, processes over 1.8 trillion internet and mobile events every month on MapR. Ad agencies, publishers, marketers and financial analysts rely on comScore for solutions in online audience measurement, ecommerce, advertising, search, video and mobile. To provide analytics to their customers, they run Hadoop jobs whose output is normalized and then loaded into a relational database for analysis and reporting. With MapR Streams, they plan to deliver these insights to their clients closer to real time.
  • Liaison Technologies is another MapR customer that will greatly benefit from real-time event streaming. They provide cloud-based solutions to help organizations integrate, manage and secure data across the enterprise. One vertical solution they provide is for the healthcare and life sciences industry, which comes with two challenges–meeting HIPAA compliance requirements and coping with the proliferation of data formats and representations. With MapR Streams, the data lineage portion of the compliance challenge is solved because the stream becomes a system of record by being an infinite, immutable log of each data change. To illustrate the latter challenge, a patient record may be consumed in different ways–a document representation, a graph representation, or search–by different users, such as pharmaceutical companies, hospitals, clinics, and physicians. By streaming data changes in real time to the document, graph, and search databases, users always have the most up-to-date view of data in the most appropriate format. Further, by implementing this service on the MapR Converged Data Platform, Liaison is able to secure all of the data components together, avoiding the data and security silos that alternative solutions require.
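
The Liaison pattern, one immutable log of changes feeding several differently shaped views, can be sketched in a few lines of Python. This is only an in-memory illustration and does not use the MapR Streams API; `ChangeLog`, `View`, `apply_document`, and `apply_search` are hypothetical names invented for this example, and the patient data is made up.

```python
class ChangeLog:
    """Append-only log: changes are only ever added, never edited or removed."""

    def __init__(self):
        self._events = []

    def append(self, record_id, change):
        # Store a copy so later mutation by the caller cannot alter history.
        self._events.append((record_id, dict(change)))
        return len(self._events) - 1  # offset of the event just written

    def read_from(self, offset):
        return list(self._events[offset:])


class View:
    """A consumer with its own cursor that builds its own representation."""

    def __init__(self, log, apply):
        self.log = log
        self.apply = apply  # function(state, record_id, change)
        self.cursor = 0     # offset of the next event this view will read
        self.state = {}

    def catch_up(self):
        new_events = self.log.read_from(self.cursor)
        for record_id, change in new_events:
            self.apply(self.state, record_id, change)
        self.cursor += len(new_events)
        return len(new_events)


def apply_document(state, record_id, change):
    # Document view: merge each change into the record's current document.
    state.setdefault(record_id, {}).update(change)


def apply_search(state, record_id, change):
    # Search view: map each lowercase token to the records that mention it.
    # (A real index would also retract tokens from overwritten values.)
    for value in change.values():
        for token in str(value).lower().split():
            state.setdefault(token, set()).add(record_id)


log = ChangeLog()
documents = View(log, apply_document)
search = View(log, apply_search)

log.append("patient-1", {"name": "Ada Example"})
log.append("patient-1", {"allergy": "penicillin"})

documents.catch_up()
search.catch_up()
print(documents.state)
print(sorted(search.state))
```

Because each view keeps its own cursor, a new representation (a graph view, say) could be added later and rebuilt from offset 0 without touching the log or the existing views.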

Without a converged platform, companies are forced to deploy these types of applications on at least three data silos–a messaging cluster, a Hadoop cluster, and a NoSQL database cluster. Silos mean independent clusters that need to be provisioned, managed, and secured using different tools and methods, which means more servers and more overhead. Worse, silos require data to be moved constantly, introducing delays, duplication, and inconsistency between systems. With the MapR Platform, data movement and duplication are avoided because MapR Streams data is available not only to stream-oriented tools, but also to batch-oriented tools like MapReduce and Hive.

What is IoT-scale? IoT implies two things–globally distributed endpoints and enormous volumes of data. MapR Streams scales to billions of events per second due to its linear scalability and ability to handle over 1 million events per second per node in reliable mode. When endpoints are distributed globally, the application infrastructure must be distributed globally as well to minimize communication delays. MapR Streams can replicate event data between thousands of geographically distributed clusters interconnected arbitrarily–in a tree, a ring, a star, or a mesh–with built-in loop prevention. Further, event metadata like message offsets and consumer cursors is carried alongside the data, allowing endpoints to move between clusters when appropriate. This is critical, for example, in powering smart city initiatives, where cars need to consume a continuous stream of data from road sensors and other cars, switching between clusters as they move around to minimize latency.
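
To make "loop prevention" concrete, here is one common way to keep an event from circulating forever in a ring or mesh: tag each event with its origin cluster and a sequence number, and have every cluster drop events it has already stored. This is a generic sketch and an assumption on our part, not a description of how MapR Streams implements it; the `Cluster` class and its methods are hypothetical, and replication is modeled as a direct synchronous call rather than an asynchronous network transfer.

```python
class Cluster:
    """Toy cluster that forwards every new event to all linked peers."""

    def __init__(self, name):
        self.name = name
        self.log = []       # events in arrival order: (origin, seq, payload)
        self.seen = set()   # (origin, seq) ids already stored here
        self.peers = []
        self._next_seq = 0

    def link(self, other):
        # One-directional replication link from this cluster to `other`.
        self.peers.append(other)

    def publish(self, payload):
        # A locally produced event is stamped with this cluster as its origin.
        event = (self.name, self._next_seq, payload)
        self._next_seq += 1
        self.receive(event)

    def receive(self, event):
        event_id = event[:2]
        if event_id in self.seen:
            return  # already stored: drop it, which breaks replication loops
        self.seen.add(event_id)
        self.log.append(event)
        for peer in self.peers:
            peer.receive(event)


# Full mesh of three clusters: every pair is linked in both directions.
a, b, c = Cluster("A"), Cluster("B"), Cluster("C")
for src, dst in [(a, b), (b, a), (b, c), (c, b), (c, a), (a, c)]:
    src.link(dst)

a.publish("road sensor reading")
b.publish("vehicle position")
print(a.log == b.log == c.log, len(a.log))
```

Each event ends up in every cluster exactly once, even though the mesh contains cycles, because the second arrival of any `(origin, seq)` pair is discarded instead of being forwarded again.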


How is this possible? Again, the answer is convergence. We’ve spent over six years solving the hard problems of distributed data systems. Initially, we focused on writing data reliably with synchronous replication between multiple nodes, distributing metadata between all nodes in the cluster so there isn’t a single point of failure, recovering and rebalancing after node failures, and replicating data between multiple clusters for disaster recovery. We built a foundation on which we could add new data services. Just two years after releasing version 1.0 of our platform, we added MapR-DB, a NoSQL database that leveraged these capabilities, and soon added a new one: master/master real-time table replication. Now, two years later, we are adding a publish-subscribe interface to build the industry’s first global, IoT-scale big data event streaming service.

We can’t wait to see what you build with it.


