MapR Streams brings integrated publish/subscribe messaging to the MapR Converged Data Platform. In this post, we will give a high-level overview of the components of MapR Streams. Then, we will follow the life of a message from a producer to a consumer, with an oil rig use case as an example.
MapR Streams Concepts
Topics are logical collections of messages that are managed by MapR Streams. Topics decouple sources, which are the producers of data, from consumers, which are applications for processing, analyzing, and sharing data. Topics organize events: producers publish to a relevant topic and consumers subscribe to the topics of interest to them.
Topics are partitioned for throughput and scalability. Partitions, which exist within topics, are parallel, ordered, sequences of messages that are continually appended to. Partitions make topics scalable by spreading the load for a topic across multiple servers. Producers’ publishing is load balanced between partitions by MapR Streams, and consumers can be grouped to read in parallel.
A stream is a collection of topics that you can manage together. Streams can be asynchronously replicated between MapR clusters, with publishers and listeners existing anywhere, enabling truly global applications.
The MapR Streams replication feature gives your users real-time access to live data distributed across multiple clusters and multiple data centers around the world.
You can replicate streams in a master-slave, many-to-one, or multi-master configuration between thousands of geographically distributed clusters interconnected arbitrarily – in a tree, a ring, a star, or a mesh. MapR Streams detects loops and prevents message duplication.
With Streams replication, you can create a backup copy of a stream for producers and consumers to fail over to if the original stream goes offline. This feature significantly reduces the risk of data loss should a site-wide disaster occur, making it essential for your disaster recovery strategy.
Life of a Message
To show you how these concepts fit together, we will go through an example of the flow of messages from a producer to a consumer.
Imagine that you are using MapR Streams as part of a system to monitor oil wells globally.
Your producers include sensors in the oil pumps, weather stations, and an application which generates warning messages. Your consumers are various analytical and reporting tools.
In a volume in a MapR cluster, you create the stream /path/oilpump_metrics. In that stream, you create the topics Pressure, Temperature, and Warnings.
Of all of the sensors (producers) that your system uses to monitor oil wells and related data, let's choose an oil pump sensor that is in Venezuela. We'll follow messages generated by this sensor and published in the Pressure topic. When you created this topic, you also created several partitions within it to help spread the load among the different nodes in your MapR cluster and to help improve the performance of your consumers. For simplicity in this example, we'll assume that each topic has only one partition.
How Are Messages Sent? A Message Enters the System
- The sensor producer application sends messages to the Pressure topic using the MapR Streams producer client library.
- The client library buffers incoming messages.
- When the client has a large enough number of messages buffered or after an interval of time has expired, the client batches and sends the messages in the buffer. The messages in the batch are published to the Pressure topic partition on a MapR Streams server. For topics with multiple partitions, the MapR Streams server automatically load balances messages from producers in a in a sticky round-robin fashion. Producers can also influence which messages go to which partition by including a partition ID, or a key for hashing, with each message.
- When the messages are published to a partition, they are appended in order. Each message is given an offset, which is a sequentially numbered ID. Older messages have lower numbered offsets, while the newest messages have the highest numbers.
- Each partition and all of its messages are replicated for fault tolerance. The server owning the primary partition for the topic assigns the offset IDs to messages and replicates the messages to replica containers within the MapR cluster.
- The server then acknowledges receiving the batch of messages and sends the offset IDs that it assigned to them back to the producer.
How Are Messages Read? The Message is Read from the System
A consumer application that correlates oil well pressure with weather conditions is subscribed to the Pressure topic. Many more consumers could be subscribed to it, too.
- When the consumer application is ready for more data, it first issues a request, using the MapR Streams client library, to poll the Pressure topic for messages that the application has not yet read.
- The client library asks if there are any messages more recent than what the consumer application has already read.
- Once the request for unread messages is received, the primary partition of the Pressure topic returns the unread messages to the client. The original messages remain on the partition and are available to other consumers.
- The client library passes the messages to the consumer application, which can then extract and process the data.
- If more unread messages remain in the partition, the process repeats with the client library requesting messages.
When Are Messages Deleted?
Since messages remain in the partition even after delivery to a consumer, when are they deleted? When you create a stream, you can set the time-to-live for messages.
Once a message has been in the partition for the specified time-to-live, it is expired. An automatic process reclaims the disk space that the expired messages are using.
The time-to-live can be as long as you need it to be. Messages will not expire if the time-to-live is zero, and will remain in the partition indefinitely.
You don’t have to worry about partitions getting too big to store on a single server; partitions will be re-allocated every 2GB to balance storage. MapR Streams can intelligently move partitions around in a cluster in order to spread the data out, allowing topics to be infinite, persistent storage.
MapR Streams provides reliable, global, IoT-scale publish-subscribe event streaming, allowing for the real-time collection and processing of events, paving the way to many use cases such as real-time alerting, monitoring, fraud detection, offers, and more. In this example, we saw how data can move from a producer like an oil well sensor to a consumer like an analytics application.
To find out more about MapR Streams, visit our MapR Streams page.