Streaming data is now a major focus for many big data projects, including real-time applications, so there’s a lot of interest in excellent messaging technologies such as Apache Kafka or MapR Streams, which uses the Kafka 0.9 API. Terminology can be confusing, however, especially with so many similar new names showing up. To clarify, here are a few ideas and terms to keep in mind:
What’s the difference between MapR Streams and Kafka Streams?
This one’s easy: different technologies for different purposes. There’s a difference between messaging technologies (Apache Kafka, MapR Streams) and tools for processing streaming data (such as Apache Flink, Apache Spark Streaming, Apache Apex). Kafka Streams is a soon-to-be-released processing tool for simple transformations of streaming data. The more useful comparison is between the processing capabilities of Kafka Streams and those of more full-service stream processing technologies such as Spark Streaming or Flink.
Despite the similarity in names, Kafka Streams aims at a different purpose than MapR Streams. The latter was released in January 2016. MapR Streams is a stream messaging system that is integrated into the MapR Converged Platform. Using the Apache Kafka 0.9 API, MapR Streams provides a way to deliver messages from a range of data producer types (for instance IoT sensors, machine logs, clickstream data) to consumers that include but are not limited to real-time or near real-time processing applications.
What is a stream, a topic, a broker?
For best practice in working with streaming event data, it’s important to have a message delivery system that provides a replayable queue of messages. In other words, persistence matters. Time-to-live for messages should be configurable. In addition, an ideal messaging tool should handle multiple producers and consumers in a de-coupled manner: the message is available whether or not the consumer is available at that moment. It’s also important to handle streaming data from many sources in a way that can be easily identified or accessed.
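To make the ideas of persistence, replay, and time-to-live concrete, here is a minimal in-memory sketch in Python. It is a toy model, not how Kafka or MapR Streams is implemented; the class name `ReplayableLog` and its methods are hypothetical names chosen for illustration.

```python
import time


class ReplayableLog:
    """Toy append-only message log with offsets and a configurable time-to-live.

    Hypothetical illustration only: real messaging systems persist to disk
    and replicate data; this just keeps a list in memory.
    """

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._entries = []      # list of (offset, timestamp, message)
        self._next_offset = 0

    def append(self, message, now=None):
        """Producer side: store a message and return its offset."""
        now = time.time() if now is None else now
        offset = self._next_offset
        self._entries.append((offset, now, message))
        self._next_offset += 1
        return offset

    def expire(self, now=None):
        """Drop messages older than the time-to-live."""
        now = time.time() if now is None else now
        cutoff = now - self.ttl_seconds
        self._entries = [e for e in self._entries if e[1] >= cutoff]

    def read_from(self, offset):
        """Consumer side: read every retained message at or after an offset.

        Reading does not remove anything, so any consumer can replay.
        """
        return [(o, m) for (o, _, m) in self._entries if o >= offset]


log = ReplayableLog(ttl_seconds=60)
log.append("sensor-1: 20.5C", now=0)
log.append("sensor-1: 20.7C", now=30)

first_read = log.read_from(0)   # a consumer that was offline can start at 0
replay = log.read_from(0)       # reading again returns the same messages

log.expire(now=70)              # the message written at t=0 is past its TTL
after_expiry = log.read_from(0)
```

Because the producer only appends and each consumer only tracks its own offset, neither side needs the other to be running at the same moment, which is the de-coupling described above.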
As excellent messaging systems of this style, Apache Kafka and MapR Streams share many capabilities, but they function somewhat differently. In both systems, data is assigned to a topic, a flow of messages identified by name as a particular category. You might name a topic based on the IoT sensor or group of sensors from which data is being produced, for example. Both systems provide partitions of the topics to help with load balancing, as shown in Figure 1. In both systems, consumers (or consumer groups) subscribe to the topics of interest. This style is in contrast to some older messaging systems that broadcast data from producers to all consumers. MapR Streams can handle a larger number of topics than Kafka, but topics in both systems have a similar purpose.
Figure 1: Streaming event data can be assigned to a category (topic) in MapR Streams or in Apache Kafka’s messaging technology. Partitioning of data for a topic provides an advantage for load balancing. Multiple consumers within a consumer group may subscribe to partitions, as shown here. (Image © Dunning & Friedman 2016, used with permission. From Chapter 4 of book Streaming Architecture: New Designs Using Apache Kafka and MapR Streams.)
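The partitioning idea in Figure 1 can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the function names are hypothetical, CRC32 stands in for whatever hash a real client uses, and the round-robin assignment is just one possible strategy for spreading partitions across a consumer group.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a message key to a partition with a stable hash.

    Assumption: CRC32 is a stand-in here; real clients use their own hash.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions, consumers):
    """Spread a topic's partitions round-robin over the consumers in one group."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Messages with the same key always land in the same partition.
p1 = partition_for("sensor-42", 4)
p2 = partition_for("sensor-42", 4)

# Two consumers in one group split the four partitions between them.
group = assign_partitions(4, ["consumer-a", "consumer-b"])
```

Each partition is owned by exactly one consumer in the group, so adding consumers (up to the number of partitions) spreads the load without delivering the same message twice within that group.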
What about the term “broker”? Here’s where some differences appear. Kafka documentation defines the Kafka broker this way: “Kafka is run as a cluster comprised of one or more servers each of which is called a broker.” You may have seen diagrams indicating data coming from multiple producers to a Kafka cluster and multiple consumers subscribing to data from that cluster. With MapR Streams, the situation is different: messaging is not done on a separate cluster but instead is integrated into the platform along with files and tables. As a result, MapR does not have (or need) a broker. There is no equivalent to the Kafka broker because, unlike Kafka, messaging with MapR doesn’t have to be done on a cluster separate from where the main processing takes place or, indeed, from where data is stored.
Another potential confusion comes with the term “stream” in the MapR system. In addition to topics, the MapR Streams feature of the converged platform provides a high-level management capability called a stream, which is a collection of many topics for which shared policies are desirable. There is no equivalent in Kafka. Policies such as control of access and time-to-live are set at the stream level for MapR. There can be many streams (collections of topics) per MapR cluster.
An additional policy set at the stream level is a particularly strong capability for MapR Streams: reliable geo-distributed replication of messages. The following figure illustrates the function of a stream as a collection of topics in a MapR streaming system.
Figure 2: Unique to MapR Streams – a stream is a collection of topics. A stream is a first class object in the MapR Converged Platform. (Image © Dunning & Friedman 2016, used with permission. From Chapter 5 of book Streaming Architecture.)
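As a rough illustration of a stream as a collection of topics that share policies, here is a small Python sketch. It is a conceptual model only, not the MapR API: the `Stream` class, its fields, and the sample values are hypothetical, and the `path:topic` form of the full topic name is modeled on MapR's naming convention as an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Stream:
    """Hypothetical model: policies live on the stream, not on each topic."""
    path: str
    ttl_seconds: int
    readers: set = field(default_factory=set)
    topics: set = field(default_factory=set)

    def create_topic(self, name):
        """Add a topic and return its full name (assumed 'path:topic' form)."""
        self.topics.add(name)
        return f"{self.path}:{name}"

    def ttl_for(self, topic):
        """Every topic in the stream inherits the stream-level time-to-live."""
        if topic not in self.topics:
            raise KeyError(topic)
        return self.ttl_seconds

    def can_read(self, user, topic):
        """Access control is checked against the stream-level reader list."""
        return topic in self.topics and user in self.readers


sensors = Stream(path="/iot/sensors", ttl_seconds=7 * 24 * 3600,
                 readers={"analytics"})
temperature = sensors.create_topic("temperature")
pressure = sensors.create_topic("pressure")
```

Changing `ttl_seconds` or `readers` on the stream changes the policy for every topic in it at once, which is the management convenience the stream abstraction is meant to provide.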
The capabilities of the modern style of messaging system open new possibilities in the way people work with streaming data. The impact can go far beyond just data delivery for a particular low-latency application, particularly as organizations begin to recognize the potential of these new technologies and adopt powerful stream-based designs for their data architectures.
For more ideas about using streaming data, read the O’Reilly book Streaming Architecture: New Designs Using Apache Kafka and MapR Streams by Ted Dunning and Ellen Friedman, available free online.
To get started with sample programs, see these blogs: