With the Internet of Things expected to bring at least 21 billion devices online by 2020 (according to Gartner), a lot of people are excited about the potential value of event streaming, that is, ingesting and analyzing lots of real-time data for immediate decision-making. But streaming also introduces new concepts and components that need a closer look. This blog post is intended to provide an introduction to the components of a typical streaming architecture and various options available at each stage.
Three Components of a Streaming Architecture
A producer is a software-based system that is connected to the data source. Producers publish event data into a streaming system after collecting it from the data source, transforming it into the desired format, and optionally filtering, aggregating, and enriching it.
The streaming system takes the data published by the producers, persists it, and reliably delivers it to consumers.
Consumers are typically stream processing engines that subscribe to data from streams and manipulate or analyze that data to look for alerts and insights. There are lots of options to choose from, and more are on the way. Let’s look at what your options are for each stage.
Stage 1: Producers
Data producers collect the data from data sources, convert it to the desired format, and publish the data into streaming platforms such as Apache Kafka and MapR Streams. Apache Flume is commonly used as a producer to Kafka. StreamSets is an up-and-coming data collector that may be worth a look.
Apache Flume is a distributed system for efficiently collecting, aggregating, and moving large amounts of data. Flume has a source and sink architecture. A Flume source collects the event data from the data sources. A Flume sink puts the event into an external repository, which is often a streaming system like Apache Kafka or MapR Streams.
StreamSets Data Collector is open source software for the development and operation of complex data flows. It provides a graphical IDE for building ingest pipelines. StreamSets can help you connect with Kafka without writing a single line of code. StreamSets Data Collector includes out-of-the-box connectors for Kafka and many other sources and destinations.
Stage 2: Streaming System
Two event transport systems that can easily scale to deliver billions of events per second are Apache Kafka and MapR Streams. The ability to linearly scale to deliver billions of events per second differentiates Kafka and MapR Streams from traditional messaging queues like Tibco EMS and IBM MQ. Kafka and MapR Streams both use a publish-subscribe model in which the data producer is the publisher and the data consumer is the subscriber.
Apache Kafka is great at handling large volumes of data. You can set up a cluster as a data backbone, which gives you great scalability. You can then easily expand the cluster as needed without downtime. Kafka also stores messages on disk and replicates them within the cluster to reduce the risk of data loss when you encounter a hardware failure.
MapR Streams is like Kafka – in fact, it uses the Kafka 0.9 API – but it has certain enterprise features that provide additional support for very large and geographically diverse networks where data integrity is essential. MapR Streams is integrated with the MapR Converged Data Platform, which combines file storage, database services, and processing frameworks in a single cluster. That means batch, interactive, and stream processing engines all have direct access to event streams, which reduces data movement and ensures consistency. Learn more about MapR Streams here.
Stage 3: Consumers (Processing)
MapR Streams and Kafka can deliver data from a wide variety of sources, at IoT scale. It’s then up to the processing engines to do something with it. Four important engines to know about include Apache Spark Streaming, Apache Flink, Apache Storm, and Apache Apex.
Spark Streaming is a built-in component of Apache Spark. Spark Streaming can consume event streams from MapR Streams, Kafka, and many other systems. By being a built-in component of Spark, Spark Streaming runs in-memory, and allows you to run ad-hoc queries on stream data. Spark Streaming can be more accurately described as “micro-batching,” or processing small amounts of batch information in quick bursts.
Apache Storm is another popular event processing engine. Unlike Spark, Storm is a pure real-time event-based analytics engine, which makes it most useful in situations in which each event needs to be processed instantaneously. Storm actually processes each event as soon as it is delivered.
Apache Flink works in-memory and is notable for its speed and scalability. Similar to Storm, it is a pure real-time event-based processing engine. The differences between the two are quite technical, having to do with such things as the way each ensures data reliability, the programming languages they support, and the capabilities of their respective APIs. Flink supports the Apache Storm API, to make the transition from Storm easy for developers familiar with Storm.
Apache Apex (incubating) is a recent entry to the market, having become available in open source only last June. It’s a YARN-native platform that unifies stream and batch processing. It lowers the expertise required to write big data applications by providing a simple API that enables users to write or reuse generic Java code. The project is still in the incubation phase, so there isn’t a lot of field experience to go by. Apex is certainly worth keeping an eye on, however.
You can see that the enthusiasm over real-time processing is being met with a host of technologies. And the landscape is constantly evolving. If you are interested in taking a deeper dive, get the “Streaming Architecture: New Designs Using Apache Kafka and MapR Streams” ebook authored by Ted Dunning and Ellen Friedman.