In the previous chapter, we looked at some of the reasons why so many people are getting interested in using streaming data. Now we explain the how—the ways to build a streaming system to best advantage.
Emerging technologies for message passing now make it possible to use streaming almost everywhere. This innovation is the biggest idea in this chapter.
Stream-based architecture provides great benefits when employed across any or all of the data activities for your enterprise.
The new designs we have in mind rely on a large-scale shift in the overall design approach you use to build systems. This transition is not just about acquiring a particular technology or the skill to use a certain fast algorithm—it is about change on a much broader and more fundamental level. It is also unusual among advances in system architecture in that it can be introduced incrementally with accelerating benefits as you convert more and more services.
The need for that level of overall change toward a streaming architecture may not be apparent to everyone right away. The initial lure to use streaming data may be a particular project or goal that requires real-time analytics. For example, suppose your organization is interested in building a dashboard for real-time updates. You might initially identify the streaming data source of interest and look for powerful stream-processing software, such as Apache Spark Streaming. You like the in-memory aspect of this tool because it can provide near–real time processing, and that meets your particular goals. You’ll export the results from this analytics application to the dashboard for almost–real time updates. You also like the idea that the raw streaming data can be analyzed right away, without needing to be saved to files or a database. Perhaps you’re thinking this is all you need to build a successful project.
But suppose the analytics program has a temporary interruption or slowdown. The incoming stream of data might be dropped. You want some insurance, so you also plan for a message queue to serve as a safety buffer as you ingest data en route to the Spark-based application. This type of design for a single-purpose data path for real-time stream processing is shown in Figure 2-1. For the purposes of this chapter, we will keep the examples generic to focus attention on the pattern of the design in each case.
The plan shown in Figure 2-1 isn’t a bad design, and with the right choice of tools to carry out the queuing and analytics and to build the dashboard, you’d be in fairly good shape for this one goal. But you’d be missing out on a much better way to design your system in order to take full advantage of the data and to improve your overall administration, operations, and development activities.
Instead, we recommend a radical change in how a system is designed. The idea is to use data streams throughout your overall architecture—data streaming becomes the default way to handle data rather than a specialty. The goal is to streamline (pun not intended) your whole operation such that data is more readily available to those who need it, when they need it, for real-time analytics and much more, without a great deal of inconvenient administrative burden.
The idea that you can build applications to draw real-time insights from data before it is persisted is in itself a big change from traditional ways of handling data. Even machine learning models are being developed with streaming algorithms that can make decisions about data in real time and learn at the same time. Fast performance is important in these systems, so in-memory processing methods and technologies are attracting a lot of attention.
However, as mentioned in Chapter 1: Why Stream?, the ability to analyze streaming data directly without having to first save it to files or a database does not mean that it’s not useful to persist it—just that persistence can be done independently. That goes for other processing steps as well. With an overall stream-based approach that cuts across multiple systems, as we are advocating, one important characteristic is that data can be used immediately upon ingestion, but it should not disappear if the downstream process is not ready for it when the data arrives. Messages should be durable.
In addition, these architectures need to handle very large volumes of data, so the tools used to implement them need to be highly scalable throughout the system. It’s also important to design systems that can handle data from multiple data sources, making it available to a variety of data consumers.
“Data Integration means making available all the data that an organization has to all the services and systems that need it.”1
An important advantage of a system designed to use streaming data as part of overall data integration is the ability to change the system quickly in response to changing needs. Decoupling dependencies between data sources and data consumers is one key to gaining this flexibility, as explained more thoroughly in Chapter 3: Streaming Architecture: Ideal Platform for Microservices, which deals with streaming data and microservices.
A generalized view of these characteristics of a stream-based architecture is shown in Figure 2-2. Some details are omitted to keep the diagram simple. In this case, we’ve improved on as well as expanded the single-purpose design for real-time updates to a dashboard that was outlined in Figure 2-1. To make the comparison easy, the components that were present in Figure 2-1 are shown as shaded in Figure 2-2; the unshaded parts highlight additional projects as well as a modification of the original data flow for the real-time dashboard.
First of all, notice that the results output from the real-time application now goes to a message stream that is consumed by the dashboard rather than reaching the dashboard directly. In this way, the results can easily be used by an additional component, such as the anomaly detector shown in this hypothetical example. One nice feature of this style of design is that the anomaly detector can be added as an afterthought. The flexible system design lends itself to modifications without a great deal of administrative hassles or downtime.
Our overall design also takes into account the desire to use multiple data sources. Since the consumers of messages don’t depend on the producers, they also don’t depend on the number of producers. The messaging system also makes the raw data available to non–real time processes, such as those needed to produce a monthly report or to augment data prior to archiving in a database or search document. This happens because we assume the messaging system is durable. As in the healthcare example described in Chapter 1: Why Stream?, our streaming architecture design supports a variety of applications and needs beyond just real-time processing.
As you think about how to build a streaming system and which technologies to choose, keep in mind the capabilities required to support this design. Tools and technologies change, and new ones are developed, particularly in response to the growing interest in these approaches. But the fundamental requirements of an effective streaming architecture are more constant, so it’s important to first identify the basic needs of the system as you consider what technologies you will use.
Message-passing infrastructure is at the heart of what makes this new approach work well. Let’s examine some of the key capabilities of the messaging component if we are to take full advantage of the universal stream-based architecture presented in this book. To do that, think about what we’ve said about how the message-passing layer needs to work in our design, as represented in Figure 2-3.
You’ll see terminology used in somewhat different ways when describing different systems, so please think in terms of the underlying meaning here. The data source is what sends a series of event data to the messaging system. It’s sometimes called a producer or publisher. In our system, we expect the messaging technologies to handle messages from a huge number of producers.
The producer sends this event data without knowledge of the process that will make use of it. We call the thing that uses the messages the consumer, also sometimes called the subscriber. Each stream of messages is named (we call this the topic). The consumer (or a group of consumers) requests or subscribes to any topics it needs. We will say more about the details of these terms in Chapter 4: Kafka as Streaming Transport and Chapter 5: MapR Streams. The beauty of this approach is that the messages can be sent whether or not the consumer is ready to receive them, and they stay available until the consumer is ready. That is an essential aspect of any messaging system we select. And for our design, the messaging software must be able to provide messages to multiple consumers.
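To make these roles concrete, here is a toy sketch in Python (our own illustration, not the API of any particular messaging product): producers append messages to a named topic, and the log retains messages regardless of whether any consumer is ready for them, so a consumer that subscribes late still sees everything.

```python
from collections import defaultdict

class MessageLog:
    """Toy model of a durable pub/sub log with named topics."""
    def __init__(self):
        self.topics = defaultdict(list)   # topic name -> append-only list of messages

    def publish(self, topic, message):
        # The producer knows nothing about its consumers; it simply appends.
        self.topics[topic].append(message)

    def read(self, topic, offset):
        # Consumers pull from any offset; messages are retained, not deleted on read.
        return self.topics[topic][offset:]

log = MessageLog()
log.publish("sensor-data", {"temp": 21.5})
log.publish("sensor-data", {"temp": 22.0})

# A consumer that subscribes after publication still sees every message.
late_consumer_messages = log.read("sensor-data", offset=0)
print(len(late_consumer_messages))  # 2
```

Because each consumer tracks its own offset, any number of consumers can read the same topic independently, which is exactly the multiple-consumer property the design requires.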
Our architecture is also intended for projects using very large-scale data and requiring the ability to handle data at very high rates. Of course, if we want to use these systems in production, we also need to be confident that our messaging choice provides fault tolerance.
What characteristics, then, should we look for as being essential in our messaging technology if it is to support these needs?
A messaging tool must not require that the producer know about the consumers that will process the messages.
Message persistence is implied if this full isolation of producer and consumer is to work. Otherwise, messages will disappear whenever the producer and consumer are not coordinated to complete delivery as soon as the data appears.
Extreme performance is required for modern use cases involving streaming data. If we want to use messaging as the core backbone of our systems, we have to handle huge message rates.
It is unusual for message-passing systems to be able to maintain full isolation of producer/consumer with durability without sacrificing speed. However, to be appropriate for a universal stream-based architecture, these characteristics must exist together.
The ability to name streams of messages as topics and subscribe to them selectively is not an unusual feature, but it is an important one, as it allows consumers to select the data they need.
Replayability is a highly desirable characteristic. Consumers can go back to whatever point they wish and reread the sequence from that point. This lets them restart a sequence after a failure or reprocess history. Producers can produce events and know that they will be processed in order, thus allowing logical dependencies between events.
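Replay can be sketched as a consumer that owns its own read position into the retained log (again a toy model, not a specific product's API): rewinding is just moving the offset back.

```python
class ReplayableConsumer:
    """Toy consumer that tracks its own offset into an append-only log."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        # Return everything from the current position, then advance.
        batch = self.log[self.offset:]
        self.offset = len(self.log)
        return batch

    def seek(self, offset):
        # Rewind (or fast-forward) to any retained position and reread from there.
        self.offset = offset

log = ["event-1", "event-2", "event-3"]
c = ReplayableConsumer(log)
first_pass = c.poll()      # reads all three events
c.seek(1)                  # rewind to the second event
replay = c.poll()          # rereads from that point on
print(replay)  # ['event-2', 'event-3']
```

Because the log itself is unchanged by reading, any consumer can restart a computation from an earlier point without coordinating with the producer.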
Fault tolerance is self-explanatory and is required for critical systems.
Geo-distributed operation is not required in every use case, but in many cases it is an absolute requirement: the architecture needs to function across multiple data centers in different locations without sacrificing any of the above capabilities.
Where do we find messaging tools that can meet these stringent requirements? There are two at present that are excellent choices to meet the needs of a universal stream-based architecture: Apache Kafka, which we describe in more detail in Chapter 4: Kafka as Streaming Transport, and MapR Streams, which uses the Kafka API and which we examine in Chapter 5: MapR Streams.
In a way, the choice of messaging tools organizes itself at present into two categories: the Kafka-related group (Kafka and MapR Streams) and the Others.
Messaging systems like Kafka work very differently from older message-passing systems such as Apache ActiveMQ or RabbitMQ. One big difference is that persistence was a high-cost, optional capability for older systems and typically decreased performance by as much as two orders of magnitude. In contrast, systems like Kafka or MapR Streams persist all messages automatically while still handling a gigabyte or more per second of message traffic per server. One big reason for the large discrepancy in performance is that Kafka and related systems do not support message-by-message acknowledgement. Instead, services read messages in order and simply update a cursor occasionally to record how far they have read. Furthermore, Kafka is focused specifically on message handling rather than providing data transformations or task scheduling. That limited scope helps Kafka achieve very high performance.
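The difference is easy to see in a sketch (hypothetical names, not Kafka’s actual API): rather than sending an acknowledgment for every message, the service records its progress only once per batch of messages, so the bookkeeping cost is amortized over many reads.

```python
def process_with_cursor(messages, commit_interval=100):
    """Read messages in order; only occasionally record how far we've gotten.

    One small cursor commit per `commit_interval` messages replaces one
    acknowledgment per message -- a large part of why log-style systems
    can sustain such high throughput.
    """
    committed_offset = 0
    commits = 0
    for i, _msg in enumerate(messages, start=1):
        # ... process the message here ...
        if i % commit_interval == 0:
            committed_offset = i   # a single cheap cursor update
            commits += 1
    return committed_offset, commits

offset, commits = process_with_cursor(range(1000), commit_interval=100)
print(offset, commits)  # 1000 10
```

The trade-off is that after a crash, a consumer restarts from the last committed cursor and may reread a few messages, which is why downstream processing in such systems is usually written to tolerate replays.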
The development of a rich collection of technologies for processing streaming data, along with the evolution of effective, highly scalable messaging tools, is the driver for many more organizations to seek real-time insights from streaming data. In this book up until now, we have used the term “real time” to mean relatively low latency, but there are distinctions between technologies that approximate real time and those that actually analyze data as a real-time or very low-latency stream. For many applications, depending on SLAs, this distinction is not very important, but there are some situations in which “real-time” requirements are just that.
A detailed examination of technologies and methods for streaming analytics is beyond the scope of this short book, but we do provide an overview of desired capabilities and examine several choices, including how they differ. First we very briefly describe four technologies of interest: Apache Storm, Apache Spark Streaming, Apache Flink, and Apache Apex. Then, as we did for messaging, we take a look at some of the key capabilities for analytics that best support the stream-based architectures. We also compare some of the available technologies in the context of these capabilities.
What each project can do will change as they evolve. The qualities that best support streaming analytics, however, are relatively constant.
Given that each project’s capabilities will continue to evolve, understand that the descriptions and comparisons of specific technologies are only general and represent a moment in time, but they should serve as an aid to help you think concretely about the features you’ll want to look for.
Apache Storm was a pioneer in real-time processing for large-scale distributed systems. The project website describes Storm as “doing for realtime processing what Hadoop did for batch processing.” It’s an accurate observation that the computational framework part of Hadoop, MapReduce, introduced a wide audience to batch processing at scale, and Storm added an early way to deal with real-time processing in the Hadoop ecosystem. The Storm project started outside of Apache under the leadership of Nathan Marz and has continued to evolve since it became a top-level Apache project.
Storm’s approach is real-time processing of unbounded streams. It works with many languages. Recent additions intend to add windowing capabilities for Storm with an “at-least-once” guarantee, but historically, Storm has performed best with pure transformations or when windows could be defined at the application level rather than at the platform level. Storm’s design has up to now involved what is known as “early assembly,” in which rows are represented by Java objects that are actually constructed as they are read. This can limit performance relative to systems like Flink that use byte-code engineering to avoid actually constructing such objects eagerly.
Spark Streaming is one of the subprojects that comprise Apache Spark. Spark originated as a university-based project developed at UC Berkeley’s AMPLab starting in 2009. The project entered the Apache Foundation in 2013 and became a top-level Apache project in 2014. In the last approximately three years, the overall Spark project has seen widespread interest and adoption.
Spark accelerated the evolution of computation in the Hadoop ecosystem by providing speed through an innovation that allowed data to be loaded into memory and then queried repeatedly. Spark Core uses an abstraction known as a Resilient Distributed Dataset (RDD). When jobs are too large to fit in memory, Spark spills data to disk. Spark requires a distributed data storage system (such as Cassandra, HDFS, or MapR-FS) and a framework to manage it. Spark works with Java, Python, and Scala.
Spark Streaming uses microbatching to approximate real-time stream analytics. This means that a batch program is run at frequent intervals to process all recently arrived data together with state stored from previous data. Although this approach makes it inappropriate for low-latency (“real real-time”) applications, it is a clever way to extend batch processing to near–real time use cases, and it works well for many situations. In addition, the same code can be used for batch processing applications as for streaming applications. Spark Streaming provides exactly-once guarantees more easily than a true real-time system. Where shorter latency (real-time) analytics are needed, people often employ a combination of tools with Spark Streaming/Spark Core plus Apache Storm for the real-time side of things.
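The microbatch idea can be sketched in a few lines (a schematic of the pattern, not Spark’s API): a small batch computation runs over each group of recently arrived events, with state carried forward from one batch to the next. Here the state is simply a running total.

```python
import itertools

def microbatch(stream, batch_size, state=0):
    """Process a stream in small batches, carrying state between batches.

    `state` is a running total here; a real streaming job would
    checkpoint this state between intervals.
    """
    results = []
    it = iter(stream)
    while True:
        # Gather the "recently arrived" events for this interval.
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            break
        state += sum(batch)       # batch computation over this interval's data
        results.append(state)     # emit an updated result per interval
    return results

print(microbatch([1, 2, 3, 4, 5], batch_size=2))  # [3, 10, 15]
```

Note how each output becomes available only when its batch completes, which is the source of microbatching’s latency floor: results can never arrive faster than the batch interval.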
Apache Flink is a highly scalable, high-performance processing engine that can handle low latency as well as batch analytics. Flink is a relatively new project that originated as a joint effort of several German and Swedish universities under the name Stratosphere. The project changed its name to Flink (meaning “agile or swift” in German) when it entered incubation as an Apache project in 2014, and it became a top-level Apache project later that year; it now has an international team of collaborators. With increased public awareness, Flink’s popularity grew rapidly in 2015, and some companies already use it in production.
Flink has the capability to handle the low-latency, real-time analytics applications for which Storm is appropriate as well as batch processing. In fact, Flink treats batch as a special case of streaming. Flink programs are developer friendly: they are written in Java or Scala, and they deliver exactly-once guarantees. Like Spark or Storm, Flink requires a distributed storage system. Flink has already demonstrated very high performance at scale, even while providing a real-time level of latency.
Apache Apex is a scalable, high-performance processing engine that, like Apache Flink, is designed to provide both batch and low-latency stream processing. Apex started as an enterprise offering, DataTorrent RTS, but the core engine was made open source, and the project entered incubation at the Apache Software Foundation in the summer of 2015. Apex describes itself as being “developed with YARN in mind.” As such, it runs as a YARN application but avoids overlap in functionality with YARN. Apex supports programming in Java or Scala and was designed particularly to provide an easy way for Java programmers to build applications for data at scale as well as to reuse Java code. Like the other streaming analytics tools described here, Apex requires a storage platform. A particular advantage of Apex is the associated Malhar library of functions that cover a number of analytics needs.
Our description of tools for streaming analytics is neither exhaustive nor definitive. As we have said, all of these technologies are evolving, so descriptions of individual projects as well as comparisons of specific features and performance capabilities cannot remain accurate for long. That is in part why we encourage you to focus on the impact of capabilities for this style of architecture and to continue to assess different choices as they arise. That said, you may find it helpful to see a brief comparison of some key capabilities, which we provide here:
Any technology used for analytics in this style of architecture needs to be highly scalable, capable of starting and stopping without losing information, and able to interface with messaging technologies with capabilities similar to Kafka and MapR Streams (described previously in this chapter).
These are relative terms, but for best practice, a modern architecture often needs to be designed to deal with batch and streaming applications even at very low latency, either to meet the requirements of existing applications or to be positioned to meet future needs. High performance is also usually a requirement.
Technologies that can deliver processing that ranges from batch to low latency, as well as real-time processing without sacrificing performance, are attractive choices.
This observation does not mean that every situation requires very low-latency capabilities; indeed these are somewhat unusual, although they are becoming more common. A technology’s features should meet the requirements of the current situation, but there is some advantage to building for future needs. At present, Flink and Apex probably have the strongest performance at very low latency of the given choices, with Storm providing a medium level of performance with real-time processing.
It is useful to provide exactly-once guarantees because many situations require them. For example, in financial examples such as credit card transactions, unintentionally processing an event twice is bad. Spark Streaming, Flink, and Apex all guarantee exactly-once processing. Storm works with at-least-once delivery. With the use of an extension called Trident, it is possible to reach exactly-once behavior with Storm, but this may cause some reduction in performance.
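One generic way to obtain an exactly-once effect on top of at-least-once delivery is to make processing idempotent, for example by deduplicating on a unique transaction ID. The sketch below is our own illustration of that idea, not how any of the engines named above implements its guarantee.

```python
def apply_transactions(events, ledger=0, seen=None):
    """Apply each transaction at most once, even if events are redelivered.

    `events` are (txn_id, amount) pairs; `seen` holds IDs already applied.
    """
    seen = set() if seen is None else seen
    for txn_id, amount in events:
        if txn_id in seen:
            continue               # duplicate delivery: skip, don't double-charge
        seen.add(txn_id)
        ledger += amount
    return ledger, seen

# At-least-once delivery redelivers transaction 2, but the effect is exactly-once.
ledger, _ = apply_transactions([(1, 50), (2, 30), (2, 30), (3, 20)])
print(ledger)  # 100
```

The cost of this approach is keeping track of which IDs have been seen, which is why platform-level exactly-once guarantees, where available, are so convenient.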
This term refers to the time period over which aggregations are made in stream processing. Windowing can be defined in different ways, and these vary in their application to particular use cases. Time-based windowing groups together events that occur during a specific time interval, and is useful for asking questions such as, “how many transactions have taken place in the last minute?” Spark Streaming, Flink, and Apex all have configurable, time-based windowing capabilities. Windowing in Storm is a bit more primitive.
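A time-based (tumbling) window can be sketched directly (our own illustration, using hypothetical event tuples): each event is assigned to the fixed interval containing its timestamp, and the aggregation, here a count, runs per interval.

```python
from collections import Counter

def tumbling_counts(events, window_seconds=60):
    """Count events per fixed (tumbling) time window.

    `events` are (timestamp_in_seconds, payload) pairs; the key of each
    window is the timestamp at which that window starts.
    """
    counts = Counter()
    for ts, _payload in events:
        window_start = ts - ts % window_seconds   # floor to the window boundary
        counts[window_start] += 1
    return dict(counts)

events = [(5, "a"), (42, "b"), (61, "c"), (119, "d"), (120, "e")]
print(tumbling_counts(events))  # {0: 2, 60: 2, 120: 1}
```

With `window_seconds=60`, the result answers exactly the question in the text: how many transactions took place in each one-minute interval.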
Time may not always be the best way to determine an aggregation window. For instance, it is common to divide visitor activity on a website into sessions separated by periods of inactivity of at least a specified length, typically 30 minutes. This makes time-based windowing less useful because not every session ends at the same time. Another way to define a window is to build programmatic triggers, which allow windows to vary in length, but still require synchronization of windows for different aggregations. Flink, Apex, and Spark do trigger-based windowing. Flink and Apex also do windowing based on the content of data.
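Session windows can be sketched as a simple gap rule (our own illustration): a new session starts whenever the quiet period since the previous event reaches the inactivity threshold.

```python
def sessionize(timestamps, gap=30 * 60):
    """Split a sorted list of event times (in seconds) into sessions
    separated by inactivity of at least `gap` seconds (default 30 minutes)."""
    sessions = []
    for ts in timestamps:
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)       # still within the current session
        else:
            sessions.append([ts])         # inactivity threshold reached: new session
    return sessions

# Three page views in quick succession, a 40-minute pause, then one more view.
visits = [0, 300, 600, 600 + 40 * 60]
print(len(sessionize(visits)))  # 2
```

Because each visitor’s sessions begin and end at different times, windows like these cannot be synchronized across the whole stream, which is what makes them harder to express in a purely time-based windowing model.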
A good design for streaming architecture can be a powerful advantage for you across all the data flow of your systems, not just for real-time analytics. The design calls for certain capabilities in terms of message passing, stream processing, and persistence. It’s useful to understand the required or desired capabilities in order to assess the variety of tools that become available. Because this stream-based universal architectural design has at its heart a particular style of messaging, we will go into detail about the two technologies that currently best provide these capabilities—Apache Kafka (Chapter 4: Kafka as Streaming Transport) and Kafka-based MapR Streams (Chapter 5: MapR Streams)—but first we will examine in Chapter 3: Streaming Architecture: Ideal Platform for Microservices why stream-based architecture strongly supports a style of computing known as microservices.