Where do you go from here?
Try reexamining your goals for different projects and see what advantages you might gain from transitioning to a universal stream-based approach in addition to the specific benefits for your real-time analytics applications.
The fact is, there’s a revolution in what you can do with streaming data for a wide variety of use cases, from IoT sensor data to financial services, telecommunications, web-based business, retail, healthcare, and more. New technologies that efficiently handle continuous event data with speed at scale are part of why this revolution is possible. Another key ingredient is a new way to design architecture that exploits these emerging technologies. The big change is to see the power in a universal stream-based design. This does not mean that streaming data is used for everything, but it does mean that streaming becomes a common approach rather than something considered only for specialized, real-time projects.
There are great benefits to be gained when stream-based designs for big data architectures become a habit.
At the heart of effective stream-based architecture is the message passing itself. A big difference between stream-based and traditional design (or even people’s preconception of streaming) is that the messaging layer plays a much more prominent role. It can and should be used for more than just a step that precedes real-time analytics, although it is essential for those applications as well. For this type of messaging to be effective, it needs to be a Kafka-style tool. New technologies continue to be developed, but at present we see Apache Kafka and MapR Streams as good choices for the messaging layer to support the capabilities needed for an effective stream-based system. Whatever stream messaging technology you choose, you should ask whether it has the following essential capabilities.
For the processing components of modern streaming architectures, there are a variety of strong technologies. We see particular promise in Apache Spark Streaming and Apache Flink, each of which takes a somewhat different approach. Spark Streaming is an additional feature of the widely popular Spark software. It takes advantage of in-memory processing for speed and uses a special case of batch processing, called microbatching, to approximate real-time analytics. Flink is a newer technology that also provides speed at scale but approaches the problem from the opposite direction: it treats per-event stream processing as the primary model and handles batch processing as the special case of a bounded stream when necessary. Both systems are very attractive options to complement the messaging layer.
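The contrast between the two models can be sketched in a few lines of plain Python. No Spark or Flink API is used here, and the function names are hypothetical; the sketch only shows the shape of each approach: microbatching collects events into small groups before computing on them, while per-event processing emits a result the moment each record arrives.

```python
from typing import Iterable, Iterator, List

def microbatches(events: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Group an unbounded event stream into small batches (the Spark Streaming model)."""
    batch: List[int] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush a final partial batch
        yield batch

def per_event(events: Iterable[int]) -> Iterator[int]:
    """Process each event the moment it arrives (the Flink model)."""
    for event in events:
        yield event * 2  # any per-record transformation

# Microbatch: one result per small batch, trading a little latency for throughput.
print([sum(b) for b in microbatches(range(10), batch_size=4)])  # [6, 22, 17]

# Per-event: results emitted immediately, record by record.
print(list(per_event(range(3))))  # [0, 2, 4]
```

Real engines layer windowing, state management, and delivery guarantees on top of these basic shapes, but the latency/throughput trade-off visible here is the core difference between the two approaches.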
One of the benefits of adopting a stream-based architecture with effective messaging is that it gives you a system that is faster due to less data motion. This approach is convenient: there is less administration needed and fewer moving parts to coordinate. It’s a powerful way to support microservices that in turn make your organization more agile. A messaging component used in the right places in an architectural design serves to decouple services; the source of data does not have to coordinate with the consumer. That’s also why persistence matters for messages: if the consumer is not available when the message is delivered, that’s OK—it will be available when it is needed. It’s not that a query-and-response approach is never useful; it’s just that the stream-based messaging layer can be powerful in many parts of the design.
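The decoupling and persistence described above can be illustrated with a toy message log in Python. `MessageLog` is a hypothetical stand-in for a Kafka-style topic, not any real API: producers append, messages are retained rather than deleted on delivery, and each consumer keeps its own read position, so a consumer that was unavailable when a message arrived still sees it later.

```python
from collections import defaultdict

class MessageLog:
    """A minimal persistent, replayable topic: producers append,
    and each consumer tracks its own read offset independently."""
    def __init__(self):
        self._messages = []
        self._offsets = defaultdict(int)  # consumer name -> next offset to read

    def send(self, message):
        self._messages.append(message)    # retained, not deleted after delivery

    def poll(self, consumer):
        start = self._offsets[consumer]
        self._offsets[consumer] = len(self._messages)
        return self._messages[start:]

# The producer sends while no consumer is attached...
topic = MessageLog()
topic.send({"sensor": "a1", "temp": 21.5})
topic.send({"sensor": "a1", "temp": 22.0})

# ...and a consumer that shows up later still receives every message.
print(topic.poll("analytics"))  # both messages, in order
```

Because the log, not the producer, holds the data, neither side needs to know when the other is running; that independence is also what makes it cheap to attach new consumers later.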
Another aspect of these designs and the desire for flexibility is to provide data that multiple consumers will use in different ways. That underlines the importance of delivering and persisting raw data in many situations, because at the time you design an architecture and data flow, you may not know all the applications for which you may need this data or indeed what aspects of the data will ultimately be important.
Effective handling of streaming data lets you more easily respond to changing events and react to life as it happens by acting on real-time insights.
Geo-distributed replication of data streams greatly expands the impact of stream-based architectures. We provided an example use case in Chapter 7: Geo-Distributed Data Streams, but the advantages of being able to rapidly share streaming data across multiple data centers apply to use cases in many sectors. At present, MapR Streams is the messaging technology that best fits these capabilities.
As you plan new projects, building your design based on a streaming architecture becomes fairly straightforward, and that in turn gives you greater flexibility for future modifications. But how do you incorporate the stream-based style of architecture when you have legacy services?
The good news is that it is easier to migrate to messaging-style applications than you might think. The flexibility imparted by a broader role for messaging also gives you an effective and relatively convenient way to incorporate change into your legacy projects. Here’s how it works.
One of the limitations of traditional architectures is that even if they work efficiently on those jobs for which they were originally designed, when you try to add a service or make a modification, change is difficult. This is true, in part, because of strong dependencies between services, as suggested by the diagram in Figure 8-1.
Suppose you want to make a change to one of your legacy services. The coupling of components in the traditional design means that the changes you make in Service 1 may also affect Services 2 and 3. The potential for such change means that the teams supporting the affected services need to be involved in the design decisions for Service 1. That can lead to bureaucratic deadlock. But you can be free to make this modification if you insert a messaging queue between the services and the database, as shown in Figure 8-2.
The addition of the messaging layer for updates allows you to make modifications to a service without having an unwanted impact on the others. Here, all updates from Services 1–3 go through the intermediate step of a message queue, shown in Figure 8-2 as a tube labeled “updates,” before reaching the shared database. Services 1–3 act as producers, and the shared database becomes a consumer of the message queue. This intermediate component decouples the producers and consumers of the data.
When you are ready to modify Service 1, you first make a copy of the database that will not be shared, as depicted in Figure 8-3. Instead, Service 1 will read from this database while Service 2 and Service 3 will continue to read from the shared database. Note that all the updates still go to the same message queue, but now the unshared database becomes a second consumer of the data. This in effect isolates the legacy services from the impact of modifications to Service 1.
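The migration steps above can be sketched in miniature, with hypothetical names and Python dictionaries standing in for the databases in Figures 8-2 and 8-3: all services publish updates to one queue, and the shared database and the new unshared copy each consume the full update stream independently.

```python
class UpdateQueue:
    """Hypothetical message queue; every consumer reads the full update stream."""
    def __init__(self):
        self._updates = []
        self._offsets = {}  # consumer name -> next offset to read

    def publish(self, update):
        self._updates.append(update)

    def consume(self, name):
        start = self._offsets.get(name, 0)
        self._offsets[name] = len(self._updates)
        return self._updates[start:]

def apply_updates(db, updates):
    """Replay queued (key, value) updates into a database (here, a dict)."""
    for key, value in updates:
        db[key] = value

queue = UpdateQueue()
shared_db = {}    # still read by Services 2 and 3
service1_db = {}  # the unshared copy read by the modified Service 1

# All services continue to publish their updates to the same queue.
queue.publish(("order:17", "accepted"))
queue.publish(("order:18", "shipped"))

# Each database is an independent consumer of the same stream,
# so both stay current without the services coordinating with one another.
apply_updates(shared_db, queue.consume("shared"))
apply_updates(service1_db, queue.consume("service1"))
assert shared_db == service1_db
```

Because each database tracks its own offset in the queue, the copy for Service 1 catches up on its own schedule, and Services 2 and 3 never notice that a second consumer exists.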
This example is an extreme simplification to help you see how decoupling can allow you to make the transition from a traditional system to this new streaming architecture in stages. In practice, some issues are likely to come up during the decoupling. One of the most serious is a dependency on transactional updates to the database. In some cases, such transactional updates can be isolated to a single service, or some services may treat the database as a pure consumer; either situation makes the decoupling easy. Another useful strategy is to send high-level descriptions of each transaction into the message queue rather than just the low-level updates to the database tables. Sending high-level updates helps abstract the system away from the details of the database and is therefore recommended in any case.
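A small example, with hypothetical field names, shows the difference: a high-level message describes the business event once, and each consumer derives its own low-level updates from it, so nothing downstream is tied to the database's table schema.

```python
# A high-level message describes what happened in business terms...
event = {"event": "transfer", "from_account": 42, "to_account": 99, "amount": 30.0}

def apply_transfer(event, balances):
    """One consumer's interpretation: it derives the low-level
    balance updates from the event itself."""
    balances[event["from_account"]] -= event["amount"]
    balances[event["to_account"]] += event["amount"]

# ...rather than prescribing row-by-row table updates in the message.
balances = {42: 100.0, 99: 100.0}
apply_transfer(event, balances)
print(balances)  # {42: 70.0, 99: 130.0}
```

A second consumer, say a fraud detector, could subscribe to the same transfer events without knowing anything about how the account tables are laid out.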
The prospect of some difficulties should not dissuade you, however. This same basic process has been used by substantial companies in order to move legacy systems into microservices architectures.
With this new stream-based approach to designing architecture for big data systems, you gain greater control over who uses data and how you can build new parts of your system as you go forward. You also join the flood of people who are beginning to take advantage of streaming data with all its benefits.
“Overall, streaming technology enables the obvious: continuous processing on data that is naturally produced by continuous real-world sources (which is most ‘big’ data sets).”
Fabian Hueske and Kostas Tzoumas, Committers and PMC members of Apache Flink
Most of the time, streaming data is just a better fit for how life happens.