If you’re not already looking at ways to efficiently handle streaming data flow, chances are you will be soon. An increasing number of organizations are shifting their approach from a largely batch-based design to one that incorporates more streaming processes. What’s the allure?
For one thing, technologies designed to move and analyze streaming data can make setting up and loading a data warehouse simple and fast. And if you use a streaming processor with low latency, insights from data analysis arrive quickly, giving you a more timely view of what is actually going on. Seeing results faster isn’t just a quantitative change: extremely low-latency applications make qualitative differences in what you are able to do. Consider anomaly detection, for example: lower latency and the ability to analyze streaming data in real time mean much better potential response times. That is a key requirement if you are to react in time to prevent damage, as with industrial equipment, or to shut down fraudulent attacks on a secure website before substantial loss of funds or information.
In short, streaming data processes and real-time analysis better fit how the world works in many situations. As a result, there is growing interest in technologies designed for streaming data. An interesting new tool in this space comes from the Apache Flink open source project. Flink is an attractive option because it provides a streaming dataflow engine that delivers very high throughput even at low latency, with fault tolerance on large-scale distributed systems.
The name “Flink” reflects the project’s European roots as well as the software’s reputation for being fast and agile.
A screen shot of Google translation service from German to English shows off one of the reasons the name “Flink” was chosen when the project entered incubation at Apache.
A bit of background: the Flink project started in academia as a collaboration between research groups at several European universities, and it was originally called Stratosphere. The project first caught my attention through an informal talk on the open stage at Berlin Buzzwords back in June 2013, presented by two PhD students at Technische Universität Berlin, Sebastian Schelter and Kostas Tzoumas. Almost a year later I saw a more technical presentation by Stephan Ewen at NoSQL Matters in Cologne, just before Stratosphere was proposed to the Apache Foundation incubator. Since that time the project has taken its new name, Flink (to avoid confusion with a pre-existing project), graduated from incubation to top-level project in just 8 months, and begun to catch the attention of a much broader audience. Kostas and Stephan, both still active Flink committers, have joined the company Data Artisans as CEO and CTO, respectively. And now there is a large international community of Flink developers contributing to the open source project.
Picking a lively animal – the squirrel – as the project logo was a good move: it symbolizes how fast and agile the software is, as well as how rapidly the Flink community is expanding, both in contributors and users. Growing awareness of and interest in this open source project was apparent at the Strata conference in London in May 2015, and even more so at Hadoop Summit in San Jose in June. So it was gratifying but not surprising to see a big turnout, over 60 people, for an evening event of the Bay Area Apache Flink meetup group hosted at the headquarters of MapR Technologies in San Jose on 27 August 2015.
Bay Area Apache Flink Meetup on 27 Aug at MapR
The evening offered a full program with five presenters. Three were visiting from research institutes in Sweden, and meetup organizer Henry Saputra is from the Bay Area. The fifth speaker was Ted Dunning, Chief Application Architect at MapR, who served as a mentor for Flink as it came into the Apache incubator. He provided an overview of a key concept: the significance of true streaming as compared to micro-batch-based approaches.
Use Cases: Micro-batching or Streaming?
Just as the technologies for distributed big data platforms have become more refined and mature with expanding demand for scalable solutions, so too have technologies for streaming data flow and real-time applications been evolving. Apache Storm was a pioneer in real-time processing, using true streaming. Apache Spark took a clever approach that approximates real time with micro-batching, which made it possible to provide exactly-once semantics along with the advantages of in-memory processing. Now Apache Flink offers software that, like Storm, is a true stream processor, but that, like Spark, provides exactly-once guarantees.
How are streaming and micro-batching different, and when is either approach appropriate? At the Flink meetup, Ted explained the technical distinction as follows.

True streaming:

- Input arrives as records in a particular sequence
- Output is needed as soon as possible (but not sooner – time is required to verify sequence)
- Output never needs to be modified once it is written

Micro-batching:

- Input is divided into batches, by number of records or by time
- As many input batches as necessary (but a finite number) are retained
- A batch program is run against the retained batches, together with stored state
- A new state and all of the rows of output are recorded as streaming output
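The micro-batch loop described above can be sketched in plain Python. This is a conceptual illustration only, not the Spark or Flink API; the function and variable names are hypothetical:

```python
# Conceptual sketch of a micro-batching loop: input is grouped into
# batches, a batch program runs against each batch plus the carried-over
# state, and a new state plus output rows are emitted after each batch.

def micro_batch_run(records, batch_size, batch_fn, initial_state):
    """Group records into fixed-size batches and fold state across them."""
    state = initial_state
    output = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        # The batch program sees the retained batch plus the stored state
        # and must return a new state along with the rows it produced.
        state, rows = batch_fn(batch, state)
        output.extend(rows)
    return state, output

# Example batch program: a running total carried across batches.
def running_total(batch, total):
    total += sum(batch)
    return total, [total]  # emit the cumulative sum after each batch

final_state, emitted = micro_batch_run([1, 2, 3, 4, 5, 6], 2, running_total, 0)
# emitted holds one cumulative sum per batch: [3, 10, 21]
```

The sketch makes the cost visible: `state` must be explicitly returned, stored, and handed to the next batch, which is exactly the serialization burden the next paragraph discusses.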
One of the characteristics of micro-batching is that state must be enumerated, serialized, and preserved across batches, which is not easy to do and which makes it difficult to meet the “as soon as possible” requirement of true streaming. For this reason, micro-batching is often implemented without state.
In many situations either approach – true streaming or micro-batching – will do. A simple example Ted described to illustrate this idea is computing a rolling monthly total of sales, updated at daily intervals. A clever implementation computes daily sums and then adds them across the trailing month. True streaming is not essential in a situation like this; a micro-batch-based approach is fine to use.
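The rolling-total example lends itself naturally to batching because each day is its own boundary. A minimal Python sketch (names and the window length are illustrative assumptions) treats each daily total as one batch and keeps a small sliding window as state:

```python
from collections import deque

def rolling_window_sums(daily_totals, window_days=30):
    """Yield the rolling sum over the last `window_days` daily totals.

    Each daily total acts as one 'batch'; the deque is the small piece
    of state carried from one day to the next.
    """
    window = deque()
    total = 0
    for day_sum in daily_totals:
        window.append(day_sum)
        total += day_sum
        if len(window) > window_days:
            total -= window.popleft()  # drop the day that fell out of the window
        yield total

# Rolling 3-day sums over five days of daily totals:
print(list(rolling_window_sums([10, 20, 30, 40, 50], window_days=3)))
# -> [10, 30, 60, 90, 120]
```

Because the day boundaries are fixed in advance, nothing here depends on when within a day an individual sale happened, which is why micro-batching is a comfortable fit.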
There are some situations, however, that need a true streaming process such as the one Flink offers. Remember that in many real-world situations, the input to the program, from the process or event of interest, essentially never stops. One way to recognize a use case that requires true streaming in order to work well is to look for one in which there are no natural boundaries for batches. Computing the rolling monthly time-on-site for visitors to a website is one such case. The number of visits might be updated daily (or even every minute), but the problem is that time-on-site depends on the definition of a session. How do you recognize the start and stop of a session, a series of activities between periods of inactivity? There are no natural boundaries that can reasonably be used to define batches; a true streaming architecture is required. Anomaly detection, mentioned earlier, is another case that benefits from streaming analysis rather than micro-batching.
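The sessionization problem can be sketched as gap-based grouping: a session ends only when enough silence has passed since the last event. This is a plain Python illustration, not Flink’s session-window API, and the timeout value and names are assumptions:

```python
def sessionize(event_times, inactivity_gap):
    """Group sorted event timestamps (in seconds) into sessions.

    A new session starts whenever the gap since the previous event
    exceeds `inactivity_gap`. A session's end is only known after
    enough inactivity has elapsed, so there is no fixed batch boundary
    that could safely split this computation -- the motivation for a
    true streaming approach.
    """
    sessions = []
    current = []
    for t in event_times:
        if current and t - current[-1] > inactivity_gap:
            sessions.append(current)  # previous session is closed by the gap
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions

# Events at various seconds, with a 30-second inactivity gap:
print(sessionize([0, 10, 20, 100, 110, 300], inactivity_gap=30))
# -> [[0, 10, 20], [100, 110], [300]]
```

Time-on-site per session is then just the span from a session’s first event to its last. Note that the last session here is only provisionally closed: one more event could still reopen it, which is exactly why fixed batch boundaries are unsafe.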
Another idea brought out at the meetup is that a streaming capability can extend to batch processing more smoothly than trying to go the other way. Keep in mind that Flink can also do batch computation; a convenient aspect of Flink is that it can deliver high throughput across a wide range of latencies, from real time to batch.
As streaming architectures are increasingly pursued, people have some good choices among the available technologies. I’m excited about the prospect of watching how people will put Flink to use, as well as how this project will further evolve. If you’re interested in getting more involved, visit the Apache Flink website, subscribe to the user or developer mailing list, and follow @ApacheFlink on Twitter. This project has the advantage of an open and enthusiastic community, and your input can be valuable.
There are chapters of the Apache Flink meetup in a number of locations beyond the Bay Area group, including Washington, DC, New York, and Berlin. And the first Flink conference, Flink Forward 2015, will take place in Berlin on 12-13 October.