The distributed computation world has seen a massive shift in the last decade. Apache Hadoop showed up on the scene and brought with it new ways to handle distributed computation at scale. It wasn’t the easiest to work with, and the APIs were far from perfect, but they worked. People tried using this platform as the proverbial hammer to build solutions covering every facet of business that had problems scaling at a reasonable cost. It didn’t take long for people to realize they should stop trying to use Hadoop MapReduce to solve every problem and to use the right tool for each problem. This was the beginning of the proliferation of new technologies. Or as I like to call it, the highly competitive age of distributed computation platforms.
These days, competition is the cornerstone of distributed computing. The most important factors underpinning distributed computation platforms are their reliability and high availability support. Once users know their platform is stable, they turn to topics like security and recoverability. There are two more factors that come into play. Which is most important to you will likely be based on you or your company’s experience in this space:
- Performance - This is an area that always bubbles to the top. Performance is typically the thing that gets people excited with any emerging technology. It is quite possibly the most important benefit when showcasing a technology to others.
- Ease of use - This can manifest itself via a specific protocol, API, or industry standard. Leveraging simple to understand APIs allows developers to rapidly test and adopt new technologies. After the “shininess” of the new toy wears off, this bubbles up to be the most important factor.
When evaluating new and emerging technologies in the distributed computation space, each of these capabilities listed are very important. However, I think that performance and ease of use are factors that should be at the top of everyone’s list. Consider Apache Spark—when it came on the scene, the first thing that people latched onto was its performance. Shortly thereafter, people were talking about how easy it was to get started with it and to start building ETL pipelines. Then, conversations quickly turned to the ability to do stream processing on the same platform with virtually the same APIs.
There are multiple options for stream processing within the Apache Foundation, including Storm, Spark Streaming, Samza, Flink, and Apex (incubating). This is quite possibly the most competitive space in distributed computing.
I have been speaking publicly on stream processing engines for well over a year now. Mostly my talks have revolved around use cases, processing models, and delivery guarantees, and ultimately nearly always ending up on performance and ease of use.
A year ago, Flink was out there and it wasn’t really getting a substantial amount of mindshare, as Spark has kept the industry abuzz. This last year, however, has been very kind to the Flink community. Flink has made tremendous strides forward in performance and ease of use.
Let’s discuss performance. When discussing stream processing engines, I like to remind people that making comparisons between engines without knowing the specifics that went into a specific benchmark tend to be fruitless. For example, exactly once is more expensive to implement than at-most once. Recently, an article was published that discussed a performance benchmark on Apache Flink that showed that it could handle 15 million events per second. In the article, it was pointed out that Flink’s exactly-once is superior in performance to Storm’s at-least-once. This goes to show that great performance can be accomplished with a great architecture.
Let’s discuss ease of use. This is an area where Flink really shines. Flink’s APIs are similar to Spark, which makes it easier for developers to learn Flink. If that were not enough, Flink also implements the Storm API. For anyone out there who has built applications on Storm, they can be run on Flink with relative ease.
Some of the more important features that help to reinforce Flink’s ease of use include:
- Support for event time and out-of-order events
- Exactly-once semantics for stateful computations
- Windows over time, count, or sessions, as well as data-driven windows
- Fault-tolerance via lightweight distributed snapshots
Flink General Availability
Just to be abundantly clear, I am still a fan of Apache Spark. Spark was founded on the notion of creating a better, faster, batch processing system. Shortly after that, stream processing was slapped on to it. While that is fine for some use cases, it is not real time, and it is not event-based. Apache Flink, on the other hand, was created as a stream processing (DataStream API) engine first, and then batch semantics (DataSet API) was added on top of its solid runtime foundation.
I’m really excited about the future of Flink. To help substantiate my excitement, Flink has just announced the 1.0 GA release. As a part of this 1.0 release, the public interfaces are guaranteed to remain compatible across minor versions (e.g., 1.x).
There are a number of new features being introduced in this release. While Flink already supports the Kafka 0.8 API, it is also adding support for the Kafka 0.9 API. This means it will work with both Kafka and MapR Streams.
If you didn’t think Flink could improve on its ease of use promise, this release also brings with it a Complex Event Processing (CEP) API. Here are some other new features that are important to consider in terms of ease of use:
- Improved checkpointing control and monitoring
- Savepoints and version upgrades
- Enhanced monitoring interface: job submission, checkpoint statistics, and backpressure monitoring
Flink currently supports running on top of YARN, as well as from Apache Myriad on Mesos. The Flink roadmap contains Mesos support. This is particularly exciting, as it will give users the ability to leverage Flink across their entire datacenter.
Competition is forcing each project in this space to become the best version that it can be. One year ago, when I started learning about Flink, I was impressed with their roadmap and the tenets of their platform, but Spark was the project garnering the dominant mindshare. Flink chose to accept that competitive challenge head on. Looking at the landscape now, it is clear that Flink's architectural foresight has paid off. This should come as great news to everyone in the big data community. In a technology space that seems quite crowded, Flink is blazing a new path. They have listened to the community and are delivering features that people want and need, and the other platforms are not quite keeping pace. As a matter of fact, I have little doubt that Flink is ready for primetime. Read Introduction to Apache Flink: Stream Processing for Real Time and Beyond to learn more.