Editor's Note: Here is the video transcript:
Hi, welcome to MapR Whiteboard Walkthrough sessions. My name is Abhinav and I'm one of the data engineers here at MapR. The purpose of this video is to compare Storm Trident and Spark Streaming. As you may be aware, Storm and Spark are very popular projects within the community. Storm is a stream processor that Twitter open-sourced in 2011, and Spark is a general-purpose, in-memory processing framework. Both offer stream processing solutions.
No side-by-side comparison is readily available right now because Spark Streaming is a fairly new project. Having said that, both of those projects are very capable, and today in this whiteboarding session, we'll go over certain points that will help you pick the solution that fits your applications. The points that I want to cover today are fault tolerance, ease of deployment, the language choices that are available to you, and compatibility with YARN. Both frameworks are built on a streaming architecture, meaning there is an inbound, infinite sequence of tuples.
In Storm nomenclature, it's called a stream, and Trident partitions it into finite-sized batches of tuples. In Spark nomenclature, the stream is called a DStream, or discretized stream. Spark Streaming is a framework that uses the Spark data processing engine to discretize the stream into finite-sized RDDs, which are essentially microbatches used to process messages. Both of them allow stateful processing, meaning that if you ever lose a worker node or driver node, you can recover from the saved state, and whatever data came in during that time will be replayed. This gives you exactly-once semantics, which is very important in maintaining time-based averages or message counts when you are performing trend analyses or making recommendations.
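To make the idea of discretizing a stream concrete, here is a minimal sketch in plain Python. It is not the Spark or Storm API. The `discretize` function, the `batch_interval_ms` parameter, and the sample events are all hypothetical names chosen for illustration. The sketch simply groups timestamped messages into fixed-width time windows, which is the general shape of a time-bounded microbatch.

```python
from collections import defaultdict


def discretize(events, batch_interval_ms):
    """Group (timestamp_ms, value) events into fixed-width time windows.

    Returns a list of (window_start_ms, [values]) pairs in time order.
    Illustrative only: a real engine does this continuously on live input.
    """
    batches = defaultdict(list)
    for timestamp_ms, value in events:
        # Integer division maps each event to the start of its window.
        window_start = (timestamp_ms // batch_interval_ms) * batch_interval_ms
        batches[window_start].append(value)
    return sorted(batches.items())


# Hypothetical sample input: five messages spread over three seconds.
events = [(200, "a"), (900, "b"), (1100, "c"), (2500, "d"), (2700, "e")]
micro_batches = discretize(events, batch_interval_ms=1000)

for window_start, values in micro_batches:
    print(window_start, values)
```

One simplification to be aware of: this sketch emits nothing for a window that received no events, and it assumes all events are already in hand, whereas a streaming engine closes each window as time passes.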
The RDD is the cornerstone of Spark. RDD stands for Resilient Distributed Dataset, which is a finite-sized, immutable data structure that can be replayed as needed and is replicated over HDFS or any type of persistent file system. You might ask, "How is state maintained within Storm and Spark?" Well, the answer depends on you. Storm allows you to plug in memcached, HBase, or any other type of resilient data store, whether in-memory or backed by persistent storage, so that you can automate this state maintenance.
Storm has transactional spouts, as well as bolts, that guarantee exactly-once semantics, but there are different types of spouts. There are opaque spouts and there are transactional bolts. There are way too many details to cover in one video, but we'll provide you the links at the bottom of this video.
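One common way to get exactly-once state updates when batches can be replayed is to store the id of the last applied batch alongside the value, and to skip a batch whose id has already been recorded. The sketch below illustrates that idea in plain Python. It is loosely modeled on the transactional-state idea and is not Trident's actual API. The `TransactionalCounter` class and its in-memory dictionary are hypothetical stand-ins for an external store such as memcached or HBase. It assumes that a replayed batch carries the same id and the same contents as the original, and that batches are applied in order.

```python
class TransactionalCounter:
    """Keeps a per-key count plus the id of the last batch applied to it."""

    def __init__(self):
        # Hypothetical stand-in for an external store: key -> (txid, count)
        self.store = {}

    def apply_batch(self, txid, keys):
        # First aggregate within the batch, so each key is written once.
        deltas = {}
        for key in keys:
            deltas[key] = deltas.get(key, 0) + 1

        for key, delta in deltas.items():
            last_txid, current = self.store.get(key, (None, 0))
            if last_txid == txid:
                # This batch was already applied to this key; a replay
                # must not count it a second time.
                continue
            self.store[key] = (txid, current + delta)

    def count(self, key):
        return self.store.get(key, (None, 0))[1]


counter = TransactionalCounter()
counter.apply_batch(1, ["storm", "spark", "storm"])
counter.apply_batch(2, ["spark"])
counter.apply_batch(2, ["spark"])  # replay of batch 2 after a simulated failure

print(counter.count("storm"), counter.count("spark"))
```

The replay of batch 2 leaves the counts unchanged, which is the property that matters for message counts and time-based averages. If a replayed batch could contain different tuples than the original, this simple check would not be enough. That is the case opaque spouts are meant to handle.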
There is a notion of worker nodes within Storm that are the actual workhorses. The equivalent notion in Spark would also be a worker node. You might ask, "How would you achieve fault tolerance?" Both of these streaming frameworks allow you to maintain fault tolerance using an external data store, as I described before. The semantics differ a little between the two.
The next point to cover is the type of programming languages that are available. Both of these frameworks support Java or any JVM-based language. Spark is written primarily in Scala, with some Java, and Storm uses Java and Clojure. If you're familiar with any JVM-based language, you will find yourself in familiar territory, so the language choice doesn't really matter.
One advantage when you're choosing Spark Streaming is that the same code that you write for general-purpose Spark applications can be very easily ported to streaming, so your codebase maintenance becomes much easier. In terms of functionality, both Trident and Spark offer microbatches that can be constrained by time. Functionality-wise, they're largely equivalent, but implementation-wise, there are different semantics. So if you ask, "Which framework should I choose for my application?", the answer is that it depends on your application's requirements.
If you require strict, stateful processing, you can choose Spark Streaming because it guarantees exactly-once semantics, or you can choose Storm Trident. Keep in mind that Storm Trident is a fairly new project, and maintaining state within bolts and spouts is going to cause some performance degradation, the exact magnitude of which you can gauge only by benchmarking your application. In terms of development effort, if you are already familiar with Spark, then you can just use Spark Streaming, because the learning curve is very gentle.
How do these frameworks work on YARN? Well, Spark and Storm Trident each offer their own application master, so you can essentially co-locate both of these applications on a cluster that is running YARN. This gives you dynamic scaling and security, and it's a very hassle-free deployment model. We'll provide you the link pertinent to the topic at the bottom of the page. That's all, folks. If you have any questions, you can follow us @MapR with the hashtag #WhiteboardWalkthrough. Thanks a lot for your time. I hope you find this video very helpful!
Want to learn more? Check out these resources: