Editor's Note: In this week's Whiteboard Walkthrough, Anoop Dawar, Senior Product Director at MapR, shows you the basics of Apache Spark and how it is different from MapReduce.
Here's the video transcript:
Hi, I'm Anoop Dawar. I am a Senior Director of Product Management here at MapR. Today, I'm going to talk to you about Spark and why it's important for Hadoop. If you look back, you will see that MapReduce has been the mainstay on Hadoop for batch jobs for a long, long time. However, two very promising technologies have emerged over the last year: Apache Drill, which is a low-latency SQL engine for self-service data exploration, and Spark, which is a general-purpose compute engine that allows you to run batch, interactive and streaming jobs on the cluster using the same unified framework. Let's dig a little bit more into Spark.
To understand Spark, you really have to understand three big concepts. One is RDDs, the resilient distributed datasets. This is really a representation of the data that's coming into your system in an object format, and it allows you to do computations on top of it. RDDs are resilient because they carry their lineage. Whenever there's a failure in the system, they can recompute themselves from the prior information using that lineage. The second concept is transformations. Transformations are what you do to RDDs to get other RDDs. Examples of transformations would be things like opening a file and creating an RDD, or applying functions like filter that would then create other RDDs.
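To make the vocabulary concrete, here is a minimal sketch in plain Python. It is a toy model and not Spark's API or implementation: `ToyRDD`, `from_lines`, and the sample lines are hypothetical names and data made up for illustration. It shows a dataset that is described by a recipe, a transformation that returns a new dataset instead of changing the old one, and a lineage chain that links each dataset back to its parent.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ToyRDD:
    """Toy stand-in for an RDD: a recipe for the data plus a link to its parent."""
    name: str
    compute: Callable[[], list]         # how to produce this dataset's rows
    parent: Optional["ToyRDD"] = None   # lineage link to the dataset it came from

    def filter(self, predicate, name):
        # Transformation: builds a new ToyRDD and leaves this one untouched.
        return ToyRDD(name, lambda: [x for x in self.compute() if predicate(x)], parent=self)

    def count(self):
        # Action: actually runs the recipe and returns an answer.
        return len(self.compute())

    def lineage(self):
        # Walk the parent links from this dataset back to the original source.
        node, chain = self, []
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain


def from_lines(lines, name):
    # Hypothetical helper standing in for "open a file and create an RDD".
    data = tuple(lines)  # frozen copy, so the source cannot change underneath us
    return ToyRDD(name, lambda: list(data))


text = from_lines(["Spark is fast", "Hadoop stores data", "Spark runs on Hadoop"], "text_file")
with_spark = text.filter(lambda line: "Spark" in line, "lines_with_spark")
print(with_spark.lineage())
print(with_spark.count())
```

The frozen dataclass is a deliberate choice in this sketch: once a dataset is defined it cannot be modified, only transformed into another one, which is the property that makes recomputation from lineage safe.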
The third and final concept is actions. These are things where you're actually asking for an answer that the system needs to provide you, for instance, a count, or asking a question such as what's the first line that has Spark in it. The interesting thing with Spark is that it does lazy evaluation, which means that these RDDs are not loaded and pushed into the system as and when the system encounters an RDD; they are only computed when there is actually an action to be performed. Let's walk through this example one time. Here in this first step we are reading a text file and creating an RDD. Notice that at this point this is simply a specification that allows Spark to create a directed acyclic graph that tells it that it needs to get data from this file and push it into this representative RDD format.
The second step is to take this text file and apply a filter, which is really looking for Spark in every line of the text file. Again, this is simply a specification of what the system needs to do. The resultant RDD is another RDD, the lines-with-Spark RDD. Notice at this point, after Step 1 and Step 2, that Spark may not have actually brought in data or computed anything at all. Then we come to an action where we ask for a count of the number of lines with Spark. This now triggers Spark to go look back at the directed acyclic graph and place it in an optimized manner on the Hadoop cluster. Unlike MapReduce, which was constrained to map and reduce stages, a complex RDD graph can be placed in the most optimized manner on the Hadoop cluster.
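The walkthrough above can be imitated in a few lines of plain Python. This is only a sketch of the lazy evaluation idea and not how Spark is built: `LazyDataset`, `read_lines`, and the `loads` list are hypothetical names, and the three sample lines stand in for the text file on the whiteboard. The point it illustrates is that Steps 1 and 2 only record what to do, and the data is read for the first time when the count is requested.

```python
class LazyDataset:
    """Toy dataset that records steps and runs them only when an action is called."""

    def __init__(self, source, steps=()):
        self._source = source   # zero-argument callable that loads the raw lines
        self._steps = steps     # recorded filter predicates, not yet applied

    def filter(self, predicate):
        # Transformation: only records the step and returns a new dataset.
        return LazyDataset(self._source, self._steps + (predicate,))

    def count(self):
        # Action: this is the first point where the source is actually read.
        rows = self._source()
        for predicate in self._steps:
            rows = [row for row in rows if predicate(row)]
        return len(rows)


loads = []  # one entry is appended every time the "file" is really read


def read_lines():
    # Hypothetical stand-in for reading a text file from the cluster.
    loads.append("read")
    return ["Spark is fast", "Hadoop stores data", "Spark runs on Hadoop"]


text_file = LazyDataset(read_lines)                                  # Step 1: specification only
lines_with_spark = text_file.filter(lambda line: "Spark" in line)   # Step 2: specification only
reads_before_action = len(loads)                                     # nothing has been read yet

spark_line_count = lines_with_spark.count()                          # the action triggers the work
reads_after_action = len(loads)

print(reads_before_action, spark_line_count, reads_after_action)
```

Because the whole chain is known before anything runs, an engine is free to decide how to execute it. This toy simply runs the steps in order, whereas Spark uses the graph to plan and place the work on the cluster.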
One thing that comes up with RDDs, when we come back to them being resilient and in main memory, is how they compare with distributed shared memory architectures, which most of us are familiar with from our past lives. There are a few differences. Let's go through them briefly. First of all, writes in RDDs are coarse-grained. They happen at an RDD level. As you notice, there is a full RDD assigned here and another one assigned here. Writes in distributed shared memory are typically fine-grained. Reads in distributed shared memory are fine-grained as well. Reads in RDDs can be fine- or coarse-grained.
The second piece is recovery. What happens if there is a fault in the system? How do you recover from it? Since RDDs build this lineage graph, if something goes bad they can go back and recompute based on that graph and regenerate the RDD. Lineage is used very strongly in RDDs for recovery. In distributed shared memory, we typically go back to checkpointing at periodic intervals or some other semantic checkpointing mechanism. Consistency is relatively trivial in RDDs because the data underneath is assumed to be immutable. If, however, the data were changing, then consistency would be a problem here. Distributed shared memory doesn't make any assumptions about mutability and, therefore, leaves the consistency semantics to the application to take care of.
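Lineage-based recovery can be sketched in plain Python as follows. This is a toy model under stated assumptions and not Spark's recovery code: the `Partitioned` class, its `lose` method, and the three-block input are all hypothetical. Each dataset remembers how to rebuild any one of its partitions from its parent, so when a partition is lost only that piece is recomputed, with no checkpoint to roll back to.

```python
class Partitioned:
    """Toy partitioned dataset that can rebuild any partition from its lineage."""

    def __init__(self, n_parts, build_part):
        self.n_parts = n_parts
        self._build_part = build_part   # lineage: how to rebuild partition i
        self._cache = {}                # partitions currently held "in memory"
        self.rebuilds = []              # indexes of partitions computed so far

    def partition(self, i):
        # Rebuild the partition from lineage only if it is not in memory.
        if i not in self._cache:
            self.rebuilds.append(i)
            self._cache[i] = self._build_part(i)
        return self._cache[i]

    def map(self, fn):
        # Coarse-grained write: the function applies to the whole dataset at once.
        return Partitioned(self.n_parts, lambda i: [fn(x) for x in self.partition(i)])

    def collect(self):
        return [x for i in range(self.n_parts) for x in self.partition(i)]

    def lose(self, i):
        # Hypothetical failure injection: drop one in-memory partition.
        self._cache.pop(i, None)


source_blocks = [[1, 2], [3, 4], [5, 6]]   # treated as immutable input
base = Partitioned(3, lambda i: list(source_blocks[i]))
squared = base.map(lambda x: x * x)

first = squared.collect()    # materializes all three partitions
squared.lose(1)              # simulate losing the node that held partition 1
second = squared.collect()   # only partition 1 is recomputed from lineage

print(second)
print(squared.rebuilds)
```

The recomputed result is only trustworthy because the input is immutable. If `source_blocks` changed between the two calls, the rebuilt partition would no longer agree with the others, which is the consistency problem the transcript mentions.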
Let's summarize the benefits of Spark. Spark provides full recovery using lineage. Spark is optimized in making computations as well as in placing the computations optimally using the directed acyclic graph. It offers very easy programming paradigms using transformations and actions on RDDs, as well as ready, rich library support for machine learning, graph processing and, recently, DataFrames. At this point a question comes up. If Spark is so great, does Spark actually replace Hadoop? The answer is clearly no, because Spark provides an application framework for you to write your big data applications. However, it still needs to run on a storage system or on a NoSQL system.
With MapR, you can have Hadoop-based applications and your Hadoop applications running on a single cluster and can run Spark on top of that. The other question that comes up is how Spark support on MapR is different from that of other commercial distributions. MapR is the only commercial distribution that supports the complete Spark stack, including Spark SQL and GraphX. MapR supports both the standalone and YARN modes of running Spark on the cluster. It's supported on all editions of MapR: the Community Edition, the Enterprise Edition, and the Enterprise Database Edition.
The next question that comes up is how Spark on MapR is different from or better than running Spark on other distributions. If you take a step back and recognize what running the MapR Hadoop distribution provides you, you get unlimited scale, scaling to billions of files and thousands of nodes without any metadata scalability issues. You have an enterprise-grade platform with high availability, durability, disaster recovery and data protection. You have a wide range of applications with various SQL engines, Hive, Pig and several other open-source components.
Spark provides ease of deployment and in-memory performance, and allows you to combine batch, interactive and streaming workflows. When you take both of these together, you can run your operational and analytical workloads on a single cluster in a highly performant, scalable, beautiful way while still providing ease of programming to your developers and data analysts. Thank you for watching!
As always, if you have any questions please ask them in the comments section below!