Why Apache Spark Is Like a Fighter Jet

At Strata London in 2015, someone said to me, “Spark is like a fighter jet that you have to build yourself. Once you have it built, though, you have a fighter jet. Pretty awesome. Now you have to learn to fly it.” Let’s break down this quote to see what it tells us about Spark (read more in the ebook Getting Started with Apache Spark: From Inception to Production).

How are fighter jets relevant to Spark? Well, building scalable applications can be difficult, and putting them into production is even more difficult. Spark scales out of the box nearly as easily as it installs, and a great deal of design work has gone into making it a scalable platform.

Spark is powerful not only in terms of scalability, but also in ease of building applications. Spark offers an API in multiple languages that allows nearly any business application to be built on top of it. However, just because Spark itself scales easily doesn’t mean everything written to run on Spark will scale just as easily.

In this blog post, I will talk about the importance of really understanding the software and technology in order to use it properly. In addition, I will discuss how to plan for the coexistence of Spark and Hadoop, and why it’s critical to have a fine-grained, intuitive monitoring tool that can provide a wide, macro view of your system.

Learning to Fly

Even though the API is nearly identical, or at least similar, across languages, that doesn’t solve the problem of understanding the programming language of choice. A novice programmer may be able to write Java code with minimal effort, but that doesn’t mean he or she understands the proper constructs in the language to optimize for a given use case.

Let’s consider analytics in Python on Spark. While users may understand Python analytics, they may have no experience with concepts like predicate pushdown, column pruning, or filter scans. These features can have a significant impact when running queries at scale (a short sketch follows the list below). Here are a few other topic areas in which people may misunderstand how Spark works by drawing on experiences with other technologies:

  • Spark supports the MapReduce model, but people with a lot of Hadoop MapReduce experience may be unfamiliar with Spark concepts that have no direct equivalent there, such as functional programming constructs, type safety, or lazy evaluation.
  • Someone with database administration experience on any popular RDBMS may not think about partitioning and serialization in terms that are useful in Spark.
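To make the first point concrete, here is a minimal PySpark sketch of predicate pushdown and column pruning in action. The dataset path and column names are illustrative assumptions on my part; the point is that selecting only the columns you need and filtering through the DataFrame API lets Spark push that work down into the scan of a columnar format such as Parquet.

    # A minimal PySpark sketch of predicate pushdown and column pruning.
    # The dataset path and column names below are illustrative assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("pushdown-demo").getOrCreate()

    # Reading a columnar format such as Parquet lets Spark skip unused
    # columns entirely and push filters down into the scan.
    events = spark.read.parquet("/data/events")

    recent_errors = (
        events
        .select("event_time", "level", "message")   # column pruning
        .filter(F.col("level") == "ERROR")          # predicate pushdown
    )

    # The physical plan shows whether the filter was pushed into the scan
    # (look for "PushedFilters" in the output).
    recent_errors.explain()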

Another problem scenario is trying to run multiple use cases on a single Spark cluster. The JVM settings for a Spark cluster might be tuned for one use case, but those same settings may be a poor fit for a different use case deployed alongside it. With technologies like Mesos and YARN, though, this shouldn’t be a real problem: multiple Spark clusters can be deployed to cover specific use cases. It might even be beneficial to create an ETL cluster and perhaps a cluster dedicated to streaming applications, all running on the same underlying hardware.
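As a hedged illustration of that idea, the sketch below defines two different configuration profiles, one for a batch ETL application and one for a streaming application, that would run as separate Spark applications on the same YARN or Mesos cluster. The application names and specific values are assumptions for illustration, not tuning recommendations.

    from pyspark import SparkConf

    # Batch/ETL profile: fewer, larger executors and more shuffle partitions.
    # All names and values here are hypothetical.
    etl_conf = (
        SparkConf()
        .setAppName("nightly-etl")
        .set("spark.executor.memory", "8g")
        .set("spark.executor.cores", "4")
        .set("spark.sql.shuffle.partitions", "400")
    )

    # Streaming profile: smaller executors, with backpressure enabled so a
    # slow batch doesn't let input queue up unbounded.
    streaming_conf = (
        SparkConf()
        .setAppName("clickstream-ingest")
        .set("spark.executor.memory", "2g")
        .set("spark.executor.cores", "2")
        .set("spark.streaming.backpressure.enabled", "true")
    )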

While these examples are not intended to be exhaustive, they hopefully make the point that a person must fully understand a language or platform to get the most from it. A person must “learn to fly”: he or she must truly understand the software and technology in order to use it to its fullest capacity.

Fighter Jets and Battleships: Planning for the Coexistence of Spark and Hadoop

Spark can fly solo, but it is more commonly deployed as part of a cluster managed by Mesos or by the YARN resource manager within Hadoop. Spark should always run as close to the cluster’s storage nodes as possible. As with Hadoop, network I/O is likely to be the biggest bottleneck in a deployment, and deploying with 10Gb or faster networking hardware will minimize latency and yield the best results. Never allocate more than 75% of available RAM to Spark: the operating system needs memory as well, and going higher can cause paging. If a use case is severely limited with 75% of available RAM, it might be time to add more servers to the cluster.
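As a back-of-the-envelope illustration of the 75% guideline, the numbers below are assumptions (a hypothetical 128 GB worker node split across four executors), not a sizing recommendation:

    # Hypothetical sizing sketch for the "at most 75% of RAM" guideline.
    node_ram_gb = 128                              # total RAM on one worker node (assumed)
    spark_budget_gb = int(node_ram_gb * 0.75)      # leave ~25% for the OS and page cache
    executors_per_node = 4                         # assumed layout
    per_executor_gb = spark_budget_gb // executors_per_node

    print(f"Spark budget: {spark_budget_gb} GB across {executors_per_node} executors "
          f"-> roughly spark.executor.memory={per_executor_gb}g each")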

From the Control Tower: Reliability and Performance through Monitoring

As more organizations begin to deploy Spark in their production clusters, the need for fine-grained monitoring tools is becoming paramount. Having the ability to view Spark resource consumption and monitor how Spark applications are interacting with other workloads on your cluster can help you save time and money by:

  • Troubleshooting misbehaving applications
  • Monitoring and improving performance
  • Viewing trend analysis over time

When deploying Spark in production, here are some crucial considerations to keep in mind:

Are You Monitoring the Right Metrics? 

How granular is your visibility into Spark’s activity on your cluster? Can you view all the variables you need to? These are important questions, especially when troubleshooting errant applications or behavior.

With an out-of-the-box installation, Spark’s Application Web UI can display basic, per-executor information about memory, CPU, and storage. By accessing the web instance on port 4040 (default), you can see statistics about specific jobs, such as their duration, number of tasks, and whether they’ve completed.
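The same job-level statistics the Web UI displays are also exposed as JSON through Spark’s monitoring REST API under /api/v1 on the UI port. The sketch below assumes a driver running locally on the default port 4040:

    # Pulls basic job statistics from the Spark UI's REST endpoints.
    # Assumes the driver is reachable at localhost on the default port 4040.
    import requests

    base = "http://localhost:4040/api/v1"

    for app in requests.get(f"{base}/applications").json():
        jobs = requests.get(f"{base}/applications/{app['id']}/jobs").json()
        for job in jobs:
            print(app["name"], job["jobId"], job["status"],
                  f"{job['numCompletedTasks']}/{job['numTasks']} tasks")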

However, this default monitoring capability isn’t necessarily adequate. Take a basic scenario: suppose a Spark application is reading heavily from disk, and you want to understand how it’s interacting with the file subsystem because the application is missing critical deadlines. Can you easily view detailed information about file I/O (both local file system and HDFS)? No, not with the default Spark Web UI. But this granular visibility would be necessary to see how many files are being opened concurrently and whether a specific disk is hot-spotting and slowing down overall performance. With the right monitoring tool, discovering that the application attempted to write heavily to disk at the same time as a MapReduce job could take only seconds instead of the minutes or hours it would have taken with basic Linux tools such as top or iostat.
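For comparison, here is roughly what that manual, iostat-style checking looks like: a hedged sketch that samples per-disk I/O counters over a few seconds (using the third-party psutil package, an assumption on my part), and which still tells you nothing about which Spark application caused the traffic.

    # Samples per-disk I/O counters to spot a disk that is hot-spotting.
    # Requires the psutil package; the interval and units are arbitrary choices.
    import time
    import psutil

    before = psutil.disk_io_counters(perdisk=True)
    time.sleep(5)                                   # sampling interval
    after = psutil.disk_io_counters(perdisk=True)

    for disk, stats in after.items():
        read_mib = (stats.read_bytes - before[disk].read_bytes) / 2**20
        write_mib = (stats.write_bytes - before[disk].write_bytes) / 2**20
        print(f"{disk}: read {read_mib:.1f} MiB, wrote {write_mib:.1f} MiB in 5s")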

These variables are important, and without them you may be flying blind. Having deep visibility helps you quickly troubleshoot and respond in-flight to performance issues. Invest time in researching an add-on monitoring tool for Spark that meets your organization’s needs.

Is Your Monitoring Tool Intuitive? 

It’s great to have lots of data and metrics available to digest, but can you navigate that data quickly? Can you find what you need and, once you do, make sense of it? The way quantitative information is displayed makes a difference. Your monitoring tool should allow you to easily navigate across different time periods and zoom in on a few seconds’ worth of data. You should have the option to plot the data in various ways: line charts, by percentile, in a stacked format, and so on. Note whether you can filter the data easily by user, job, host, or queue. In short, does your monitoring tool follow your line of questioning intuitively, or do you have to work around the limitations of what it presents?

If you have all the data but can’t sort it easily to spot trends or filter it quickly to drill down into a particular issue, then your data isn’t helping you. You won’t be able to effectively monitor cluster performance or take advantage of the data you do have. So make sure your tool is useful in all senses of the word.

Can You See Global Impact?

Even if you are able to see the right metrics via an intuitive dashboard or user interface, being limited in vantage point to a single Spark job or node view is not helpful. Whatever monitoring tool you choose should allow you to see not just one Spark job but all of them—and not just all your Spark jobs but everything else happening on your cluster, too. How is Spark impacting your HBase jobs or your other MapReduce workloads?

Spark is only one piece in your environment, so you need to know how it integrates with other aspects of your Hadoop ecosystem. This is a no-brainer from a troubleshooting perspective, but it’s also a good practice for general trend analysis. Perhaps certain workloads cause greater impact to Spark performance than others, or vice versa. If you anticipate an increase in Spark usage across your organization, you’ll have to plan for it differently.

In summary, the reliability and performance of your Spark deployment depends on what’s happening on your cluster, including both the execution of individual Spark jobs and how Spark is interacting with (and impacting) your broader Hadoop environment. To really understand what Spark’s doing, you’ll need a monitoring tool that can provide both deep, granular visibility and a wider, macro view of your entire system. These sorts of tools are few and far between, so choose wisely.

Summary

In this blog post, I discussed the importance of really understanding the Spark software and technology, planning for the coexistence of Spark and Hadoop, and having access to a fine-grained monitoring tool that can provide a wide view of your system.

If you have any additional questions about using Spark, please ask them in the comments section below.

Ready to fly with Spark? Check out these additional resources to help you get started.


Ebook: Getting Started with Apache Spark
Interested in Apache Spark? Experience our interactive ebook with real code, running in real time, to learn more about Spark.

Ebook: Streaming Data Architecture: New Designs Using Apache Kafka and MapR Streams
Download for free.