You already know Hadoop as one of the best, cost-effective platforms for deploying large-scale big data applications. But Hadoop is even more powerful when combined with execution capabilities provided by Apache Spark. Although Spark can be used with a number of big data platforms, with the right Hadoop distribution, you can build big data applications quickly using tools you already know.
What is Apache Spark?
Apache Spark is a general-purpose engine for processing large amounts of data, designed to let developers build big data applications quickly. Spark’s distinguishing feature is the Resilient Distributed Dataset (RDD). This data structure can be stored either in memory or on disk. Keeping the objects in memory offers a substantial performance boost, since your application doesn’t have to waste time fetching data from disk. On a large cluster, your data might be spread across hundreds, even thousands, of nodes.
Not only is Apache Spark fast, it’s also reliable. Spark is designed to be fault-tolerant, able to recover from data loss due to, for instance, node or process failure. You can use Apache Spark with any file system, but with Hadoop, you’ll get a reliable, distributed file system that will serve as the base for all your big data processing.
Another major source of efficiency in developing big data applications is the human element. Development tools often make the job more complicated than it already is, but Apache Spark stays out of the programmer’s way. There are two keys to rapid application development with Apache Spark: the shell and the APIs.
One of the greatest benefits of scripting languages is their interactive shells. Going all the way back to the early days of Unix, shells let you try out your ideas quickly without being slowed down by a write/test/compile/debug cycle.
Have an idea? You can try it and see what happens now. It’s a simple idea that makes you more productive on a local machine. Just wait and see what happens when you have access to a big data cluster.
Spark offers both a Scala and a Python shell; pick whichever you’re most comfortable with. On Unix-like systems, you can find the Python shell at ./bin/pyspark and the Scala shell at ./bin/spark-shell in the Spark directory.
Once you’ve got the shell up and running, you can import data into RDDs and perform all kinds of operations on them, such as counting lines or finding the first item in a list. Operations are split into transformations, which build a new RDD from an existing one, and actions, which return values. You can also write custom functions and apply them to your data; in the Python shell, these are ordinary Python functions (or lambdas) that you pass to the methods of the RDD object you create.
For example, to import a text file into Spark as an RDD in the Python shell, type:
textFile = sc.textFile("hello.txt")
Here’s a line-counting action:
textFile.count()
Here’s a transformation that returns a new RDD containing only the lines with “MapR” in them:
textFile.filter(lambda line: "MapR" in line)
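The split between lazy transformations (like filter) and eager actions (like count) is the heart of the RDD programming model. As a rough, single-machine sketch of that idea, here is a toy class (the name LocalRDD is made up for illustration, and this is not how Spark is actually implemented, since real RDDs are distributed and fault-tolerant):

```python
# Toy single-machine analogue of the RDD pattern, for illustration only.
# Transformations such as filter() and map() are lazy: they return a new
# object that remembers the work to do. Actions such as count() and
# first() actually walk the data and return a value.
class LocalRDD:
    def __init__(self, items):
        self._items = items  # an iterable; nothing is computed yet

    # --- transformations: return a new LocalRDD, do no work yet ---
    def filter(self, predicate):
        return LocalRDD(x for x in self._items if predicate(x))

    def map(self, fn):
        return LocalRDD(fn(x) for x in self._items)

    # --- actions: force the computation and return a value ---
    def count(self):
        return sum(1 for _ in self._items)

    def first(self):
        return next(iter(self._items))

lines = LocalRDD(["MapR ships Spark", "hello world", "more MapR docs"])
print(lines.filter(lambda line: "MapR" in line).count())  # prints 2
```

The design mirrors the shell session above: chaining a transformation into an action is what triggers the work, which is why the filter line by itself returns instantly in the real Spark shell.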
While Spark itself is written in Scala, you can use APIs to make your job easier. If you’ve been using the Python or Scala shells, you’re already using the APIs for those languages. All you have to do is save your programs into scripts with very few changes.
If you’re looking to build something more robust, you can use the Java API. Even if you ultimately end up implementing your program in Java, you can still sketch out your ideas in the shell to make sure you’ve got your algorithms right before deploying to your cluster.
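Turning shell code into a script mostly means creating the SparkContext yourself. A hypothetical standalone version of the session above might look like this (the master "local", the app name, and the input path "hello.txt" are illustrative assumptions, and running it requires a PySpark installation; the predicate is a plain function so it can be tested without a cluster):

```python
# Hypothetical standalone-script sketch of the shell session above.
# App name, master ("local"), and input path are illustrative assumptions.

def contains_mapr(line):
    # Same logic as the shell's lambda, factored out and named.
    return "MapR" in line

def main():
    from pyspark import SparkContext  # requires PySpark to be installed
    sc = SparkContext("local", "LineFilter")
    lines = sc.textFile("hello.txt")
    print(lines.filter(contains_mapr).count())
    sc.stop()

if __name__ == "__main__":
    try:
        main()
    except ImportError:
        # PySpark isn't available locally; the predicate still works on its own.
        print("PySpark not installed; skipping the cluster run")
```

This is the "very few changes" point in practice: the RDD logic is identical to what you typed interactively, and only the context setup is new.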
You can build complex applications using these easy-to-use APIs and deploy them in real time. You can even build applications or big data pipelines that mix and match technologies, such as an application that builds a graph out of machine learning results.
The power and flexibility that Apache Spark, backed by the Hadoop platform, offers is obvious. With a MapR distribution that supports the full Spark stack, a programmer can easily create complex big data applications across both real-time and batch data.
The world moves fast. With all of the data your business is accumulating, you need a way to churn through it quickly. While you can build big data clusters to try to sift through it, you need the right tools—tools that are designed to process large amounts of data, and quickly. While Spark, running on Hadoop, can do that, the biggest advantage is in developer productivity. By using rapid Scala and Python with Spark, you can do so much more in much less time. You and your developers can go where your big data ideas take you.