Although cluster-based installations of Spark can become large and relatively complex by integrating with Mesos, Hadoop, Cassandra, or other systems, it is straightforward to download Spark and configure it in standalone mode on a laptop or server for learning and exploration. This low barrier to entry makes it relatively easy for individual developers and data scientists to get started with Spark, and for businesses to launch pilot projects that do not require complex re-tooling or interference with production systems.
Apache Spark is open source software and can be freely downloaded from the Apache Software Foundation. Building Spark from source requires at least version 6 of Java and at least version 3.0.4 of Maven. Other dependencies, such as Scala and Zinc, are automatically installed and configured as part of the build process.
Build options, including optional links to data storage systems such as Hadoop's HDFS or Hive, are discussed in more detail in Spark's online documentation.
A Quick Start guide, optimized for developers familiar with either Python or Scala, is an accessible introduction to working with Spark.
Follow these simple steps to download Java, Spark, and Hadoop and get them running on a laptop (in this case, one running Mac OS X). If you do not currently have the Java JDK (version 7 or higher) installed, download it and follow the steps to install it for your operating system.
Visit the Spark downloads page, select a pre-built package, and download Spark. Double-click the archive file to expand its contents ready for use.
Open a text console, and navigate to the newly created directory. Start Spark's interactive shell:
./bin/spark-shell
A series of messages will scroll past as Spark and Hadoop are configured. Once the scrolling stops, you will see a simple prompt:
scala>
At this prompt, let's create some data: a simple sequence of numbers from 1 to 50,000.
val data = 1 to 50000
Now, let's place these 50,000 numbers into a Resilient Distributed Dataset (RDD), which we'll call sparkSample. Spark performs its analysis on this RDD.
val sparkSample = sc.parallelize(data)
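Before going further, it can help to confirm what the RDD now holds. The following is a minimal sketch, assuming it is typed into the same spark-shell session, where sc is the SparkContext that the shell creates automatically. The names total, firstFive, and numPartitions are chosen here purely for illustration.

```scala
// Repeated from the steps above so that this snippet stands alone in the shell.
val data = 1 to 50000
val sparkSample = sc.parallelize(data)

// count() is an action: it runs a job and returns the number of elements as a Long.
val total = sparkSample.count()

// take(5) returns the first five elements to the driver as a local Array.
val firstFive = sparkSample.take(5)

// The number of partitions depends on the machine (typically the number of cores
// in local mode), so no particular value is assumed here.
val numPartitions = sparkSample.partitions.length
```

On a laptop, the partition count will usually match the number of available cores, which is why it varies from one machine to another.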
Now we can filter the data in the RDD to find the values less than 10.
sparkSample.filter(_ < 10).collect()
Spark should report the result: an array containing the values 1 through 9. Richer and more complex examples are available in resources mentioned elsewhere in this guide.
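The same steps can also be packaged as a standalone program rather than typed into the shell. The sketch below is one possible arrangement, not the only one. It assumes the Spark core library is on the classpath and that a local master ("local[*]") is acceptable for experimentation. The object name SparkSampleApp and the helper smallValues are hypothetical names introduced here for illustration. Note that filter is a lazy transformation: Spark does no work until the collect() action asks for results.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkSampleApp {

  // Builds an RDD of the integers 1..upTo and returns those below `limit`.
  // filter only describes the computation; collect() triggers it and
  // brings the matching values back to the driver as a local Array.
  def smallValues(sc: SparkContext, upTo: Int, limit: Int): Array[Int] = {
    val sparkSample = sc.parallelize(1 to upTo)
    sparkSample.filter(_ < limit).collect()
  }

  def main(args: Array[String]): Unit = {
    // Unlike the shell, a standalone program must create its own SparkContext.
    // "local[*]" runs Spark inside this JVM using all available cores.
    val conf = new SparkConf().setAppName("SparkSampleApp").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      println(smallValues(sc, 50000, 10).mkString(", "))
    } finally {
      sc.stop() // release Spark's resources even if the job fails
    }
  }
}
```

Keeping the RDD logic in a small function that receives the SparkContext makes it easier to reuse the same code from the shell, from tests, or from a cluster deployment with a different master setting.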
Spark's low barrier to entry eases the burden of learning a new toolset. Barrier to entry should always be a consideration for any new technology a company evaluates for enterprise use.