Apache Spark™ Streaming

When Hadoop first emerged, it provided a platform to store petabytes of data and perform batch queries on that data to gather insights. This model works well for many use cases, such as analyzing vast amounts of customer data for interesting patterns. However, not all data can wait for a batch query to run. Spark Streaming brings streaming computation to Hadoop, meaning that processing occurs in real time on data as it is streamed from a source, enabling many new use cases, including:

  • Credit card fraud detection: Each time a credit card is swiped the issuer needs to ensure the request is legitimate. Spark Streaming can be used to quickly analyze the request against predefined fraud models, taking into account past behavior of both the consumer and the merchant.
  • Real-time ad matching: Big data has brought a whole new level of customization to online advertising. Today, when users visit a web page, behind the scenes an auction takes place between many advertisers to determine the best ad to show based on the demographic of the user. Spark Streaming is ideal for this type of time-sensitive processing.


Spark Streaming is an extension of the core Spark API that enables high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources, such as Kafka, Flume, Twitter, ZeroMQ, or TCP sockets, and processed using complex algorithms expressed with high-level functions like map, reduce, join, and window. Finally, processed data can be pushed out to file systems, databases, and live dashboards. Since Spark Streaming is built on top of Spark, users can apply Spark's built-in machine learning algorithms (MLlib) and graph processing algorithms (GraphX) to data streams. Compared to other Hadoop streaming projects, Spark Streaming has the following features and benefits:

  • Ease of Use: Spark Streaming brings Spark's language-integrated API to stream processing, letting users write streaming applications the same way they write batch jobs, in Java, Python, or Scala.
  • Fault Tolerance: Spark Streaming is able to detect and recover from data loss mid-stream due to node or process failure.
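To make the processing model concrete, the micro-batch pattern behind Spark Streaming can be sketched in plain Python, with no Spark installation required. This is a simplified illustration, not Spark's actual implementation: the function names (`process_microbatch`, `windowed_counts`) are hypothetical, and it mimics a word count over a DStream with a sliding window of batches.

```python
from collections import Counter

def process_microbatch(lines):
    """Word-count one micro-batch of text lines, analogous to a
    map (split into words) followed by a reduce (sum counts) on one
    batch of a stream."""
    counts = Counter()
    for line in lines:
        for word in line.split():
            counts[word] += 1
    return counts

def windowed_counts(batches, window_size):
    """Sliding-window aggregation: after each micro-batch arrives,
    combine the results of the last `window_size` batches, similar in
    spirit to a windowed reduce over a stream."""
    results = []
    history = []
    for batch in batches:
        history.append(process_microbatch(batch))
        window_total = Counter()
        for batch_counts in history[-window_size:]:
            window_total.update(batch_counts)
        results.append(window_total)
    return results

# Simulated stream: each inner list is one micro-batch of lines.
stream = [["spark streaming"], ["spark spark"], ["streaming"]]
out = windowed_counts(stream, window_size=2)
print(out[1]["spark"])  # combines batches 0 and 1
```

In real Spark Streaming the same idea is expressed declaratively (for example via `reduceByKeyAndWindow` on a DStream), and the framework handles batching, distribution, and recovery.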