M.C. Srivas, CTO and Co-Founder of MapR Technologies, spoke recently at Spark Summit 2014 on “Why Spark on Hadoop Matters.” Spark, with its in-memory processing framework, provides a complementary full stack on Hadoop, and this integration is showing tremendous promise for MapR customers. Srivas presented several customer use cases and discussed how and when the integration of Spark and Hadoop delivers the best value for the end user.
A few key points from this talk include:
Apache Hadoop and the OSS ecosystem are rapidly evolving. MapR provides many parts of the Hadoop ecosystem as part of the MapR Data Platform.
There are many advantages to Spark, including ease of development thanks to simpler APIs and strong support for Python, Scala and Java. In addition, Spark stores data in RDDs (Resilient Distributed Datasets), which are held in memory and address the problem of reliability: a lost partition can be rebuilt from the lineage of operations that produced it. Another innovative feature of Spark is its support for in-memory data sharing across DAGs (Directed Acyclic Graphs), which makes it possible for different jobs to work with the same data very quickly. The Spark library includes many different processing models such as graph processing, MapReduce, SQL and machine learning, and all of these models can be unified using DAGs.
Hadoop brings several complementary advantages to Spark, such as unlimited scale, meaning you can accommodate multiple data sources, applications and users. Hadoop has grown into a multi-tenant, reliable, enterprise-grade platform with a wide range of applications that can handle files, databases, and semi-structured data. The combination of Spark and Hadoop makes it possible to take what was traditionally batch processing on Hadoop and move it to operational applications that are augmented by in-memory processing. Additionally, you can easily combine different kinds of workflows to immediately gain insights into the data.
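The RDD points above (in-memory data, recovery through lineage, and several jobs sharing the same data) can be illustrated with a toy model. This is a minimal sketch in plain Python, not Spark's API: the class ToyDataset and its methods are hypothetical names invented for this illustration, and unlike Spark, where keeping data in memory is requested explicitly with cache() or persist(), this toy always keeps results in memory.

```python
from typing import Callable, List, Optional


class ToyDataset:
    """Hypothetical stand-in for an RDD: it remembers how it was derived (its lineage)."""

    def __init__(self,
                 source: Optional[List[int]] = None,
                 parent: Optional["ToyDataset"] = None,
                 transform: Optional[Callable[[List[int]], List[int]]] = None):
        self._source = source        # raw input, only set on the root dataset
        self._parent = parent        # lineage: the dataset this one is derived from
        self._transform = transform  # lineage: how to derive it from the parent
        self._cache: Optional[List[int]] = None  # in-memory copy of the result
        self.compute_count = 0       # how many times the result was (re)built

    def map(self, fn: Callable[[int], int]) -> "ToyDataset":
        # Lazy: records the step in the lineage, computes nothing yet.
        return ToyDataset(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred: Callable[[int], bool]) -> "ToyDataset":
        return ToyDataset(parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self) -> List[int]:
        # Build the result from lineage only if no in-memory copy exists.
        if self._cache is None:
            self.compute_count += 1
            if self._parent is None:
                self._cache = list(self._source or [])
            else:
                self._cache = self._transform(self._parent.collect())
        return list(self._cache)

    def evict(self) -> None:
        """Simulate losing the in-memory copy, for example after a node failure."""
        self._cache = None


base = ToyDataset(source=[1, 2, 3, 4, 5, 6])
squares = base.map(lambda x: x * x)

# Two different "jobs" work from the same in-memory data.
job_a = sum(squares.collect())
job_b = squares.filter(lambda x: x % 2 == 0).collect()
computed_before_loss = squares.compute_count  # squares was built once and shared

# Lose the in-memory copy, then recover it by replaying the lineage.
squares.evict()
recovered = squares.collect()
```

After the eviction, the second call to collect() rebuilds the data from the recorded steps instead of reading a replicated copy, which is the idea behind the reliability claim. Real RDDs are partitioned across a cluster and recover only the lost partitions; the toy collapses all of that into a single list.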
Use cases for Spark include:
An industry-leading ad targeting platform uses MapR M7 to perform over 100 billion real-time auctions per day on its global transaction platform, which translates to about 3.5 petabytes of data that needs to be managed and analyzed. The company loads the data from the M7 tables into RDDs to augment scoring in real time.
A leading pharmaceutical company uses Spark to improve gene sequencing analysis capabilities, resulting in faster time to market. Before Spark, it would take several weeks to align chemical compounds with genes. With ADAM running on Spark, gene alignment takes only a few hours.
Cisco is running its Security Intelligence Operations on MapR M7 and Spark. Sensor data streams into M7, and Spark Streaming is used to run a first check against known threats. Next, the data is processed with GraphX and Mahout to correlate it, and the results are queried using SQL via Shark and Impala.
A health insurance company is using M7 to store patient information, which is combined with clinical records to compute re-admittance probability.
To summarize, Spark on Hadoop is gaining traction for real-time applications. Since it’s important to pick the right tool for the job, MapR gives you the entire Hadoop stack. Whether you want Spark, Shark, Impala, Drill, or Hive/Tez, MapR provides them simultaneously on the same cluster.
Want to learn more about MapR, Spark and our customers? Check out: