At the Spark Summit 2015 conference held in San Francisco, Anil Gadre, Senior Vice President of Product Management for MapR, presented a featured keynote titled "Spark & Hadoop at Production Scale" where he highlighted how leading companies are deploying Spark with Hadoop in production. During his talk, he shared real-life customer examples of turning data into action using Spark and Hadoop, and he also discussed how advanced users are deploying Hadoop and Spark applications in one cluster with better reliability and performance at production scale. Below is his talk. - Michele Nemschoff
You are probably all somewhere on the Spark journey to production scale—you're either at Spark Summit to learn, to start doing something with Spark, or perhaps you already have mission-critical applications running in your enterprise. On this journey there's a lot to think about—mostly about your application—but you also need to figure out how to actually get Spark to production scale, because more and more groups will want the power of its results and the value of using Spark in mission-critical, operational deployments.
4 Keys to Success
The good news is that lots of companies are already doing this at large scale. The common thread among these companies is that success comes down to four things:
- Full Spark stack support. Lots of organizations are taking Spark into large, mission-critical, enterprise-grade applications. At MapR, we have examples from the healthcare, government, telecom, and financial services industries, all in large-scale environments. The common thread is full Spark stack support. You can get the components and try to stitch them together yourself, but what you really need is a complete Hadoop distribution that comes with everything required to make life easy.
- The underlying distribution should truly enable real-time performance. You obviously want to try to increase velocity. Is the architecture of the core underlying distribution something that, in fact, enables the real-time performance that you're looking for?
- Enterprise-grade reliability and security are key. As soon as your application grows, becomes mainstream, and people start depending on it, IT and the other teams responsible for deployment start to care a lot about enterprise reliability and security.
- Your solution must include open-ended agility. This environment is moving and changing quickly. Is your solution open-ended, and is it really agile enough to keep up with new things that are going to keep coming every six months?
Cisco, Quantium, Novartis, and Razorsight: All Using Spark in Production Today
Cisco runs a very large, complex network monitoring and threat detection system as a managed service around the world. They use Spark Streaming to process incoming messages in real time, and they use all the components of the full Spark stack. Here's an example of an organization that needed the ability to run the full stack in real time, with fast ingestion achieved by mounting the entire cluster via NFS (which our Hadoop distribution uniquely makes possible). That's one of the main reasons people find it so much easier, just in the basics of ingesting the data. They're also using Spark SQL to process and interrogate the data. You can tell this is a mission-critical application for Cisco—it has to keep up, and it has to deliver the SLAs they've committed to, both for their customers and internally.
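The pattern described above—streaming ingestion interrogated with Spark SQL—can be sketched roughly as follows, using the Spark 1.x APIs that were current at the time. The host, port, and "events" schema are invented for illustration; this is not Cisco's actual pipeline.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal sketch: ingest log lines from a socket in micro-batches,
// then query each batch with Spark SQL.
object StreamingSqlSketch {
  case class Event(source: String, severity: String)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("monitoring-sketch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(10)) // 10-second micro-batches

    val lines = ssc.socketTextStream("localhost", 9999)
    lines.foreachRDD { rdd =>
      val sqlContext = new SQLContext(rdd.sparkContext)
      import sqlContext.implicits._

      // Parse "source,severity" records and register them for SQL.
      val events = rdd.map(_.split(","))
                      .filter(_.length == 2)
                      .map(a => Event(a(0), a(1)))
                      .toDF()
      events.registerTempTable("events")

      // Interrogate the current batch, e.g. count critical alerts per source.
      sqlContext.sql(
        "SELECT source, COUNT(*) AS alerts FROM events " +
        "WHERE severity = 'CRITICAL' GROUP BY source").show()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

In a real deployment the socket source would typically be replaced by a durable ingest channel, and the batch interval tuned to the SLA.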
Quantium is an Australian service provider building the largest big data cluster in Australia, offering analytic services to a range of industries, from retail to banking to telecom. Quantium has made it easy for their end users to access analytics in the cloud. They needed to access their customers' business systems data, ingest it (via NFS), have full Spark support, and provide interactive or near-interactive performance to end users, who are typically line-of-business users. Those users don't write code; they work with graphical tools and dashboards. Quantium also realized that this wasn't just about meeting a higher level of SLA—delivering better SLAs is actually a competitive advantage for them.
Novartis is a key player in genomics and drug design. Their models change faster than their infrastructure could keep up with, and the more quickly they can deliver new and updated models to their researchers, the faster the entire drug discovery process moves. Again, they used NFS ingestion to reduce ETL processing time. By getting data through the system quickly, researchers became more productive and could rely on it on a mission-critical basis. In addition, they have multi-tenant requirements: their different departments share one cluster, and those departments hold confidential data that must be kept firewalled from each other—and this needed to be built into the infrastructure.
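NFS ingestion, mentioned in several of these examples, simply means the cluster's file namespace is mounted like any POSIX filesystem, so ordinary Unix tools can write data in directly. A minimal sketch of that workflow is below; the paths are hypothetical, and a local directory stands in for the real cluster mount point so the commands can be tried anywhere.

```shell
# In a real deployment MOUNT would be the cluster's NFS mount
# (e.g. /mapr/<cluster-name>); here a local directory stands in for it.
MOUNT=${MOUNT:-/tmp/mapr-demo}
mkdir -p "$MOUNT/etl/incoming"

# With the cluster mounted as a filesystem, ingestion is just a copy --
# no special loaders or staging steps.
echo "sample,reading" > sample.csv
cp sample.csv "$MOUNT/etl/incoming/"
ls "$MOUNT/etl/incoming"
```

This is why the talk frames NFS as reducing ETL: the "load" step collapses into a file copy that existing tools and scripts already know how to do.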
You can see the theme building here: the three examples so far all involve serving multiple customers or departments from shared infrastructure. That's why they need multi-tenancy, fast ingestion rates, and the ability to use Spark and Hadoop in mission-critical situations.
Razorsight serves the needs of a variety of different customers. They create analytical models that their customers—business users—tap into and use on a day-to-day basis. They use the entire Spark stack for fast ingestion and reduced ETL, moving from a world of batch ETL to much more of an "ingest and go" real-time world. Along the way, they found they incurred about one-tenth the total cost of ownership compared to classical enterprise data warehousing technology—and the legacy technology could never have met the real-time requirements they now have. The time to onboard a new customer dropped from months to weeks, which is a huge advantage in making it easier for end users to gain value from big data. I point this out because you're probably either creating an application for use inside your own company, or you're a service provider creating applications for other end users—the same themes show up over and over.
Databricks and MapR: A Strategic Partnership
We are really proud of the partnership MapR has with Databricks. It provides a single point of contact for support issues and includes support for the complete Spark stack, as well as engineering and roadmap collaboration.
MapR: The Most Complete Spark Environment
One of the things that makes it easy is that you get all the components as part of our distribution. By the way, we have a completely free distribution, the MapR Community Edition, that you can download and use in production. There is also a paid version, the MapR Distribution for Apache Hadoop, which has enterprise-grade support and features such as disaster recovery, mirroring, and snapshots. In fact, mission-critical capability is the main reason most of our customers pick us, with NFS support the second; it makes life so much easier.
MapR: Operations + Analytics on One Hadoop Platform
One of the things we're seeing leading-edge customers do is increasingly move away from the notion of "I've got my analytics cluster and I've got my operational cluster" toward a single cluster. The MapR Distribution makes it possible, in an integrated fashion, to have one cluster that pulls all of this together—which is why real-time capability can be coupled with enterprise-grade, mission-critical capability.
Spark + MapR: Ready for Production Success
We're really proud of what we're doing with Databricks to take away the headache. We are bringing Hadoop and Spark together so you can have an infrastructure you can simply rely on, and get data into and out of fast.
3 New Spark-Based Quick Start Solutions
We have three new MapR Quick Start Solutions: one for real-time security log analytics, one for time series analytics, and one for genome sequencing. They're designed to avoid the problem of big science projects that drag on for a year at who knows what cost, and instead to get you to value quickly.
Apache Drill: The Industry’s First Schema-Free SQL Engine for Big Data
I'd also like to point you to a great Apache project: Apache Drill. You can download it and try it. Drill queries self-describing data—you don't have to know what's in the data or define a schema in advance. I mention it here because big data projects often start with "I don't actually know what to do with this data." With Apache Drill, you can point it at a file, it will discover the schema on read, and it will start helping users in your organization figure out what kinds of use cases they might be able to dive into.
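To make the "point it at a file" idea concrete, here is a minimal sketch of querying a raw JSON file through Drill's JDBC driver, with no schema declared anywhere. It assumes a Drill instance running locally in embedded mode and the Drill JDBC driver on the classpath; the file path and column names are invented for illustration.

```scala
import java.sql.DriverManager

// Hypothetical sketch: SQL over a raw JSON file via Apache Drill.
// Drill discovers the schema on read, so the file is queryable as-is.
object DrillSketch {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.drill.jdbc.Driver")
    // "zk=local" connects to a local embedded-mode Drill instance.
    val conn = DriverManager.getConnection("jdbc:drill:zk=local")
    try {
      val rs = conn.createStatement().executeQuery(
        "SELECT t.user_id, t.action FROM dfs.`/data/clicks.json` t LIMIT 10")
      while (rs.next()) {
        println(s"${rs.getString("user_id")}\t${rs.getString("action")}")
      }
    } finally {
      conn.close()
    }
  }
}
```

Note that there is no CREATE TABLE or schema registration step anywhere—the query runs directly against the file, which is the point of the "schema-free" claim.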
Take Advantage of Free, On-Demand Training
We also have a set of free courses on basic Hadoop cluster management skills online—go check it out! Spark courses will be coming soon.
Want to learn more about Spark? Check out these resources:
- Apache Spark
- The Essential Apache Spark Cheat Sheet
- Webinar - Spark + Hadoop: Your Formula for Data-Driven Success
- Enterprise-Grade Spark: Leveraging Hadoop for Production Success
- Forrester Report: Apache Spark is Powerful and Promising
- Official Apache Spark site
- ebook: Getting Started with Apache Spark — Chapter 1: What is Spark?