At the Strata + Hadoop World 2015 conference held in San Jose, Ted Dunning, Chief Application Architect for MapR, gave an exciting talk titled “YARN vs. MESOS: Can’t We All Just Get Along?” in which he showcased how YARN and Mesos can work together to seamlessly share datacenter resources.
There were five major parts to Ted’s talk:
- Why We Need Global Management of Datacenter Resources
- How Myriad Works
- Use Cases
- Lessons Learned
- The Future
The bulk of his talk focused on the Myriad Project, a new open source/open community project that started as a collaboration between Mesosphere, MapR, and eBay. The Myriad Project became an incubator project on March 1st of this year. The goal of the project is to be a global resource management tool for multiple data centers. Here is the text of his talk.
Why Do We Need Global Management of Datacenter Resources?
We need tight integration between the resources that we have and the programming models that we’re using. We really need a global view of what the resources need to be, and a global specification of those kinds of resources. In addition, we need user-specified resources: complete flexibility in terms of which kinds of resources we are going to schedule. We want a lightweight executor, but at the same time very strong isolation. And while we want really strong isolation, we also want lightning-fast task launch capabilities.
Other things that we need include fast scheduling, meaning that the scheduler needs to make decisions instantly. Otherwise, it makes no sense to run a 2-second elapsed time program if it takes 10 seconds to schedule (as it historically has in Hadoop). Conversely, at the same time, we want very careful (slow) scheduling – scheduling that makes projections about the likely future resource consumption of programs, and then optimizes where they’re placed. We have a complete contradiction here between needing very fast and very careful scheduling. We also need to be able to schedule both long-lived and short-lived system tasks, as well as long-lived ephemeral tasks, which are tasks that can be killed at any time. We also need preemption, and we need prevention of preemption. So you can see that there are a lot of contradictions in terms of what we want to do.
What Do We Need?
In addition, we need very good support of the entire Hadoop eco-system, which includes tight integration of MapReduce2, Tez, Impala, Drill and Spark. But you need to remember that the official meaning of Hadoop is YARN, HDFS, and MapReduce. As I said, MapReduce for new applications is not very interesting, and HDFS is somewhat less interesting, because we provide our own data layer. That leaves YARN, which is the new and interesting part. But when people think of Hadoop, they almost always mean a much broader idea—they’re talking about the things that are associated with Hadoop. It’s like saying “Russia” instead of “Moscow.” So Hadoop in the vernacular means all of the things that are associated with Hadoop, such as Tez, Impala, Drill and Spark.
We also need good support for all of the non-Hadoop communities, such as arbitrary containers, web servers, systems processes without containers, user-defined containers, and licensing constraints.
These contradictions are a problem on almost every level on the specification of what we need. With almost every aspect of what we need, we need the exact opposite at the same time. When I’m talking about “need” I mean what we need in order to have a datacenter-wide, industrial-scale, business-friendly framework. These inherent contradictions provide us both with a problem, and a huge opportunity. That opportunity is Myriad. Myriad is what we’re going to use to provide a solution to both sides of these contradictory needs—it’s an opportunity for synthesis.
What We Have – YARN
What we have in terms of YARN are:
- A ResourceManager and a NodeManager that have a heartbeat architecture. That means that the remote procedure calls that these things use to communicate tend to be unidirectional. In order to allow the chance for reverse communication, that unidirectional communication is done periodically with what’s called a heartbeat. During the heartbeat, the reply to the heartbeat can be requests that are going in the upstream direction. This heartbeat architecture is directly inherited from the grand Hadoop tradition of batch-only programming, and it leads to some very slow response times. But it also is an eminent lineage, because it leads back to a lot of optimizations for very high utilization of data centers that are running Hadoop-like tasks.
- Application Master, Task Containers: We’ve inherited a lot of code from the JobTracker and the TaskTracker of MapReduce1, and we now have the ResourceManager and NodeManager, and we’ve split the remaining specialized parts into application managers and task containers for specialized versions of this. Note that we’ve inherited some of that tradition, and some of those assumptions.
- Monolithic scheduling: One of the big ones is monolithic scheduling, in which a single central scheduler decides how everything is placed. Monolithic scheduling says, “This is how it’s done.”
- Preemption: With YARN, we also get preemption.
- Hadoop standard: We get really good integration with the Hadoop community. This is central to the Hadoop community, and it’s completely natural that we’d have that.
- Pre-defined resources: We also get pre-defined, rigid resources. It takes months to propose that there be a new way of scheduling anything. We contributed label-based scheduling, and there was another group that supported that as well. It takes a long time to do that—it’s neither quick nor something you can do privately.
- Good Hadoop eco support: YARN also has good support from the Hadoop ecosystem, including MapReduce2, Tez, Impala, Drill and Spark.
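To make the heartbeat point above concrete, here is a minimal toy sketch, in Python with entirely hypothetical class and method names (this is not YARN's actual RPC protocol), of how a heartbeat-style architecture piggybacks downstream commands on the reply to a periodic upstream heartbeat:

```python
class ResourceManager:
    """Toy RM: queues container-launch commands per node and hands them
    back as the *reply* to each node's periodic heartbeat."""

    def __init__(self):
        self.pending = {}  # node_id -> list of queued commands

    def submit(self, node_id, command):
        self.pending.setdefault(node_id, []).append(command)

    def heartbeat(self, node_id, node_status):
        # The heartbeat reply is the only channel for RM -> node
        # communication, so any queued commands ride back on it.
        return self.pending.pop(node_id, [])


class NodeManager:
    """Toy NM: reports status on each heartbeat and acts on the reply."""

    def __init__(self, node_id, rm):
        self.node_id, self.rm = node_id, rm
        self.launched = []

    def tick(self):
        # One heartbeat cycle: report status, receive commands in reply.
        reply = self.rm.heartbeat(self.node_id, {"free_mem_mb": 4096})
        self.launched.extend(reply)


rm = ResourceManager()
nm = NodeManager("node-1", rm)
rm.submit("node-1", "launch-container-42")
nm.tick()  # the command only arrives on the next heartbeat
print(nm.launched)  # → ['launch-container-42']
```

Because a command can only travel on the next heartbeat reply, the worst-case dispatch latency is a full heartbeat interval, which is one source of the slow response times described above.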
What We Have – Mesos
Mesos is an older system than YARN. It is more advanced than YARN in some ways, and less presupposing in others. The question of which is better or more advanced is very hard to answer, because they have different goals, origins, and methodologies.
- Two-level scheduling: Mesos does not have a monolithic scheduler. It has two-level scheduling that lets you define your resources and how things are scheduled, and it also lets you define how things work. The bottom level of that is application-specific, although there are frameworks that help you.
- Actor-based, bidi RPC: It's actor-based, which is a style of coding that allows bi-directional procedure calls between components. That means it can do things very quickly: the component that needs something to happen can directly request it, and this leads to extremely fast process launches—almost on the scale that you're used to on a single machine.
- Marathon, Chronos: It has excellent and mature support in the form of Marathon for long running processes, Chronos for a Cron replacement, and it has support for ISO8601 repetition. Mesos also includes support for JBoss, Jetty, Sinatra and Rails.
- User-defined resources and attributes: Mesos also lets you define your own resources and attributes.
- Some Hadoop (Spark native) – It has good integration with some Hadoop components, notably Spark. Spark came from the same community that Mesos came from.
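The two-level scheduling idea above can be sketched roughly as follows. This is a toy Python illustration with hypothetical names, not the real Mesos API: the master (level one) decides which framework sees which resource offers, and each framework's own scheduler (level two) decides what, if anything, to launch on them:

```python
class MesosMaster:
    """Toy master: offers free agent resources to registered frameworks.
    Level 1 of two-level scheduling: deciding who gets offered what."""

    def __init__(self):
        self.frameworks = []
        self.free = []  # list of (agent_id, cpus, mem_mb) bundles

    def register(self, framework):
        self.frameworks.append(framework)

    def add_agent(self, agent_id, cpus, mem_mb):
        self.free.append((agent_id, cpus, mem_mb))

    def offer_round(self):
        # Present the free resources to each framework in turn;
        # whatever a framework accepts is no longer free.
        for framework in self.frameworks:
            accepted = framework.resource_offers(self.free)
            self.free = [o for o in self.free if o not in accepted]


class SparkLikeFramework:
    """Level 2: an application-specific scheduler that decides which
    offers to accept and what tasks to run on them."""

    def __init__(self, cpus_needed):
        self.cpus_needed = cpus_needed
        self.tasks = []

    def resource_offers(self, offers):
        accepted = []
        for agent_id, cpus, mem_mb in offers:
            if self.cpus_needed > 0 and cpus >= 1:
                self.tasks.append(f"task on {agent_id}")
                self.cpus_needed -= cpus
                accepted.append((agent_id, cpus, mem_mb))
        return accepted


master = MesosMaster()
fw = SparkLikeFramework(cpus_needed=4)
master.register(fw)
master.add_agent("agent-1", 2, 8192)
master.add_agent("agent-2", 2, 8192)
master.offer_round()
print(fw.tasks)  # → ['task on agent-1', 'task on agent-2']
```

The key design point is that the master never needs to understand any framework's placement logic; that is what makes the bottom level application-specific.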
Myriad Integrates Mesos and YARN
Again, we have two very different cultures; two very different sets of virtues. They sound very much the same in some respects, especially to an outsider, but they are very, very different. Those virtues meet our contradictory requirements in some really cool ways, and Myriad is intended to integrate the virtues of each of these systems.
How Does Mesos Work?
Mesos is, to some extent, either on the top or on the bottom, depending on how you draw the diagram. If it's fundamental, it’s on the bottom; if it’s overarching, it's on the top, but either way it is in some sense the prime mover. It is the thing that first starts stuff. One of the things it can start is a YARN cluster, or a set of web servers.
Mesos creates virtual clusters – Once it starts a YARN cluster, the YARN cluster itself can schedule whatever resources Mesos has given it. This has many of the virtues of Mesos, because it has very broad industrial capabilities, but it also inherits many of the virtues of YARN for running Hadoop programs. It's a really simple idea. YARN can run under this. It can get resources from Mesos, and it can release them back. You can tune up one cluster, and you can tune down another.
Here's a picture of roughly how it works:
An application makes a request to YARN, but YARN has a plug-in (a Mesos scheduler) that talks to Mesos and requests resources from it. Those resources then become NodeManagers, which YARN thinks of as ordinary YARN executors, and those in turn execute containers, which actually run the Hadoop-oriented tasks. The basic idea is that there are shim layers that allow each side of the fence to ignore what's going on on the other side, while maintaining the virtues of both. Mesos can also now be scheduling all kinds of other different things in addition to YARN.
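A rough sketch of that shim idea, with hypothetical names (this is not the actual Myriad code), might look like this: a Mesos framework scheduler that spends the offers it accepts on YARN NodeManagers, so that from YARN's point of view new capacity simply appears:

```python
class MesosOffer:
    """A resource offer from a Mesos agent."""

    def __init__(self, agent_id, cpus, mem_mb):
        self.agent_id, self.cpus, self.mem_mb = agent_id, cpus, mem_mb


class YarnCluster:
    """Toy YARN RM view: it just sees NodeManagers and their capacity."""

    def __init__(self):
        self.node_managers = {}

    def add_node_manager(self, agent_id, cpus, mem_mb):
        self.node_managers[agent_id] = {"cpus": cpus, "mem_mb": mem_mb}

    def total_cpus(self):
        return sum(nm["cpus"] for nm in self.node_managers.values())


class MyriadShim:
    """Toy shim: a Mesos framework scheduler that converts accepted
    offers into YARN NodeManagers instead of running tasks directly."""

    def __init__(self, yarn_cluster):
        self.yarn = yarn_cluster

    def resource_offers(self, offers):
        for offer in offers:
            # Launch a NodeManager sized to the offer; YARN will treat
            # it as ordinary capacity and schedule containers onto it.
            self.yarn.add_node_manager(offer.agent_id, offer.cpus,
                                       offer.mem_mb)


yarn = YarnCluster()
shim = MyriadShim(yarn)
# Mesos offers resources; the shim turns them into YARN capacity.
shim.resource_offers([MesosOffer("agent-1", 4, 16384),
                      MesosOffer("agent-2", 4, 16384)])
print(yarn.total_cpus())  # → 8
```

Releasing resources back to Mesos would run the same translation in reverse: shut down a NodeManager and let its agent's resources return to the offer pool.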
How Myriad Works
Mesos runs YARN – There are some interesting issues here, in that you still need persistence layers and other things. But the basic way it runs is, again: Mesos runs YARN, and YARN runs YARN programs. You can have multiple YARN clusters in the same environment at the same time, so it's very easy to run different versions of YARN.
Mesos runs programs + YARN fakeout – There's also a recent hack where you can actually launch things in YARN which are assigned resources by YARN, and those resources can then be temporarily released back to Mesos. It takes a while to spin up and spin down YARN; this lets you do that very, very quickly.
One of the requirements here, if we're going to have two YARN clusters like this, is that we very often want a new YARN cluster to refer to data from our previous incarnation of that YARN cluster or to data produced by another YARN cluster. The idea is that you have to have a universal name space below that. You have to have a universal way to persist data.
Here are some use cases that I have seen on practically a daily basis. We use systems very much like this in our own internal production.
Use Case #1: I Want a Cluster
Very common need – I want a cluster. Can I have it for, like, 30 minutes? Do I want to go to AWS, spend ten minutes launching nodes and installing software…do I want to spend all that time doing that? I want to have the cluster for 20 minutes. Do I want to spend 20 or 30 minutes negotiating with everything to get in there? No, I want it now. I want it a minute from now. I want to do a tiny bit of work on it, and then I want to give it back. This idea of an ephemeral cluster is not virtualization; it is not that very heavyweight sort of thing. It's, “I want something really ephemeral right now that's isolated from other resources. It does what it does, and it does what it does really well, and I want it without having to think too much about exactly which version people have. I want it just to light up and use it.” I want to be able to use it for QA and for development. I want to do compatibility testing, so I might want to be testing my applications to see if they're going to work with the next version of YARN.
YARN doesn’t run YARN well – Now YARN could conceivably do the job of Mesos here, but YARN doesn't run YARN. If YARN ran the next version of YARN, then how would you ever unencapsulate it to be able to step forward to the next version? It's just mind-boggling and confusing.
Myriad does this trivially, but… – Mesos provides a very low-level executor, and that gives you what you really want. Myriad does all of this trivially, but you need data localization. I need to be able to say that this data lives here, that it will continue to live there, and that later I need to run my cluster as near to there as possible. It also needs a universal name space. I need to be able to refer to my data by name, as I did before, no matter where I happen to be running at this moment. If there are multiple clusters for persistence underneath, then I need to be able to refer to them from anywhere.
Use Case #2: YARN Version Upgrade
Another use case is the version upgrade problem. We have this massive thing; it's running in production, running tens of thousands of jobs a day. Oh – here’s a new version, so you have to upgrade. How? Which of those 50,000 jobs can you stop for a while? Stopping for a while is not in the cards. What they want is to be able to light up test versions so they can qualify one application at a time. They want to build up the new version and move applications across to it. They want resources to follow the applications: as we light up the new version of YARN, we want to move this application across. And as that one starts running, we want to move another one, and we want to grow the resources assigned to the new cluster and decrease the resources assigned to the old cluster according to how many applications have been migrated.
This is a fact of life. You cannot have an atomic moment in time when everything upgrades. You cannot upgrade the platform in all applications at the same moment in time without having your users kill you. It just will not happen.
YARN doesn’t run YARN well (again) – YARN doesn't really run YARN. Imagine trying to run one YARN here, have the next version there, and have that become the primary version…it just boggles the mind, and it’s just not going to happen. You need to break that cycle. You can't really have data center operational recursion.
Myriad does this task also trivially, but oddly enough, you need, again, data localization and universal name space. That's because the new YARN has to access and interoperate with the old data. It has to produce new data that old applications interoperate with. As you're migrating things over to the new YARN, you have to be touching exactly the same data.
Use Case #3: Resource Slosh
I call our third use case resource sloshing. This comes from a conversation I had with one of our largest customers. We had not yet released our beta project, and they went ahead and put it into massive production. Everybody in this room, I can say with high confidence, touches their product. Trillions of dollars are aggregated through their products. 96% of the world's web population touches their product monthly.
Resource slosh – The way they work is that they have this big data ingestion pulse. Hundreds and hundreds of servers are pulling data from hundreds of thousands of other servers out in the wild. It's a huge load to move and collate all the data into usable storage. They cannot analyze that data while it is still arriving, because they have to do accurate summations over all of the data. The data comes in very heavily at first, and then the stragglers come in. Once the right percentage of the data has arrived, they start analytics. That means that the load due to ingestion has gone to near zero by the time they start doing analytics, and analytics, of course, is zero load until they start it. So they have this huge amount of resources, some specialized for ingestion and some specialized for Hadoop, and they alternate the usage. By nature, the utilization of this cluster can't be more than 50 or 60%. It just can't be, because you're using it for one thing or the other.
Conflict between sysop/Linux and Hadoop viewpoints – Because the ingestion stuff is classic sysop/Linux sort of stuff (web-server ingest software), it's never going to run on Hadoop sorts of things, but it already does work pretty darn well on Mesos.
Myriad does this trivially, but – With Myriad, they could slosh the resources back and forth. They could boost that ingest cluster up to enormous levels, ingest the data into a consistent data layer, and then roll the resources back, rolling them up on the analysis side. They could cut their hardware costs by 30 or 40%, and they've got a lot of hardware. Myriad does this trivially. Can you hear the refrain coming? But: it must have data localization and a universal name space; that's what's required to make this work.
Omega Paper – Even though Myriad is a very young project, we have learned some lessons, though not all of them are from this project. The Omega paper from Google is a very interesting read. The paper contains their views on advanced scheduling, and it’s very clear from their work that a single scheduler framework cannot be viable for a broad spectrum of uses. The Mesos community has responded to this analysis, and they've enabled multi-level scheduling very much along the lines of what Google was talking about in the Omega paper.
Multi-cultural software is cool - Multi-cultural software really is multi-cultural; you have the Hadoop culture, and you have the Sysop culture. They often talk past each other when they talk about what they need. They really do say nearly opposite things without really understanding what the other side is saying, but this multiculturalism really could be an advantage here if we value both sides. So let's put them together; that's what we're doing with Myriad.
One incubator project (Slider) doesn’t change that - There's been one incubator project called Slider, which attempts to put containers under YARN, but it doesn't untangle all of the problems that we talked about earlier with a uni-cultural solution. We need a multi-cultural solution, and that's what Myriad is.
Incubator - There's a proposal for Myriad right now at Apache. The discussion period has been very positive. The initial team is from Mesosphere, eBay, and MapR.
Community building – We’ve already met the diversity goals of Apache community building, but the community still needs to be considerably larger. You can contribute to a community like this by merely expressing what you need out of it—that's a huge contribution. Myriad is a very open project, so anybody in this room would be very welcome to join. During incubation, of course, it's much easier to join a project, because things are new; it's a pioneering setting.
The future also involves a cultural reference here. Older whiskey, faster horses, more features—that's where it needs to go. The goal of Myriad from here is not world domination; it's co-existence by specialization—valuing what each system does well and driving it forward.
Want to learn more? Check out these Myriad Project resources: