Lei Guo, I suggest you read the attached tech report for full context, but let me try to summarize the ideas here.
The reservation system receives reservation requests from users over a period of time. Note that each reservation can request resources well ahead of time (e.g., "I need 10 containers for 1 hour tomorrow, sometime between 3pm and 6pm"). The planner will try to "fit" all these reservations into the plan agenda, while respecting the user constraints (e.g., amount of resources and start_time/deadline) and the physical constraints of the plan (which is backed by a "queue", and thus has access to a portion of the cluster capacity). The APIs exposed to users allow them to express their flexibility (e.g., for a map-only job I can state that I can run with up to 10 parallel containers, but also with 1 container at a time); this allows the plan to fit more jobs by "deforming" them. A side effect of this is that we can provide support for gang semantics (e.g., I need 10 concurrent containers for 1 hour).
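To make this concrete, here is roughly what such a submission looks like through the YARN reservation client API. This is a sketch written against the Hadoop 2.6-era records, so treat the exact factory signatures as approximate across versions; the "dedicated" queue name and "nightly-report" label are made-up placeholders:

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.ReservationSubmissionRequest;
import org.apache.hadoop.yarn.api.records.ReservationDefinition;
import org.apache.hadoop.yarn.api.records.ReservationRequest;
import org.apache.hadoop.yarn.api.records.ReservationRequestInterpreter;
import org.apache.hadoop.yarn.api.records.ReservationRequests;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ReservationExample {
  public static void main(String[] args) throws Exception {
    YarnClient client = YarnClient.createYarnClient();
    client.init(new Configuration());
    client.start();

    // Gang variant: all 10 containers (2GB, 1 vcore each) must run
    // concurrently for 1 hour (concurrency == numContainers).
    ReservationRequest gang = ReservationRequest.newInstance(
        Resource.newInstance(2048, 1), 10, 10, 3600_000L);

    // Malleable variant: same 10 containers, but as few as 1 at a time is
    // acceptable, which lets the planner "deform" the job to fit
    // (exact malleable semantics vary; this is illustrative).
    ReservationRequest malleable = ReservationRequest.newInstance(
        Resource.newInstance(2048, 1), 10, 1, 3600_000L);

    // R_ANY: the planner may satisfy ANY one of the listed alternatives.
    ReservationRequests alternatives = ReservationRequests.newInstance(
        Arrays.asList(gang, malleable), ReservationRequestInterpreter.R_ANY);

    // Feasibility window: tomorrow between 3pm and 6pm (epoch millis;
    // computed naively here for brevity).
    long arrival = System.currentTimeMillis() + 24L * 3600_000L;
    long deadline = arrival + 3L * 3600_000L;
    ReservationDefinition definition = ReservationDefinition.newInstance(
        arrival, deadline, alternatives, "nightly-report");

    // "dedicated" is a hypothetical reservable queue backing the Plan.
    ReservationSubmissionRequest request =
        ReservationSubmissionRequest.newInstance(definition, "dedicated");
    System.out.println("Reservation accepted: "
        + client.submitReservation(request).getReservationId());
  }
}
```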
The key intuition is that each job might temporarily use a large amount of resources, but we control very explicitly when it should yield resources back to other jobs. This explicit time-multiplexing gives very strong guarantees to each job (i.e., if the reservation was accepted, you will get your resources), while allowing us to densely pack the cluster agenda (and thus achieve high utilization / high ROI). Moreover, best-effort jobs can run on separate queues with the standard set of scheduling invariants provided by the FairScheduler/CapacityScheduler.
Another interesting area in which enterprise settings can extend/innovate is the choice of "SharingPolicy". The SharingPolicy is a way for us to determine (besides physical resource availability) how many resources a tenant/reservation can ask for in the Plan. This applies both per-reservation and across reservations from a user (or group). So far we have contributed a couple of simple policies that enforce instantaneous and over-time limits (e.g., each user can grab up to 30% of the plan instantaneously, but no more than an average of 5% over a 24h period of time). Internally at MS, we are developing other policies that are specific to business rules we care to enforce in our clusters. By design, creating a new SharingPolicy that matches your business settings is fairly easy (narrow API and easy configuration mechanics). Since the Plan stores past (up to a window of time), present, and future reservations, the policy can be very sophisticated and explicit. Also, given the run-length-encoded representation of the allocations, the algorithms can be quite efficient.
The reservation agents are the core of the placement logic. We developed a few, which optimize for different things (e.g., minimizing the cost of the allocation by smoothing out the plan, or placing the job as late/early as possible within its window of feasibility). Again, this is an area of possible enhancement, where business logic can kick in and choose to prioritize certain types of allocations.
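To illustrate the kind of decision an agent encapsulates, here is a toy version of the "smoothing" heuristic: among all feasible start times in [arrival, deadline], pick the one that minimizes the resulting peak of the plan. Everything here (names included) is an illustrative sketch over the same RLE step-function view as above, not the actual agent code:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy "smoothing" agent: place a gang so as to minimize the plan's peak load. */
class SmoothingAgent {
  /**
   * @param plan      RLE step function: start-time -> total containers in use
   * @param capacity  plan capacity in containers
   * @param gang      containers needed concurrently
   * @param duration  how long the gang runs (ms)
   * @param arrival   earliest allowed start (ms)
   * @param deadline  latest allowed finish (ms)
   * @return chosen start time, or -1 if the reservation cannot be fit
   */
  static long place(NavigableMap<Long, Integer> plan, int capacity,
                    int gang, long duration, long arrival, long deadline) {
    long bestStart = -1;
    int bestPeak = Integer.MAX_VALUE;
    // Candidate starts: the arrival plus every breakpoint of the step
    // function (an optimum always aligns with one of these, since the load
    // is constant between breakpoints).
    TreeMap<Long, Integer> candidates = new TreeMap<>(plan);
    candidates.put(arrival, 0);
    for (long start : candidates.keySet()) {
      if (start < arrival || start + duration > deadline) continue;
      int peak = peakLoad(plan, start, start + duration);
      if (peak + gang <= capacity && peak + gang < bestPeak) {
        bestPeak = peak + gang;
        bestStart = start;
      }
    }
    return bestStart;
  }

  /** Max load of the RLE step function over [from, to). */
  static int peakLoad(NavigableMap<Long, Integer> plan, long from, long to) {
    int peak = valueAt(plan, from);
    for (int v : plan.subMap(from, false, to, false).values()) {
      peak = Math.max(peak, v);
    }
    return peak;
  }

  static int valueAt(NavigableMap<Long, Integer> plan, long t) {
    java.util.Map.Entry<Long, Integer> e = plan.floorEntry(t);
    return e == null ? 0 : e.getValue();
  }
}
```

A "latest possible" agent would be the same loop keeping the largest feasible start instead of the lowest peak; this is where business-specific placement preferences can plug in.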
Finally, in order to "enforce" these planned decisions, we use dynamically created and resized queues (each reservation can contain one or more jobs, so the queue mechanism is useful to reuse). Note that Arun C Murthy's comment was fairly technical, and related to this last point. He was proposing to leverage application priorities instead of queues as the enforcement mechanism. Both are feasible, and each has pros and cons. Overall, using queues allowed us to reuse more of the existing mechanisms (e.g., relying on the preemption policy, and all of the advancements people are contributing there).
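For the curious, the enforcement loop is conceptually simple; a toy version of what the plan-follower does on every tick might look as follows (the real logic lives on the scheduler side of YARN; the interface and names below are hypothetical simplifications):

```java
import java.util.Map;

/** Toy scheduler handle: create/resize child queues under the plan queue. */
interface MiniScheduler {
  void addQueue(String queuePath);  // contract here: no-op if it already exists
  void setQueueCapacity(String queuePath, float fractionOfPlan);
}

/**
 * Conceptual enforcement step: at each tick, read what the Plan promised to
 * each reservation "now" and resize the matching queue accordingly, so the
 * jobs inside each reservation receive exactly their planned share.
 */
class MiniPlanFollower {
  private final MiniScheduler scheduler;
  MiniPlanFollower(MiniScheduler scheduler) { this.scheduler = scheduler; }

  /** @param plannedNow reservationId -> containers promised at the current time */
  void synchronize(Map<String, Integer> plannedNow, int planCapacity) {
    for (Map.Entry<String, Integer> e : plannedNow.entrySet()) {
      String queue = "root.plan." + e.getKey(); // one queue per reservation
      scheduler.addQueue(queue);
      scheduler.setQueueCapacity(queue, e.getValue() / (float) planCapacity);
    }
    // Queues of expired reservations would be drained and removed here;
    // preemption (reused from the existing schedulers) claws back containers
    // from reservations whose planned share is shrinking.
  }
}
```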