When TaskSchedulerImpl fails to find an open slot for a task, it falls back to the preemptor.
This is problematic when the task store is large (on the order of 10k tasks) and there is a steady supply of PENDING tasks that open slots cannot satisfy: every unsatisfied scheduling round triggers a preemption scan over the task store. This manifests as an overall degraded/slow scheduler, and as slow-query log entries for the queries the preemptor issues.
Several approaches, not mutually exclusive, come to mind to improve this situation:
- (easy) More aggressively back off on tasks that cannot be satisfied
- (easy) Fall back to preemption less frequently
- (easy) Gather the list of slaves from AttributeStore rather than TaskStore. This breaks the operation into many smaller queries and reduces the amount of work when a match is found. However, it creates more work when no match is found, so this approach is probably not helpful on its own.
- (harder) Scan for preemption candidates asynchronously, freeing up the TaskScheduler thread and the storage write lock. Scans could be kicked off by the task scheduler, ideally in a way that doesn't dogpile. The scan could also read storage in a weakly-consistent way so that it contributes minimally to storage contention.
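The first two (easy) options amount to throttling how often a failed scheduling round is allowed to reach the preemptor. A minimal sketch of a per-group exponential backoff, assuming we key on some task-group identifier; all class and method names here are hypothetical, not Aurora's actual API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: exponential backoff for task groups whose
// preemption scans keep coming up empty. Not Aurora code.
final class PreemptionBackoff {
  private final long initialMs;
  private final long maxMs;
  // group -> delay to apply on the next failure
  private final Map<String, Long> delays = new HashMap<>();
  // group -> earliest time another scan may run
  private final Map<String, Long> nextAttempt = new HashMap<>();

  PreemptionBackoff(long initialMs, long maxMs) {
    this.initialMs = initialMs;
    this.maxMs = maxMs;
  }

  // True if a preemption scan for this group may run now.
  boolean mayAttempt(String group, long nowMs) {
    return nowMs >= nextAttempt.getOrDefault(group, 0L);
  }

  // A scan found no candidates: block the group for the current
  // delay, then double the delay (capped at maxMs).
  void recordFailure(String group, long nowMs) {
    long delay = delays.getOrDefault(group, initialMs);
    nextAttempt.put(group, nowMs + delay);
    delays.put(group, Math.min(delay * 2, maxMs));
  }

  // A scan found a match: reset the group's backoff.
  void recordSuccess(String group) {
    delays.remove(group);
    nextAttempt.remove(group);
  }
}
```

Grouping identical pending tasks under one backoff key keeps one chronically unsatisfiable job from forcing a full-store scan on every scheduling round.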
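For the harder option, the "kick off without dogpiling" part can be expressed as a single-flight async scan: the scheduler thread requests a scan and returns immediately, and at most one scan runs at a time. A sketch under those assumptions (class names hypothetical, not Aurora's):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a non-dogpiling asynchronous preemption
// scan. The scheduler thread calls requestScan() and moves on;
// results would be consumed on a later scheduling round.
final class AsyncPreemptorScanner {
  private final ExecutorService executor = Executors.newSingleThreadExecutor();
  private final AtomicBoolean scanInFlight = new AtomicBoolean(false);
  private final Runnable scan; // the (possibly weakly-consistent) candidate search

  AsyncPreemptorScanner(Runnable scan) {
    this.scan = scan;
  }

  // Starts a scan unless one is already running.
  // Returns true if a new scan was started.
  boolean requestScan() {
    if (!scanInFlight.compareAndSet(false, true)) {
      return false; // dogpile prevention: a scan is already in flight
    }
    executor.execute(() -> {
      try {
        scan.run();
      } finally {
        scanInFlight.set(false);
      }
    });
    return true;
  }
}
```

Because the scan no longer runs inside the scheduling round, it can use a weakly-consistent read of the task store; a slightly stale candidate set is acceptable since any preemption decision would be re-validated before slots are actually freed.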