When TaskSchedulerImpl fails to find an open slot for a task, it falls back to the preemptor.
This is problematic when the task store is large (on the order of 10k tasks) and there is a steady supply of PENDING tasks that open slots cannot satisfy: every unsatisfied scheduling round triggers a preemption scan over the task store. This manifests as an overall degraded/slow scheduler, and as slow-query log entries for the queries the preemptor issues.
Several approaches, not mutually exclusive, come to mind to improve this situation:
- (easy) More aggressively back off on tasks that cannot be satisfied
- (easy) Fall back to preemption less frequently
- (easy) Gather the list of slaves from AttributeStore rather than TaskStore. This breaks the operation into many smaller queries and reduces the amount of work when a match is found. However, it creates more work when no match is found, so this approach is probably not helpful on its own.
- (harder) Scan for preemption candidates asynchronously, freeing up the TaskScheduler thread and the storage write lock. Scans could be kicked off by the task scheduler, ideally in a way that doesn't dogpile. The scan could also read storage in a weakly-consistent way so that it contributes minimally to storage contention.
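The first two (easy) options amount to throttling how often a failed scheduling round is allowed to reach the preemptor. A minimal sketch of a per-group exponential backoff, assuming we key on some task-group identifier; all class and method names here are hypothetical, not Aurora's actual API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: exponential backoff for task groups whose
// preemption scans keep coming up empty. Not Aurora code.
final class PreemptionBackoff {
  private final long initialMs;
  private final long maxMs;
  // group -> delay to apply on the next failure
  private final Map<String, Long> delays = new HashMap<>();
  // group -> earliest time another scan may run
  private final Map<String, Long> nextAttempt = new HashMap<>();

  PreemptionBackoff(long initialMs, long maxMs) {
    this.initialMs = initialMs;
    this.maxMs = maxMs;
  }

  // True if a preemption scan for this group may run now.
  boolean mayAttempt(String group, long nowMs) {
    return nowMs >= nextAttempt.getOrDefault(group, 0L);
  }

  // A scan found no candidates: block the group for the current
  // delay, then double the delay (capped at maxMs).
  void recordFailure(String group, long nowMs) {
    long delay = delays.getOrDefault(group, initialMs);
    nextAttempt.put(group, nowMs + delay);
    delays.put(group, Math.min(delay * 2, maxMs));
  }

  // A scan found a match: reset the group's backoff.
  void recordSuccess(String group) {
    delays.remove(group);
    nextAttempt.remove(group);
  }
}
```

Grouping identical pending tasks under one backoff key keeps one chronically unsatisfiable job from forcing a full-store scan on every scheduling round.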
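For the harder option, the "kick off without dogpiling" part can be expressed as a single-flight async scan: the scheduler thread requests a scan and returns immediately, and at most one scan runs at a time. A sketch under those assumptions (class names hypothetical, not Aurora's):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a non-dogpiling asynchronous preemption
// scan. The scheduler thread calls requestScan() and moves on;
// results would be consumed on a later scheduling round.
final class AsyncPreemptorScanner {
  private final ExecutorService executor = Executors.newSingleThreadExecutor();
  private final AtomicBoolean scanInFlight = new AtomicBoolean(false);
  private final Runnable scan; // the (possibly weakly-consistent) candidate search

  AsyncPreemptorScanner(Runnable scan) {
    this.scan = scan;
  }

  // Starts a scan unless one is already running.
  // Returns true if a new scan was started.
  boolean requestScan() {
    if (!scanInFlight.compareAndSet(false, true)) {
      return false; // dogpile prevention: a scan is already in flight
    }
    executor.execute(() -> {
      try {
        scan.run();
      } finally {
        scanInFlight.set(false);
      }
    });
    return true;
  }
}
```

Because the scan no longer runs inside the scheduling round, it can use a weakly-consistent read of the task store; a slightly stale candidate set is acceptable since any preemption decision would be re-validated before slots are actually freed.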