[AURORA-1041] Allow job uptime stats to control scheduler updater pace - ASF JIRA

Details

Type: Task
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Client, Scheduler
Labels: None

Description

The current implementation of the scheduler updater relies on a user-defined batch_size value to determine how many instances can be updated simultaneously. While this approach is well understood and battle-tested, it comes with its own risks and inefficiencies:

- No knowledge of job health outside of an active batch. Once an instance passes the watch_secs interval, it is considered "healthy" and the updater never looks at it again. Even if updated instances start flapping later, the updater keeps going.
- The fixed batch_size value may artificially slow the updater down. It is usually chosen conservatively, as the maximum number of instances a service can tolerate having down at any given moment, and may not reflect the actual job restart pace (see related ~~AURORA-894~~).
- Instances are evaluated/updated in an ordered fashion. In an update that both changes existing instances and adds new ones, the new instances therefore come up only at the very end of the sequence.

The proposed solution will capitalize on the concept of job uptime introduced in ~~AURORA-290~~ and will allow the scheduler updater to proceed as long as the "X% of instances up over Y interval" job invariant is met.
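
To make the invariant concrete, here is a minimal Python sketch of how an updater could derive its pace from uptime stats instead of a fixed batch_size. It is an illustration only, not Aurora code: UptimeInvariant, update_budget, next_batch and the per-instance uptime map are all hypothetical names, and the sketch assumes that taking an instance down for an update resets its uptime to zero.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class UptimeInvariant:
    """Hypothetical job invariant: `percentage`% of instances up for `duration_secs`."""
    percentage: float     # X in "X% of instances up over Y interval"
    duration_secs: float  # Y in "X% of instances up over Y interval"


def update_budget(uptimes: Dict[int, float], invariant: UptimeInvariant) -> int:
    """Number of instances that may be taken down without breaking the invariant.

    `uptimes` maps instance ID to seconds of continuous uptime (an assumed
    input; the ticket does not define how the stats are delivered).
    """
    total = len(uptimes)
    if total == 0:
        return 0
    # Minimum number of instances that must satisfy the uptime requirement.
    required = math.ceil(total * invariant.percentage / 100.0)
    # Instances that currently satisfy it, across the whole job, not just
    # the active batch.
    satisfying = sum(1 for up in uptimes.values() if up >= invariant.duration_secs)
    return max(0, satisfying - required)


def next_batch(pending: List[int], uptimes: Dict[int, float],
               invariant: UptimeInvariant) -> List[int]:
    """Pick as many pending instances as the current budget allows."""
    return pending[:update_budget(uptimes, invariant)]
```

With 10 instances all up for 600 seconds and an invariant of 80% over 300 seconds, the budget is 2. If two already-updated instances start flapping, the budget drops to 0 and the updater pauses until uptime recovers, which addresses the first concern in the list above. A healthy job with fast restarts regains budget quickly, so the pace follows the actual restart behavior rather than a conservative constant.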

People

Assignee: Unassigned

Reporter: Maxim Khutornenko

Votes: 0

Watchers: 2

Dates

Created: 20/Jan/15 21:11

Updated: 03/Aug/15 18:58