[AURORA-1096] Scheduler updater should limit the number of job/instance events - ASF JIRA

Details

Type: Story
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.9.0
Component/s: Scheduler
Labels: None
Sprint: Twitter Aurora Q2'15 Sprint 7
Story Points: 5

Description

Large or flapping scheduler job updates may generate too many events in the update store. The update settings are fully controlled by the user, so a misconfigured job update can completely overwhelm our in-memory DB storage with job update instance events.

For example, a large flapping update with max_per_shard_failures and max_total_failures both set to the maximum INT value can, when left unattended, quickly consume all available RAM and kill the scheduler. A manual cleanup of the scheduler log would then be needed to bring the scheduler back up.

This becomes especially relevant with the introduction of update heartbeats (AURORA-690), which can further exacerbate the problem (e.g. when blockIfNoPulseAfterMs is set too low relative to the external service pulse rate).

We need to cap the per-job lifetime count of JobUpdateEvent and JobInstanceUpdateEvent instances. A nice bonus would be a hint in the UI when the event sequence has been cut off.
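
A minimal sketch of the proposed cap, for illustration only. It is written in Python rather than against the scheduler's actual storage code, and every name in it (CappedEventStore, max_events_per_job, record, is_truncated) is hypothetical and not part of Aurora. It assumes events are never deleted, so the stored count equals the lifetime count, and it assumes events past the cap are dropped; the ticket does not say whether the real fix drops new events or prunes old ones.

```python
from dataclasses import dataclass, field


@dataclass
class EventLog:
    """Events kept for one job, plus a flag a UI could use as a cut-off hint."""
    events: list = field(default_factory=list)
    truncated: bool = False


class CappedEventStore:
    """Hypothetical in-memory store that stops recording at a per-job cap."""

    def __init__(self, max_events_per_job):
        if max_events_per_job < 1:
            raise ValueError("max_events_per_job must be positive")
        self.max_events_per_job = max_events_per_job
        self._logs = {}

    def record(self, job_key, event):
        """Store the event; return False if it was dropped because of the cap."""
        log = self._logs.setdefault(job_key, EventLog())
        if len(log.events) >= self.max_events_per_job:
            # Assumption: drop the new event and remember that we did so.
            log.truncated = True
            return False
        log.events.append(event)
        return True

    def events(self, job_key):
        """Return a copy of the events recorded for the job."""
        log = self._logs.get(job_key)
        return list(log.events) if log else []

    def is_truncated(self, job_key):
        """True once at least one event for the job has been dropped."""
        log = self._logs.get(job_key)
        return bool(log and log.truncated)
```

With a cap like this, memory use per job is bounded by max_events_per_job regardless of how the user configures failure thresholds or pulse timeouts, and is_truncated gives the UI the information it needs to show that the sequence was cut off.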

Issue Links

is related to: AURORA-1097 Scheduler updater should suppress instance events on resume (Resolved)

People

Assignee: Joe Smith
Reporter: Maxim Khutornenko
Votes: 0
Watchers: 3

Dates

Created: 04/Feb/15 00:31
Updated: 20/Jul/15 16:43
Resolved: 17/Jul/15 16:00