[SLING-5965] Metrics and a Health-Check for Scheduler to detect long-running Quartz-Jobs - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: Commons Scheduler 2.5.0
Fix Version/s: Commons Scheduler 2.7.0
Component/s: Commons
Labels: None

Description

Sling Scheduler jobs (aka Quartz jobs) should typically be fast-running. They are served from a thread pool and should occupy a thread for only a short time.

If 'misbehaving' Quartz jobs run for a very long time, they occupy threads from that thread pool and thus degrade the performance of other scheduled Quartz jobs.

We should have metrics (using sling.commons.metrics) that expose the internals of Sling Scheduler, such as the average and maximum duration of scheduled jobs, how many jobs are currently running, and how long the oldest running job has been running.
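
To make the requested values concrete, here is a minimal plain-Java sketch of the bookkeeping such metrics would need. It is not the Commons Scheduler implementation and does not use the sling.commons.metrics API. The class `RunningJobTracker`, its method names, and the injectable clock are all hypothetical, chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical helper (not part of Sling): tracks running jobs and duration statistics. */
class RunningJobTracker {
    private final LongSupplier clockMillis;               // injectable clock, e.g. System::currentTimeMillis
    private final Map<String, Long> startTimes = new HashMap<>();
    private long finishedJobs;
    private long totalDurationMillis;
    private long maxMillis;

    RunningJobTracker(LongSupplier clockMillis) {
        this.clockMillis = clockMillis;
    }

    /** Record that a job started executing now. */
    synchronized void jobStarted(String jobId) {
        startTimes.put(jobId, clockMillis.getAsLong());
    }

    /** Record that a job finished and fold its duration into the statistics. */
    synchronized void jobFinished(String jobId) {
        Long start = startTimes.remove(jobId);
        if (start == null) {
            return;                                       // unknown job id: nothing to record
        }
        long duration = clockMillis.getAsLong() - start;
        finishedJobs++;
        totalDurationMillis += duration;
        maxMillis = Math.max(maxMillis, duration);
    }

    /** Number of jobs currently running. */
    synchronized int runningCount() {
        return startTimes.size();
    }

    /** How long the oldest running job has been running; 0 if no job is running. */
    synchronized long oldestRunningAgeMillis() {
        long now = clockMillis.getAsLong();
        long oldest = 0;
        for (long start : startTimes.values()) {
            oldest = Math.max(oldest, now - start);
        }
        return oldest;
    }

    /** Longest duration among finished jobs. */
    synchronized long maxDurationMillis() {
        return maxMillis;
    }

    /** Mean duration of finished jobs; 0.0 if none have finished. */
    synchronized double averageDurationMillis() {
        return finishedJobs == 0 ? 0.0 : (double) totalDurationMillis / finishedJobs;
    }
}
```

In a real integration the start/finish hooks would presumably be driven by the scheduler around each job execution, and the getters would be exposed as gauges or timers through the metrics service; that wiring is left out here.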

Based on this, a health check can monitor the 'oldest job running' metric and flag critical when, e.g., the oldest job has been running for more than 60,000 ms (configurable; 60,000 ms being the default).
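
The threshold rule could look roughly like the following plain-Java sketch. It deliberately avoids the Sling Health Check API. The class `OldestJobHealthCheck`, its `Status` enum, and the strict "greater than" comparison are assumptions made for illustration. Only the 60,000 ms default comes from the description above.

```java
/** Hypothetical sketch (not the Sling Health Check API): flags a long-running oldest job. */
class OldestJobHealthCheck {
    enum Status { OK, CRITICAL }

    /** Default threshold taken from the issue description: 60,000 ms. */
    static final long DEFAULT_MAX_AGE_MILLIS = 60_000L;

    private final long maxAgeMillis;

    /** Uses the default threshold. */
    OldestJobHealthCheck() {
        this(DEFAULT_MAX_AGE_MILLIS);
    }

    /** Uses a configured threshold in milliseconds. */
    OldestJobHealthCheck(long maxAgeMillis) {
        if (maxAgeMillis < 0) {
            throw new IllegalArgumentException("maxAgeMillis must not be negative");
        }
        this.maxAgeMillis = maxAgeMillis;
    }

    /** CRITICAL when the oldest running job is strictly older than the threshold. */
    Status evaluate(long oldestRunningAgeMillis) {
        return oldestRunningAgeMillis > maxAgeMillis ? Status.CRITICAL : Status.OK;
    }
}
```

For example, `new OldestJobHealthCheck().evaluate(75_000L)` yields `CRITICAL`, while a job that has been running for exactly 60,000 ms is still reported as `OK` under this assumed comparison.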