[YUNIKORN-646] Add metrics implementation: "allocating_latency_seconds" - ASF JIRA

Observation:

The container allocating latency metric stays at 0, while the number of allocation attempts fluctuates normally.
The scheduler metrics definitions are inconsistent and sometimes hard to understand.

Root cause analysis:

The metric "allocating_latency_seconds" is either not fully implemented or its implementation was dropped in recent releases. For example, ObserveSchedulingLatency() is currently not called when allocating containers.
The scheduler metrics were implemented by multiple developers over time without following a common convention.

Improvement Plan:

The top-level container allocation latency can be captured in the main scheduling routine in scheduler/context.go. Reason: the schedule() method in scheduler/context.go is the entry point that processes each partition in the scheduler and walks over each queue and application to check whether anything can be scheduled.
The metric name "allocating_latency_seconds" can be changed to "scheduling_latency_seconds". Reason: the metric is originally defined as "schedulingLatency" in metrics/scheduler.go, and consistent naming helps avoid confusion.
Other metric definitions and help messages can be improved to make metrics/scheduler.go consistent. (Open to creating a separate PR for the refactoring work.)
New metrics can be added later to monitor lower-level latency while the scheduler iterates over the partition list, queues, applications, requests, etc. Not included in this PR.

Links to: GitHub Pull Request #273