[FLINK-25888] Capture time that the job spends on deploying tasks - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.15.0
Component/s: Runtime / Coordination, Runtime / Metrics
Labels:
- pull-request-available

Description

~~FLINK-23976~~ added standardized metrics for capturing how much time we spend in each JobStatus. However, in practice certain states consist of several stages; for example, the RUNNING state also includes the deployment of tasks.

To get a better picture of where time is spent, I propose adding new metrics that capture the deployingTime based on the execution states. This will also get us closer to a proper uptime metric, which ideally would be runningTime minus the various stage time metrics.

A job is considered to be deploying:

- for batch jobs, if no task is running and at least one task is being deployed
- for streaming jobs, if at least one task is being deployed

The semantics are different for batch and streaming jobs because they differ in how they make progress. For a streaming job, all tasks need to be deployed for checkpointing to work. For batch jobs, any deployed task immediately starts making progress on the job.
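
The deploying rule above can be sketched as a small predicate over task states. This is only an illustration of the proposed semantics, not the code from the pull request: the class `DeployingStateSketch`, the enums `TaskState` and `JobType`, and the method `isDeploying` are all hypothetical names, and `TaskState` is a simplified stand-in rather than Flink's actual execution state enum.

```java
import java.util.List;

public class DeployingStateSketch {

    // Simplified, hypothetical stand-in for a task's execution state.
    public enum TaskState { CREATED, SCHEDULED, DEPLOYING, RUNNING, FINISHED }

    // Hypothetical job type flag used only for this illustration.
    public enum JobType { BATCH, STREAMING }

    // Returns true if the job counts as "deploying" under the proposed semantics.
    public static boolean isDeploying(JobType jobType, List<TaskState> taskStates) {
        boolean anyDeploying = taskStates.contains(TaskState.DEPLOYING);
        if (jobType == JobType.STREAMING) {
            // Streaming: at least one task is being deployed.
            return anyDeploying;
        }
        // Batch: at least one task is being deployed and no task is running yet.
        boolean anyRunning = taskStates.contains(TaskState.RUNNING);
        return anyDeploying && !anyRunning;
    }
}
```

With one task running and one still deploying, this sketch reports a streaming job as deploying but a batch job as not deploying, since the running batch task is already making progress.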

Issue Links

links to

GitHub Pull Request #18566

People

Assignee: Chesnay Schepler

Reporter: Chesnay Schepler

Votes: 0

Watchers: 1

Dates

Created: 31/Jan/22 09:13

Updated: 08/Feb/22 18:42

Resolved: 08/Feb/22 18:42