Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version: 1.17.0
Description
Flink job metrics return a stale value on the first request made after the metric's value changes.
Repro:
1. Run a Flink job with the fixed-delay restart strategy and a large number of restart attempts:
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 10000
flink run -Dexecution.checkpointing.interval="10s" -d -c org.apache.flink.streaming.examples.wordcount.WordCount /usr/lib/flink/examples/streaming/WordCount.jar
2. Kill one of the TaskManagers, which triggers a job restart.
3. After the job has restarted, fetch any job metric. The first request returns a stale (older) value, here 48:
[hadoop@ip-172-31-44-70 ~]$ curl http://jobmanager:52000/jobs/d24f7d74d541f1215a65395e0ebd898c/metrics?get=numRestarts | jq .
[
  {
    "id": "numRestarts",
    "value": "48"
  }
]
4. Subsequent requests return the correct value, here 49:
[hadoop@ip-172-31-44-70 ~]$ curl http://jobmanager:52000/jobs/d24f7d74d541f1215a65395e0ebd898c/metrics?get=numRestarts | jq .
[
  {
    "id": "numRestarts",
    "value": "49"
  }
]
5. Repeat steps 2 to 4. Each time, the first request after the metric changes returns the value from before the change; only the next request returns the updated value.
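
The two-request check in steps 3 and 4 could be scripted roughly as follows. This is only a sketch for reproducing the report, not part of Flink: the helper names (parse_metric, fetch_metric, read_twice, first_read_was_stale) and the pause_s parameter are made up for illustration, and the only thing taken from the report is the /jobs/<job id>/metrics?get=<metric> endpoint and its JSON response shape.

```python
import json
import time
import urllib.request
from typing import Callable, Tuple


def parse_metric(body: str, metric: str) -> str:
    """Extract one metric's value from the JSON array the endpoint returns."""
    for entry in json.loads(body):
        if entry.get("id") == metric:
            return entry["value"]
    raise KeyError(f"metric {metric!r} not in response")


def fetch_metric(base_url: str, job_id: str, metric: str) -> str:
    """GET <base_url>/jobs/<job_id>/metrics?get=<metric>, like the curl calls above."""
    url = f"{base_url}/jobs/{job_id}/metrics?get={metric}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_metric(resp.read().decode("utf-8"), metric)


def read_twice(fetch: Callable[[], str], pause_s: float = 1.0) -> Tuple[str, str]:
    """Issue two reads in a row and return both values.

    pause_s is a hypothetical knob; the report does not say whether any
    delay between the two requests is needed.
    """
    first = fetch()
    time.sleep(pause_s)
    second = fetch()
    return first, second


def first_read_was_stale(first: str, second: str) -> bool:
    """Heuristic for an increasing counter such as numRestarts.

    A first value lower than the second suggests the first read was stale,
    assuming the job did not restart again between the two requests.
    """
    return int(first) < int(second)
```

Run right after step 2, a call such as read_twice(lambda: fetch_metric("http://jobmanager:52000", "d24f7d74d541f1215a65395e0ebd898c", "numRestarts")) would be expected to return ("48", "49") in the scenario described above, and first_read_was_stale would then report True.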