Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version: 1.13.5
Description
The symptom is that the SlotManager metrics taskSlotsAvailable and taskSlotsTotal are missing after a SlotManager is suspended and then restarted. We noticed this issue when running 1.13.5, but I believe it also affects 1.14.x, 1.15.x, and master.
When a SlotManager is suspended, its metrics group is closed. When the SlotManager is started again, it attempts to re-register its metrics, but the registration fails because the underlying metrics group is still closed.
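The failure mode can be sketched in isolation. The classes below are hypothetical stand-ins, not Flink's actual SlotManager or metric group classes, and the sketch assumes that a closed group silently drops any later registration:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for a metric group (not a Flink class).
class SketchMetricGroup {
    private final Map<String, Supplier<Long>> gauges = new HashMap<>();
    private boolean closed;

    void gauge(String name, Supplier<Long> gauge) {
        // Assumption: once the group is closed, new registrations are ignored.
        if (!closed) {
            gauges.put(name, gauge);
        }
    }

    void close() {
        closed = true;
        gauges.clear();
    }

    boolean has(String name) {
        return gauges.containsKey(name);
    }
}

// Hypothetical stand-in for a SlotManager that reuses one group across restarts.
class SketchSlotManager {
    private final SketchMetricGroup metrics;

    SketchSlotManager(SketchMetricGroup metrics) {
        this.metrics = metrics;
    }

    void start() {
        // Registered on every start, including a restart after suspend().
        metrics.gauge("taskSlotsAvailable", () -> 0L);
        metrics.gauge("taskSlotsTotal", () -> 0L);
    }

    void suspend() {
        // Closing the group here is what makes the next start() a no-op.
        metrics.close();
    }
}

class SlotManagerMetricsSketch {
    public static void main(String[] args) {
        SketchMetricGroup group = new SketchMetricGroup();
        SketchSlotManager slotManager = new SketchSlotManager(group);

        slotManager.start();
        System.out.println("after start: " + group.has("taskSlotsTotal"));   // true

        slotManager.suspend();
        slotManager.start();
        System.out.println("after restart: " + group.has("taskSlotsTotal")); // false
    }
}
```

Under these assumptions, the metrics can only survive a restart if the group is not closed on suspend or if a fresh group is used on each start. Which of the two fits the real code is a design question for the actual fix.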
I was able to trace this issue by restarting ZooKeeper nodes in a staging environment and watching the JobManager (JM) with a debugger.
A concise test, which currently fails, shows the expected behavior – https://github.com/apache/flink/compare/master...baugarten:baugarten/slot-manager-missing-metrics?expand=1
I am happy to provide a PR to fix this issue, but would first like to confirm that this behavior is not intended.