FLINK-27420

Suspended SlotManager fails to re-register metrics when started again


Details

    Description

      The symptom is that the SlotManager metrics taskSlotsAvailable and taskSlotsTotal are missing after a SlotManager is suspended and then restarted. We noticed this issue when running 1.13.5, but I believe it also affects 1.14.x, 1.15.x, and master.

       

      When a SlotManager is suspended, its metrics group is closed. When the SlotManager is started again, it attempts to re-register its metrics, but the registration fails because the underlying metrics group is still closed.
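
      To make the failure mode concrete, here is a minimal, self-contained sketch of the lifecycle described above. It does not use Flink's real classes: ToyMetricGroup and ToySlotManager are hypothetical stand-ins, and the assumption that a closed group silently ignores new registrations is mine, based on the behavior reported in this issue.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for a metric group; NOT Flink's actual class.
class ToyMetricGroup {
    private final Map<String, Supplier<Long>> gauges = new HashMap<>();
    private boolean closed = false;

    // Assumed behavior: once closed, registrations are silently dropped.
    void gauge(String name, Supplier<Long> gauge) {
        if (!closed) {
            gauges.put(name, gauge);
        }
    }

    // Closing removes all metrics; the group never reopens.
    void close() {
        closed = true;
        gauges.clear();
    }

    boolean isRegistered(String name) {
        return gauges.containsKey(name);
    }
}

// Hypothetical stand-in for the SlotManager lifecycle described in this issue.
class ToySlotManager {
    private final ToyMetricGroup metricGroup;

    ToySlotManager(ToyMetricGroup metricGroup) {
        this.metricGroup = metricGroup;
    }

    // Registers the two slot metrics on every start.
    void start() {
        metricGroup.gauge("taskSlotsAvailable", () -> 0L);
        metricGroup.gauge("taskSlotsTotal", () -> 0L);
    }

    // Suspending closes the metric group, which is what later blocks re-registration.
    void suspend() {
        metricGroup.close();
    }
}

class SlotManagerMetricsSketch {
    public static void main(String[] args) {
        ToyMetricGroup group = new ToyMetricGroup();
        ToySlotManager slotManager = new ToySlotManager(group);

        slotManager.start();
        System.out.println(group.isRegistered("taskSlotsTotal")); // prints "true"

        slotManager.suspend();
        slotManager.start(); // re-registration hits the closed group
        System.out.println(group.isRegistered("taskSlotsTotal")); // prints "false"
    }
}
```

      In this toy model, the second start() has no effect because the group stays closed, which mirrors the missing metrics after a suspend and restart. Whether the right fix is to avoid closing the group on suspend or to obtain a fresh group on start is left open here.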

       

      I traced this issue by restarting ZooKeeper nodes in a staging environment and watching the JobManager (JM) with a debugger.

       

      A concise test, which currently fails, shows the expected behavior – https://github.com/apache/flink/compare/master...baugarten:baugarten/slot-manager-missing-metrics?expand=1

       

      I am happy to provide a PR to fix this issue, but would first like to confirm that this behavior is not intended.


          People

            baugarten Ben Augarten