Flink / FLINK-14685

ZooKeeperCheckpointIDCounter is permanently broken after losing its connection to ZK once


Details

    Description

      Currently, if ZooKeeperCheckpointIDCounter encounters the SUSPENDED state, i.e. a connection loss, it marks its state as invalid, so that all subsequent checkpoint ID counter operations fail.

      Coupled with JM leadership management, we generate a new ID counter when leadership is re-granted, so this has not been a problem so far. The semantics are wrong nonetheless, because the ID counter should only check whether the current state is SUSPENDED/LOST.
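
      The difference between the two semantics can be sketched as follows. This is a minimal, self-contained illustration and not Flink's actual code: the ConnState enum is a stand-in for Curator's ConnectionState, and the StickyCounter and CurrentStateCounter classes, their fields, and their method names are hypothetical. The counters are kept in memory instead of in ZooKeeper.

```java
import java.util.concurrent.atomic.AtomicLong;

public class CheckpointIdCounterSketch {

    /** Hypothetical stand-in for Curator's ConnectionState (only the values used here). */
    enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    /** Behaviour described in this issue: one SUSPENDED/LOST event invalidates the counter for good. */
    static class StickyCounter {
        private final AtomicLong next = new AtomicLong(1);
        private volatile boolean invalid = false; // set once, never reset

        void onStateChanged(ConnState newState) {
            if (newState == ConnState.SUSPENDED || newState == ConnState.LOST) {
                invalid = true;
            }
        }

        long getAndIncrement() {
            if (invalid) {
                throw new IllegalStateException("Connection was lost at some point");
            }
            return next.getAndIncrement();
        }
    }

    /** Proposed semantics: only the current connection state decides whether an operation may run. */
    static class CurrentStateCounter {
        private final AtomicLong next = new AtomicLong(1);
        private volatile ConnState current = ConnState.CONNECTED;

        void onStateChanged(ConnState newState) {
            current = newState; // a later RECONNECTED overwrites SUSPENDED
        }

        long getAndIncrement() {
            if (current == ConnState.SUSPENDED || current == ConnState.LOST) {
                throw new IllegalStateException("Connection is currently " + current);
            }
            return next.getAndIncrement();
        }
    }

    public static void main(String[] args) {
        StickyCounter sticky = new StickyCounter();
        CurrentStateCounter fixed = new CurrentStateCounter();

        // Simulate a short connection loss followed by a reconnect.
        for (ConnState s : new ConnState[] {ConnState.SUSPENDED, ConnState.RECONNECTED}) {
            sticky.onStateChanged(s);
            fixed.onStateChanged(s);
        }

        try {
            sticky.getAndIncrement();
            System.out.println("sticky counter recovered");
        } catch (IllegalStateException e) {
            System.out.println("sticky counter still broken: " + e.getMessage());
        }
        System.out.println("current-state counter returned id " + fixed.getAndIncrement());
    }
}
```

      With the sticky variant, the counter only becomes usable again if a new instance is created, which is what currently happens on re-granted leadership. The current-state variant recovers by itself once the connection is back.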

      It is also a blocker for upgrading to Curator 4.2 and tolerating the SUSPENDED state in LeaderLatch. lamber-ken provides a fix there.

      Besides, in a production scenario we once observed that, on a very fast SUSPENDED-RECONNECTED transition, the JM was not re-elected (this should no longer happen now that trohrmann has added the linearized leader operation), so the JM kept running with a broken ID counter.

      I think it is reasonable to pick lamber-ken's commit into a separate issue and fix these wrong semantics.

      CC GJL azagrebin

            People

              Assignee: Unassigned
              Reporter: Zili Chen (tison)