  Flink / FLINK-14685

ZooKeeperCheckpointIDCounter is permanently broken after losing its connection to ZK once

    Details

      Description

      Currently, if ZooKeeperCheckpointIDCounter sees the SUSPENDED state, i.e. a connection loss, it marks its own state as invalid, so all subsequent checkpoint ID counter operations fail.

      This has not been a problem so far because, coupled with JM leadership management, a new ID counter is generated when leadership is re-granted. The semantics are still wrong, though: the ID counter should only check whether the current state is SUSPENDED/LOST.
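
      The difference between the two semantics can be sketched as follows. This is a minimal, dependency-free illustration and not the actual Flink code: `ConnState` is a local stand-in for Curator's `ConnectionState`, the class names `StickyInvalidCounter` and `CurrentStateCounter` are hypothetical, and an in-memory `AtomicLong` replaces the count that the real counter keeps in ZooKeeper.

```java
import java.util.concurrent.atomic.AtomicLong;

// Local stand-in for Curator's ConnectionState (assumption: only these four states matter here).
enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

// Behaviour described in this issue: one SUSPENDED/LOST event marks the counter invalid for good.
class StickyInvalidCounter {
    private final AtomicLong next = new AtomicLong();
    private volatile boolean invalid; // set once, never reset

    void stateChanged(ConnState newState) {
        if (newState == ConnState.SUSPENDED || newState == ConnState.LOST) {
            invalid = true;
        }
    }

    long getAndIncrement() {
        if (invalid) {
            throw new IllegalStateException("connection was lost at some point");
        }
        return next.getAndIncrement();
    }
}

// Proposed semantics: only the current connection state decides whether an operation may run.
class CurrentStateCounter {
    private final AtomicLong next = new AtomicLong();
    private volatile ConnState lastState = ConnState.CONNECTED;

    void stateChanged(ConnState newState) {
        lastState = newState; // remember the latest state instead of a sticky flag
    }

    long getAndIncrement() {
        ConnState state = lastState;
        if (state == ConnState.SUSPENDED || state == ConnState.LOST) {
            throw new IllegalStateException("connection is currently " + state);
        }
        return next.getAndIncrement();
    }
}

class CounterDemo {
    public static void main(String[] args) {
        StickyInvalidCounter sticky = new StickyInvalidCounter();
        CurrentStateCounter current = new CurrentStateCounter();

        // Simulate a short connection loss followed by a reconnect.
        for (ConnState s : new ConnState[] {ConnState.SUSPENDED, ConnState.RECONNECTED}) {
            sticky.stateChanged(s);
            current.stateChanged(s);
        }

        try {
            sticky.getAndIncrement();
        } catch (IllegalStateException e) {
            System.out.println("sticky counter still fails: " + e.getMessage());
        }
        System.out.println("current-state counter works again: " + current.getAndIncrement());
    }
}
```

      In this sketch the sticky variant keeps failing after RECONNECTED, while the current-state variant rejects operations only while the connection is actually SUSPENDED or LOST.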

      It is also a blocker for upgrading to Curator 4.2 and tolerating the SUSPENDED state in LeaderLatch. lamber-ken provides a fix there.

      Besides, in a production scenario we once noticed that the JM was not re-elected when SUSPENDED-RECONNECTED happened very quickly (this shouldn't happen after Till Rohrmann added the linearized leader operation), so the JM kept running with a broken ID counter.

      I think it is reasonable to pick lamber-ken's commit as a separate issue and fix these wrong semantics.

      CC Gary Yao, Andrey Zagrebin

            People

            • Assignee:
              Unassigned
              Reporter:
              tison Zili Chen
            • Votes:
              0
              Watchers:
              4
