  1. Flink
  2. FLINK-26987

ZooKeeperStateHandleStore.getAllAndLock ends up in an infinite loop if there's an entry marked for deletion that's not cleaned up yet


Details

    Description

      ZooKeeperStateHandleStore.getAllAndLock is used when recovering CompletedCheckpoints. It iterates over all children and retries until it reaches a stable and consistent version (i.e. no entries are marked for deletion and no child nodes were added while accessing the ZK instance).

      Additionally, ZooKeeperStateHandleStore marks entries for deletion internally before actually deleting them. This can lead to a state where an entry is marked for deletion but discarding it failed, causing the cleanup to fail. The entry is left marked for deletion and another cleanup attempt is made. These retries can go on indefinitely, but users can limit the number of retries. In that case, the entry might be left marked for deletion.
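
      A minimal sketch of how an entry can end up stuck in the marked state once retries are bounded. This is not Flink's code: the class MarkThenDeleteSketch, the cleanUp method, and the discard predicate are hypothetical stand-ins, and the two in-memory sets only model the ZooKeeper nodes.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical in-memory model of "mark for deletion, then discard, then delete". */
public class MarkThenDeleteSketch {
    final Set<String> entries = new HashSet<>();
    final Set<String> markedForDeletion = new HashSet<>();

    void add(String name) {
        entries.add(name);
    }

    /**
     * Marks the entry, then tries to discard it up to (1 + maxRetries) times.
     * Returns true if the entry was removed, false if it was left marked.
     */
    boolean cleanUp(String name, Predicate<String> discard, int maxRetries) {
        markedForDeletion.add(name); // step 1: mark before doing any deletion work
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            if (discard.test(name)) { // step 2: discard the referenced state
                entries.remove(name); // step 3: only now delete the entry itself
                markedForDeletion.remove(name);
                return true;
            }
        }
        // Retries exhausted: the entry still exists and is still marked.
        return false;
    }

    public static void main(String[] args) {
        MarkThenDeleteSketch store = new MarkThenDeleteSketch();
        store.add("checkpoint-7");
        // A discard that always fails, combined with a retry limit of 2.
        boolean removed = store.cleanUp("checkpoint-7", name -> false, 2);
        System.out.println(removed + " " + store.markedForDeletion.contains("checkpoint-7"));
        // prints "false true"
    }
}
```

      With an unbounded retry count the loop would simply keep going; the leftover marked entry only appears when the limit is hit.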

      Restarting the Flink cluster will then access this ZooKeeperStateHandleStore to recover the checkpoints while the entry is still marked for deletion. This causes an error in ZooKeeperStateHandleStore.getAllAndLock, which results in an undesired retry loop.

      We actually don't need to retry in that case because the child can be ignored, as far as I can see.
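
      The idea of ignoring such a child could look roughly like the sketch below. It is an illustration only, not the actual patch: ToyStore, Child, childVersion, and getAllSkippingMarked are hypothetical names, and an in-memory map stands in for the ZooKeeper subtree. The loop retries only when the set of children changed during the read; a child that is marked for deletion is skipped instead of triggering another attempt.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in; none of these types are Flink's real classes. */
public class SkipMarkedEntriesSketch {

    /** A child node: its payload plus a flag modelling "marked for deletion". */
    static class Child {
        final String state;
        final boolean markedForDeletion;

        Child(String state, boolean markedForDeletion) {
            this.state = state;
            this.markedForDeletion = markedForDeletion;
        }
    }

    /** Toy store whose version counter changes whenever a child is added. */
    static class ToyStore {
        final Map<String, Child> children = new LinkedHashMap<>();
        int childVersion = 0;

        void put(String name, Child child) {
            children.put(name, child);
            childVersion++;
        }
    }

    /** Reads all children, skipping marked ones; retries only on concurrent changes. */
    static List<String> getAllSkippingMarked(ToyStore store, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            int versionBefore = store.childVersion;
            List<String> result = new ArrayList<>();
            for (Map.Entry<String, Child> child : store.children.entrySet()) {
                if (child.getValue().markedForDeletion) {
                    continue; // ignore the leftover entry instead of retrying
                }
                result.add(child.getKey());
            }
            if (store.childVersion == versionBefore) {
                return result; // no children were added while reading: stable view
            }
        }
        throw new IllegalStateException("No stable view after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) {
        ToyStore store = new ToyStore();
        store.put("checkpoint-1", new Child("state-1", false));
        store.put("checkpoint-2", new Child("state-2", true)); // left marked by a failed cleanup
        store.put("checkpoint-3", new Child("state-3", false));
        System.out.println(getAllSkippingMarked(store, 3));
        // prints "[checkpoint-1, checkpoint-3]"
    }
}
```

      In this model a single attempt is enough even though checkpoint-2 is still marked, which is the behaviour the description argues for.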

            People

              mapohl Matthias Pohl