Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-6612

ZooKeeperStateHandleStore does not guard against concurrent delete operations

    Details

      Description

      The ZooKeeperStateHandleStore does not guard against concurrent delete operations which could happen in case of a lost leadership and a new leadership grant. The problem is that checkpoint nodes can get deleted even after they have been recovered by another ZooKeeperCompletedCheckpointStore. This corrupts the recovered checkpoint and thwarts future recoveries.

      I propose to add reference counting to the ZooKeeperStateHandleStore. That way, we can monitor how many concurrent processes have a hold on a given checkpoint node. Only if the reference count reaches 0, we are allowed to delete the checkpoint node and dispose the checkpoint data.

      Stephan proposed to use ephemeral child nodes to track the reference count of a checkpoint node. That way we are sure that locks on the a checkpoint node are released in case of JobManager failures.

        Attachments

          Activity

            People

            • Assignee:
              till.rohrmann Till Rohrmann
              Reporter:
              till.rohrmann Till Rohrmann
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: