Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-6612

ZooKeeperStateHandleStore does not guard against concurrent delete operations

    XMLWordPrintableJSON

Details

    Description

      The ZooKeeperStateHandleStore does not guard against concurrent delete operations which could happen in case of a lost leadership and a new leadership grant. The problem is that checkpoint nodes can get deleted even after they have been recovered by another ZooKeeperCompletedCheckpointStore. This corrupts the recovered checkpoint and thwarts future recoveries.

      I propose to add reference counting to the ZooKeeperStateHandleStore. That way, we can monitor how many concurrent processes have a hold on a given checkpoint node. Only if the reference count reaches 0, we are allowed to delete the checkpoint node and dispose the checkpoint data.

      Stephan proposed to use ephemeral child nodes to track the reference count of a checkpoint node. That way we are sure that locks on the a checkpoint node are released in case of JobManager failures.

      Attachments

        Activity

          People

            trohrmann Till Rohrmann
            trohrmann Till Rohrmann
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: