The ZooKeeperStateHandleStore does not guard against concurrent delete operations which could happen in case of a lost leadership and a new leadership grant. The problem is that checkpoint nodes can get deleted even after they have been recovered by another ZooKeeperCompletedCheckpointStore. This corrupts the recovered checkpoint and thwarts future recoveries.
I propose to add reference counting to the ZooKeeperStateHandleStore. That way, we can monitor how many concurrent processes have a hold on a given checkpoint node. Only if the reference count reaches 0, we are allowed to delete the checkpoint node and dispose the checkpoint data.
Stephan proposed to use ephemeral child nodes to track the reference count of a checkpoint node. That way we are sure that locks on the a checkpoint node are released in case of JobManager failures.