[FLINK-6612] ZooKeeperStateHandleStore does not guard against concurrent delete operations - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.3.0, 1.4.0
Fix Version/s: 1.3.0, 1.4.0
Component/s: Runtime / Coordination, Runtime / State Backends
Labels:
None

Description

The ZooKeeperStateHandleStore does not guard against concurrent delete operations which could happen in case of a lost leadership and a new leadership grant. The problem is that checkpoint nodes can get deleted even after they have been recovered by another ZooKeeperCompletedCheckpointStore. This corrupts the recovered checkpoint and thwarts future recoveries.

I propose to add reference counting to the ZooKeeperStateHandleStore. That way, we can monitor how many concurrent processes have a hold on a given checkpoint node. Only if the reference count reaches 0, we are allowed to delete the checkpoint node and dispose the checkpoint data.

Stephan proposed to use ephemeral child nodes to track the reference count of a checkpoint node. That way we are sure that locks on the a checkpoint node are released in case of JobManager failures.

Attachments

Issue Links

links to

GitHub Pull Request #3939

GitHub Pull Request #3940

Activity

People

Assignee:: Till Rohrmann

Reporter:: Till Rohrmann

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 17/May/17 09:32

Updated:: 19/May/17 10:09

Resolved:: 19/May/17 10:09