Details
Type: Sub-task
Status: Closed
Priority: Minor
Resolution: Abandoned
Description
Implement StateHandleStore using jetcd.
The existing ZooKeeperStateHandleStore stores the data in a DFS file and records only its metadata in a ZooKeeper node, in order to keep the data size of the ZooKeeper node small. EtcdStateHandleStore should work in the same way, but there is a corner case that has to be handled carefully with etcd.
As described in FLINK-6612, the ZooKeeperStateHandleStore did not guard against concurrent delete operations, which can happen when leadership is lost and then granted to a new leader. The problem is that checkpoint nodes can get deleted even after they have been recovered by another ZooKeeperCompletedCheckpointStore. This corrupts the recovered checkpoint and thwarts future recoveries. To guard against deletions of ZooKeeper nodes that are still being used by a different ZooKeeperStateHandleStore, a locking mechanism was introduced which makes sure that ZooKeeper nodes can be deleted only after all ZooKeeperStateHandleStores have released their locks. The locking mechanism is implemented via ephemeral child nodes of the respective ZooKeeper node. Whenever a ZooKeeperStateHandleStore wants to lock a ZNode, thus protecting it from being deleted, it creates an ephemeral child node. The node's name is unique to the ZooKeeperStateHandleStore instance. A delete operation then only deletes the node if it does not have any children. To guard against orphaned lock nodes, they are created as ephemeral nodes. This means that they are deleted by ZooKeeper once the connection of the ZooKeeper client that created them times out.
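The locking rule described above can be sketched with a small in-memory model. This is only an illustration under simplifying assumptions: the class `ZnodeLockSketch`, its method names, and the store ids are hypothetical, and it uses neither the real ZooKeeper client nor Flink's actual ZooKeeperStateHandleStore code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy in-memory model of locking via ephemeral child nodes; not the ZooKeeper or Flink API. */
public class ZnodeLockSketch {
    // State node path -> names of its ephemeral lock children (one per store instance).
    private final Map<String, Set<String>> locksByNode = new HashMap<>();

    /** Creates the state node and locks it for the creating store. */
    public void addAndLock(String path, String storeId) {
        locksByNode.computeIfAbsent(path, p -> new HashSet<>()).add(storeId);
    }

    /** Locks an existing state node on behalf of another store, e.g. during recovery. */
    public void lock(String path, String storeId) {
        Set<String> locks = locksByNode.get(path);
        if (locks == null) {
            throw new IllegalStateException("No such node: " + path);
        }
        locks.add(storeId);
    }

    /** Releases this store's lock, then deletes the node only if no lock children remain. */
    public boolean releaseAndTryRemove(String path, String storeId) {
        Set<String> locks = locksByNode.get(path);
        if (locks == null) {
            return false;
        }
        locks.remove(storeId);
        if (!locks.isEmpty()) {
            return false; // another store still holds a lock, so the node must survive
        }
        locksByNode.remove(path);
        return true;
    }

    /** Models a session timeout: all ephemeral lock nodes of that client disappear. */
    public void sessionExpired(String storeId) {
        for (Set<String> locks : locksByNode.values()) {
            locks.remove(storeId);
        }
    }

    public boolean exists(String path) {
        return locksByNode.containsKey(path);
    }

    public static void main(String[] args) {
        ZnodeLockSketch zk = new ZnodeLockSketch();
        zk.addAndLock("/checkpoints/1", "store-A");
        zk.lock("/checkpoints/1", "store-B"); // new leader recovers the checkpoint
        System.out.println(zk.releaseAndTryRemove("/checkpoints/1", "store-A")); // false
        zk.sessionExpired("store-B"); // orphaned lock is cleaned up with the session
        System.out.println(zk.releaseAndTryRemove("/checkpoints/1", "store-A")); // true
    }
}
```

In this model the old leader's delete is refused while the new leader still holds its lock, and the node becomes removable once that lock vanishes with the session.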
The solution relies on two ZooKeeper properties: a node cannot be deleted while it still has child nodes, and an ephemeral node is deleted by the ZooKeeper server once the corresponding client is disconnected. Since etcd does not have a hierarchical structure like ZooKeeper, there is no actual relation between a key and its so-called parent directory. Suppose we create a persistent key-value pair to store the metadata and then create an ephemeral key-value pair that uses the previous key as its prefix. Once the client disconnects from the etcd server, the ephemeral key-value pair is deleted from the etcd server, but this has no effect on its parent key-value pair. Conversely, the existence of the prefixed key does not prevent the parent key from being deleted.
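The difference can be made concrete with a toy model of a flat keyspace with leases. It is a sketch under stated assumptions: the class `FlatKeyLeaseSketch`, its methods, the key names, and the lease id 7 are all hypothetical, and it does not call jetcd or any real etcd API. It only mimics the fact that a key is a plain string and that a lease-bound key disappears when its lease is revoked or expires.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy in-memory model of a flat keyspace with leases; not the jetcd or etcd API. */
public class FlatKeyLeaseSketch {
    /** Marks a key that is not attached to any lease, i.e. a persistent key. */
    public static final long NO_LEASE = 0L;

    // Key -> id of the lease it is attached to. Keys are plain strings: there is no tree.
    private final TreeMap<String, Long> leaseByKey = new TreeMap<>();

    public void put(String key, long leaseId) {
        leaseByKey.put(key, leaseId);
    }

    /** Deletes exactly one key; keys that merely share its prefix are neither checked nor touched. */
    public boolean delete(String key) {
        return leaseByKey.remove(key) != null;
    }

    /** Models lease expiry after a client disconnects: only keys bound to that lease vanish. */
    public void revokeLease(long leaseId) {
        leaseByKey.values().removeIf(id -> id == leaseId);
    }

    /** Returns all keys starting with the given prefix, in sorted order. */
    public List<String> keysWithPrefix(String prefix) {
        List<String> result = new ArrayList<>();
        for (String key : leaseByKey.tailMap(prefix, true).keySet()) {
            if (!key.startsWith(prefix)) {
                break; // sorted map: matching keys are contiguous
            }
            result.add(key);
        }
        return result;
    }

    public static void main(String[] args) {
        FlatKeyLeaseSketch etcd = new FlatKeyLeaseSketch();
        etcd.put("/checkpoints/1", NO_LEASE);         // persistent metadata key
        etcd.put("/checkpoints/1/lock-store-B", 7L);  // "ephemeral" lock key bound to lease 7
        // Unlike a ZooKeeper node with children, the "parent" key can be deleted right away.
        System.out.println(etcd.delete("/checkpoints/1"));         // true
        System.out.println(etcd.keysWithPrefix("/checkpoints/1")); // [/checkpoints/1/lock-store-B]
        etcd.revokeLease(7L);
        System.out.println(etcd.keysWithPrefix("/checkpoints/1")); // []
    }
}
```

The lock key here protects nothing by itself, so any protection of the metadata key would have to be implemented explicitly by the store.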
My proposal for tackling this case is described in Part 6, "Implementation of etcd based StateHandleStore", of the design doc. Any comments or suggestions would be appreciated.
Issue Links
- is superseded by FLINK-12884 FLIP-144: Native Kubernetes HA Service (Closed)