[FLINK-22636] Group job specific ZooKeeper HA services under common jobs/<JobID> zNode - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.13.0, 1.14.0, 1.12.3
Fix Version/s: 1.14.0
Component/s: Runtime / Coordination
Labels:
- pull-request-available

Release Note:

The ZooKeeper job-specific HA services are now grouped under a zNode named after the respective `JobID`. Moreover, the config options `high-availability.zookeeper.path.latch`, `high-availability.zookeeper.path.leader`, `high-availability.zookeeper.path.checkpoints` and `high-availability.zookeeper.path.checkpoint-counter` have been removed and therefore no longer have any effect.
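
As a rough aid when upgrading, the following sketch scans the text of a configuration file for the four removed options. The function name `find_removed_options` and the flat `key: value` line parsing are assumptions made for illustration; this is not part of Flink.

```python
# Options named in the release note as removed.
REMOVED_OPTIONS = (
    "high-availability.zookeeper.path.latch",
    "high-availability.zookeeper.path.leader",
    "high-availability.zookeeper.path.checkpoints",
    "high-availability.zookeeper.path.checkpoint-counter",
)


def find_removed_options(config_text):
    """Return the removed options that are still set in config_text.

    Assumes flat "key: value" lines (hypothetical simplification);
    blank lines and lines starting with '#' are skipped.
    """
    found = []
    for line in config_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        # Everything before the first colon is treated as the option key.
        key = stripped.split(":", 1)[0].strip()
        if key in REMOVED_OPTIONS and key not in found:
            found.append(key)
    return found
```

Any option the function reports can be deleted from the configuration, since it is ignored after this change.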

Description

To make cleaning up the ZooKeeper HA services easier, I suggest grouping job-specific services under a common jobs/<JobID> zNode. That way, cleaning up the job-specific ZooKeeper data becomes trivial: simply delete the jobs/<JobID> node.

Currently, our ZooKeeper layout is not well organized: the data belonging to a single job is spread across several top-level nodes. The current layout looks like this:

clusterID -> jobgraphs -> <job-id>
          -> checkpoints -> <job-id> -> checkpoint-1
          -> checkpoint-counter -> <job-id> -> counter
          -> leaderlatch -> dispatcher_lock
                         -> resource_manager_lock
                         -> <job-id>
          -> leader -> dispatcher_lock
                    -> resource_manager_lock
                    -> <job-id>
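
To illustrate why cleanup is awkward under this layout, the sketch below lists the HA-service zNodes that belong to a single job. The helper name `old_job_paths` and the slash-separated path format are assumptions for illustration; the node names are taken from the tree above.

```python
def old_job_paths(cluster_id, job_id):
    """Return the job-specific HA-service zNode paths in the current layout.

    Each path sits under a different parent node, so removing a job's
    data means issuing one delete per parent. The jobgraphs node is left
    out because its position does not change in the proposal.
    """
    parents = ["checkpoints", "checkpoint-counter", "leaderlatch", "leader"]
    return [f"/{cluster_id}/{parent}/{job_id}" for parent in parents]
```

For a hypothetical cluster `c` and job `j`, this yields four unrelated paths that have no job-specific parent in common.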

The new layout could look like this:

clusterID -> jobgraphs -> <job-id>
          -> jobs -> <job-id> -> checkpoints -> checkpoint-1
                              -> checkpoint_id_counter -> counter
                              -> leader -> latch
                                        -> connection_info
          -> leader -> dispatcher -> latch
                                  -> connection_info
                    -> resource_manager -> latch
                                        -> connection_info
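
A minimal sketch of the cleanup this layout enables, using an in-memory set of paths in place of a real ZooKeeper ensemble. The helper names (`new_job_root`, `new_job_paths`, `delete_subtree`) are hypothetical; only the node names come from the tree above.

```python
def new_job_root(cluster_id, job_id):
    """Return the common parent zNode of a job's HA data in the proposed layout."""
    return f"/{cluster_id}/jobs/{job_id}"


def new_job_paths(cluster_id, job_id):
    """Return the job-specific leaf paths shown in the proposed layout."""
    root = new_job_root(cluster_id, job_id)
    return [
        f"{root}/checkpoints/checkpoint-1",
        f"{root}/checkpoint_id_counter/counter",
        f"{root}/leader/latch",
        f"{root}/leader/connection_info",
    ]


def delete_subtree(paths, root):
    """Simulate a recursive delete on an in-memory set of paths.

    Keeps every path that is neither root itself nor located below it.
    The trailing slash in the prefix prevents "job-a" from matching "job-ab".
    """
    prefix = root + "/"
    return {p for p in paths if p != root and not p.startswith(prefix)}


# Two jobs share one cluster; removing job-a leaves job-b untouched.
paths = set(new_job_paths("cluster", "job-a")) | set(new_job_paths("cluster", "job-b"))
remaining = delete_subtree(paths, new_job_root("cluster", "job-a"))
```

Against a real ensemble, the same step would be a single recursive delete of the job root, for example via Apache Curator's `deletingChildrenIfNeeded()`.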

Issue Links

causes

FLINK-22745 MesosWorkerStore is started with an illegal namespace

Closed

FLINK-22784 Jepsen tests broken due to change in zNode layout

Closed

is related to

FLINK-20695 Zookeeper node under leader and leaderlatch is not deleted after job finished

Closed

links to

GitHub Pull Request #15893

Activity

Till Rohrmann added a comment - 18/May/21 13:46

Fixed via 79936be37dff2756f3829f89deec00a676db323d

People

Assignee: Till Rohrmann

Reporter: Till Rohrmann

Votes: 0

Watchers: 4

Dates

Created: 11/May/21 14:47

Updated: 28/Aug/21 12:10

Resolved: 18/May/21 13:46

Flink