[FLINK-13633] Move submittedJobGraph and completedCheckpoint to cluster-id subdirectory of high-availability storage - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: 1.10.0
Component/s: Runtime / Coordination
Labels:
- pull-request-available

Release Note:
All highly available artifacts stored by Apache Flink are now stored under `HA_STORAGE_DIR/HA_CLUSTER_ID`, where `HA_STORAGE_DIR` is configured by `high-availability.storageDir` and `HA_CLUSTER_ID` is configured by `high-availability.cluster-id`.
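As an illustration of the release note, the following minimal sketch shows how the two configuration keys combine into the final location. The function name `ha_cluster_path` and the plain `dict` standing in for the Flink configuration are hypothetical and exist only for this example; Flink's own path resolution is implemented in Java and may normalize paths differently.

```python
def ha_cluster_path(config: dict) -> str:
    """Return HA_STORAGE_DIR/HA_CLUSTER_ID for the given configuration.

    `config` is a plain dict standing in for the Flink configuration
    (an assumption made for this sketch, not Flink's API).
    """
    # Drop trailing slashes from the storage dir and surrounding slashes
    # from the cluster id so the two parts are joined by exactly one "/".
    storage_dir = config["high-availability.storageDir"].rstrip("/")
    cluster_id = config["high-availability.cluster-id"].strip("/")
    return f"{storage_dir}/{cluster_id}"


if __name__ == "__main__":
    example = {
        "high-availability.storageDir": "hdfs:///flink/ha/",
        "high-availability.cluster-id": "application_1",
    }
    print(ha_cluster_path(example))
```

The sketch requires both keys to be present and does not handle a storage directory that consists only of a scheme and root (for example `hdfs:///`).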

Description

Currently, when high availability is enabled, the HA storage directory is laid out as shown below. The submittedJobGraph and completedCheckpoint directories are stored directly under the HA storage path. This is reasonable when the Flink cluster finishes normally. However, when the Yarn application fails or is killed, the submittedJobGraph and completedCheckpoint files remain there forever, and we cannot even tell which Flink cluster (Yarn application) they belong to. So I suggest moving them into an application subdirectory. External tools could then be used to clean up these residual files.

Also, we need to do a best-effort clean-up before the Flink cluster finishes.

Current HA storage directory structure

└── <high-availability.storageDir>
    ├── submittedJobGraph
    │   ├── <jobgraph1> (randomly named)
    │   └── <jobgraph2> (randomly named)
    ├── completedCheckpoint
    │   ├── <checkpoint1> (randomly named)
    │   ├── <checkpoint2> (randomly named)
    │   └── <checkpoint3> (randomly named)
    └── <high-availability.cluster-id>
        └── blob
            └── <blob1> (named as [no_job|job_<job-id>]/blob_<blob-key>)

New HA storage directory structure

└── <high-availability.storageDir>
    └── <high-availability.cluster-id>
        ├── submittedJobGraph
        │   ├── <jobgraph1> (randomly named)
        │   └── <jobgraph2> (randomly named)
        ├── completedCheckpoint
        │   ├── <checkpoint1> (randomly named)
        │   ├── <checkpoint2> (randomly named)
        │   └── <checkpoint3> (randomly named)
        └── blob
            └── <blob1> (named as [no_job|job_<job-id>]/blob_<blob-key>)
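
With every artifact grouped under its cluster-id, an external clean-up tool only has to compare the subdirectory names against the clusters that are still alive. The following is a minimal sketch of that idea, not part of Flink. It assumes the HA storage directory is reachable as a local or mounted filesystem path (an HDFS or S3 deployment would need the corresponding client instead), and that the caller already knows the set of active cluster ids, for example from the Yarn ResourceManager. The function name `find_residual_cluster_dirs` is hypothetical.

```python
from pathlib import Path


def find_residual_cluster_dirs(storage_dir, active_cluster_ids):
    """List cluster-id subdirectories that belong to no active cluster.

    storage_dir: the directory configured by high-availability.storageDir,
        assumed here to be a local or mounted path.
    active_cluster_ids: set of cluster ids (e.g. Yarn application ids)
        that are still running; how to obtain it is left to the caller.
    """
    residual = []
    for entry in sorted(Path(storage_dir).iterdir()):
        # Each direct subdirectory is expected to be named after a cluster-id.
        if entry.is_dir() and entry.name not in active_cluster_ids:
            residual.append(entry)
    return residual


if __name__ == "__main__":
    # Report only; deleting is deliberately left to the operator.
    for path in find_residual_cluster_dirs("/tmp/flink-ha", {"application_2"}):
        print("residual:", path)
```

The sketch only reports candidates and never deletes anything, since removing the directory of a cluster that is merely restarting would destroy its recovery state.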

Issue Links

links to

GitHub Pull Request #9598

People

Assignee: Yang Wang

Reporter: Yang Wang

Votes: 1

Watchers: 7

Dates

Created: 07/Aug/19 10:36

Updated: 17/Sep/19 12:46

Resolved: 17/Sep/19 12:46

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 10m