Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Done
None
Release Note: All highly available artifacts stored by Apache Flink will now be stored under `HA_STORAGE_DIR/HA_CLUSTER_ID`, with `HA_STORAGE_DIR` configured by `high-availability.storageDir` and `HA_CLUSTER_ID` configured by `high-availability.cluster-id`.
Description
Currently, if high availability is enabled, the HA storage directory is laid out as shown below. The submittedJobGraph and completedCheckpoint files are stored directly under the HA storage path. That is fine when the Flink cluster finishes normally, but when the Yarn application fails or is killed they remain there forever, and we cannot even tell which Flink cluster (Yarn application) they belong to. So I suggest moving them into a per-application subdirectory, so that external tools can be used to clean up these residual files.
Also, we should do a best-effort clean-up before the Flink cluster finishes.
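As a rough sketch of what such an external clean-up tool could look like, assuming the per-application layout proposed below (the class name, the Hadoop `FileSystem`-based approach, and the way the set of finished cluster ids is obtained are all illustrative assumptions, not part of this proposal):

```java
import java.io.IOException;
import java.net.URI;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HaStorageCleanup {

    /**
     * Deletes <haStorageDir>/<cluster-id> for every cluster id in {@code deadClusterIds}.
     * The caller is expected to supply the ids of Yarn applications that have already
     * finished, failed or been killed.
     */
    public static void cleanUp(URI haStorageDir, Set<String> deadClusterIds) throws IOException {
        Configuration hadoopConf = new Configuration();
        FileSystem fs = FileSystem.get(haStorageDir, hadoopConf);

        for (FileStatus status : fs.listStatus(new Path(haStorageDir))) {
            String clusterId = status.getPath().getName();
            if (status.isDirectory() && deadClusterIds.contains(clusterId)) {
                // Recursively remove all residual job graphs, completed checkpoints
                // and blobs that belonged to the dead cluster.
                fs.delete(status.getPath(), true);
            }
        }
    }
}
```

In practice the set of dead cluster ids could be collected from the Yarn ResourceManager (e.g. via `yarn application -list`); that part is left out of the sketch.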
Current HA storage directory structure
└── <high-availability.storageDir>
    ├── submittedJobGraph
    │   ├── <jobgraph1> (random named)
    │   └── <jobgraph2> (random named)
    ├── completedCheckpoint
    │   ├── <checkpoint1> (random named)
    │   ├── <checkpoint2> (random named)
    │   └── <checkpoint3> (random named)
    └── <high-availability.cluster-id>
        └── blob
            └── <blob1> (named as [no_job|job_<job-id>]/blob_<blob-key>)
The new HA storage directory structure
└── <high-availability.storageDir>
    └── <high-availability.cluster-id>
        ├── submittedJobGraph
        │   ├── <jobgraph1> (random named)
        │   └── <jobgraph2> (random named)
        ├── completedCheckpoint
        │   ├── <checkpoint1> (random named)
        │   ├── <checkpoint2> (random named)
        │   └── <checkpoint3> (random named)
        └── blob
            └── <blob1> (named as [no_job|job_<job-id>]/blob_<blob-key>)
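To make the mapping between the two configuration options and the layout above concrete, here is a minimal sketch (the HDFS path, the example Yarn application id used as the cluster id, and ZooKeeper as the HA mode are all illustrative assumptions):

```java
import org.apache.flink.configuration.Configuration;

public class HaLayoutExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setString("high-availability", "zookeeper");
        // HA_STORAGE_DIR from the release note above
        conf.setString("high-availability.storageDir", "hdfs:///flink/ha");
        // HA_CLUSTER_ID; on Yarn this normally defaults to the application id,
        // setting it explicitly here is just for illustration
        conf.setString("high-availability.cluster-id", "application_1234567890123_0042");

        // With these settings, job graphs, completed checkpoints and blobs would all
        // end up under hdfs:///flink/ha/application_1234567890123_0042/
        System.out.println(conf);
    }
}
```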