Flink / FLINK-13633

Move submittedJobGraph and completedCheckpoint to cluster-id subdirectory of high-availability storage


    Details

    • Release Note:
      All highly available artifacts stored by Apache Flink will now be stored under `HA_STORAGE_DIR/HA_CLUSTER_ID`, with `HA_STORAGE_DIR` configured by `high-availability.storageDir` and `HA_CLUSTER_ID` configured by `high-availability.cluster-id`.

      Description

      Currently, when high availability is enabled, the HA storage directory is laid out as shown below. The submittedJobGraph and completedCheckpoint files are stored directly under the HA storage path. This is fine when the Flink cluster finishes normally. However, when the Yarn application fails or is killed, the submittedJobGraph and completedCheckpoint files remain there forever, and we cannot even tell which Flink cluster (Yarn application) they belong to. So I suggest moving them into a per-application subdirectory, so that external tools can be used to clean up these residual files.

      We should also do a best-effort clean-up before the Flink cluster finishes.
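To illustrate the kind of external clean-up tool mentioned above, here is a minimal sketch in Java. It assumes the HA storage directory is reachable as a local or mounted path through `java.nio.file` (a real deployment on HDFS or S3 would need that file system's own client), and that the caller already knows which cluster ids are still active. The class name `HaResidueCleaner` and both method names are hypothetical and are not part of Flink.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.stream.Stream;

// Hypothetical helper, not part of Flink: finds and removes per-cluster
// subdirectories that no longer belong to an active cluster.
final class HaResidueCleaner {
    private HaResidueCleaner() {}

    // Returns the names of subdirectories of storageDir whose name is not
    // in activeClusterIds, sorted alphabetically.
    static List<String> findStaleClusterDirs(Path storageDir, Set<String> activeClusterIds)
            throws IOException {
        List<String> stale = new ArrayList<>();
        try (DirectoryStream<Path> children = Files.newDirectoryStream(storageDir)) {
            for (Path child : children) {
                String name = child.getFileName().toString();
                if (Files.isDirectory(child) && !activeClusterIds.contains(name)) {
                    stale.add(name);
                }
            }
        }
        Collections.sort(stale);
        return stale;
    }

    // Best-effort recursive delete: returns false instead of throwing
    // when any file or directory cannot be removed.
    static boolean deleteQuietly(Path dir) {
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order visits children before their parent directory.
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            return true;
        } catch (IOException | UncheckedIOException e) {
            return false;
        }
    }
}
```

A tool built this way would call `findStaleClusterDirs` with the ids of the applications that are still running and then pass each stale directory to `deleteQuietly`. This only works once every artifact lives under its own cluster-id subdirectory, which is the point of the proposal.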

      Current HA storage directory structure

      └── <high-availability.storageDir>
          ├── submittedJobGraph
          │   ├── <jobgraph1> (randomly named)
          │   └── <jobgraph2> (randomly named)
          ├── completedCheckpoint
          │   ├── <checkpoint1> (randomly named)
          │   ├── <checkpoint2> (randomly named)
          │   └── <checkpoint3> (randomly named)
          └── <high-availability.cluster-id>
              └── blob
                  └── <blob1> (named as [no_job|job_<job-id>]/blob_<blob-key>)
      

       

      New HA storage directory structure

      └── <high-availability.storageDir>
          └── <high-availability.cluster-id>
              ├── submittedJobGraph
              │   ├── <jobgraph1> (randomly named)
              │   └── <jobgraph2> (randomly named)
              ├── completedCheckpoint
              │   ├── <checkpoint1> (randomly named)
              │   ├── <checkpoint2> (randomly named)
              │   └── <checkpoint3> (randomly named)
              └── blob
                  └── <blob1> (named as [no_job|job_<job-id>]/blob_<blob-key>)
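The new layout can be read as a simple path rule: every artifact directory is `<high-availability.storageDir>/<high-availability.cluster-id>/<artifact>`. The sketch below shows that rule in Java. The class `HaStorageLayout`, its methods, and the sample values are hypothetical illustrations. This is not how Flink resolves these paths internally.

```java
// Hypothetical illustration of the proposed layout; not Flink code.
final class HaStorageLayout {
    private HaStorageLayout() {}

    // Joins storage dir and cluster id, tolerating a trailing slash on the
    // storage dir and leading or trailing slashes on the cluster id.
    static String clusterDir(String storageDir, String clusterId) {
        String base = storageDir.replaceAll("/+$", "");
        String id = clusterId.replaceAll("^/+", "").replaceAll("/+$", "");
        return base + "/" + id;
    }

    // Directory for one artifact type, e.g. "submittedJobGraph",
    // "completedCheckpoint" or "blob", under the cluster's subdirectory.
    static String artifactDir(String storageDir, String clusterId, String artifact) {
        return clusterDir(storageDir, clusterId) + "/" + artifact;
    }
}
```

For example, with the assumed values `hdfs:///flink/ha` and `application_1`, `artifactDir(..., "submittedJobGraph")` yields `hdfs:///flink/ha/application_1/submittedJobGraph`, so everything belonging to one cluster can be removed by deleting a single directory.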


            People

            • Assignee:
              fly_in_gis Yang Wang
              Reporter:
              fly_in_gis Yang Wang


                Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 10m