FLINK-13633

Move submittedJobGraph and completedCheckpoint to cluster-id subdirectory of high-availability storage


Details

    • All highly available artifacts stored by Apache Flink are now stored under `HA_STORAGE_DIR/HA_CLUSTER_ID`, where `HA_STORAGE_DIR` is configured by `high-availability.storageDir` and `HA_CLUSTER_ID` is configured by `high-availability.cluster-id`.
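
As a rough illustration of the path rule above, the following Python sketch joins the two configured values into the per-cluster directory. The function name `ha_cluster_dir` and the sample values are hypothetical and chosen only for illustration; this is not Flink's own code, just a minimal sketch of the naming scheme.

```python
def ha_cluster_dir(storage_dir: str, cluster_id: str) -> str:
    """Join the HA storage directory and the cluster id into one path.

    storage_dir stands in for the value of high-availability.storageDir,
    cluster_id for the value of high-availability.cluster-id.
    """
    # The cluster id is used as a single directory name, so reject
    # empty values and values that would span several path segments.
    if not cluster_id or "/" in cluster_id:
        raise ValueError("cluster_id must be a single, non-empty path segment")
    # Drop trailing slashes so the result never contains "//" at the join.
    return storage_dir.rstrip("/") + "/" + cluster_id


# Hypothetical values standing in for the two configuration options.
print(ha_cluster_dir("hdfs:///flink/ha/", "application_0001"))
```

With these sample values the sketch prints `hdfs:///flink/ha/application_0001`.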

    Description

      Currently, when high availability is enabled, the HA storage directory is laid out as shown below. The submittedJobGraph and completedCheckpoint directories sit directly under the HA storage path. This is fine when the Flink cluster finishes normally. However, when the Yarn application fails or is killed, the submittedJobGraph and completedCheckpoint files remain there forever, and we cannot even tell which Flink cluster (Yarn application) they belong to. So I suggest moving them into a per-application subdirectory. External tools could then be used to clean up these residual files.

      Also, we need to do a best-effort clean-up before the Flink cluster finishes.

      Current HA storage directory structure

      └── <high-availability.storageDir>
          ├── submittedJobGraph
          │   ├── <jobgraph1> (randomly named)
          │   └── <jobgraph2> (randomly named)
          ├── completedCheckpoint
          │   ├── <checkpoint1> (randomly named)
          │   ├── <checkpoint2> (randomly named)
          │   └── <checkpoint3> (randomly named)
          └── <high-availability.cluster-id>
              └── blob
                  └── <blob1> (named as [no_job|job_<job-id>]/blob_<blob-key>)

      New HA storage directory structure

      └── <high-availability.storageDir>
          └── <high-availability.cluster-id>
              ├── submittedJobGraph
              │   ├── <jobgraph1> (randomly named)
              │   └── <jobgraph2> (randomly named)
              ├── completedCheckpoint
              │   ├── <checkpoint1> (randomly named)
              │   ├── <checkpoint2> (randomly named)
              │   └── <checkpoint3> (randomly named)
              └── blob
                  └── <blob1> (named as [no_job|job_<job-id>]/blob_<blob-key>)
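
Once every artifact lives under a cluster-id subdirectory, an external clean-up tool of the kind mentioned above becomes straightforward. The Python sketch below is one possible shape for such a tool, not something shipped with Flink. It assumes the HA storage directory is reachable as a local or mounted filesystem path, and that the caller already has the set of cluster ids that are still alive (for example, collected from Yarn by some other means). The names `find_stale_cluster_dirs` and `remove_stale_cluster_dirs` are hypothetical.

```python
import shutil
from pathlib import Path


def find_stale_cluster_dirs(storage_dir, active_cluster_ids):
    """Return cluster-id subdirectories that belong to no active cluster."""
    root = Path(storage_dir)
    # Under the new layout, each direct child directory of the HA storage
    # directory is named after one high-availability.cluster-id.
    return sorted(
        child
        for child in root.iterdir()
        if child.is_dir() and child.name not in active_cluster_ids
    )


def remove_stale_cluster_dirs(storage_dir, active_cluster_ids, dry_run=True):
    """Delete stale cluster directories; with dry_run=True only report them."""
    stale = find_stale_cluster_dirs(storage_dir, active_cluster_ids)
    if not dry_run:
        for path in stale:
            # Removes submittedJobGraph, completedCheckpoint and blob together.
            shutil.rmtree(path)
    return stale
```

A call such as `remove_stale_cluster_dirs("/mnt/flink-ha", {"application_0002"})` would only list the candidates, because `dry_run` defaults to `True`. Note that this sketch treats every top-level directory as a cluster directory, so on storage that still holds the old layout it would also flag the top-level submittedJobGraph and completedCheckpoint directories.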

          People

            wangyang0918 Yang Wang