Flink / FLINK-12296

Silent data loss in RocksDBStateBackend when more than one stateful operator is chained in a single task


Description

      As discussed on the mailing list[1], when more than one operator is chained in a single task and all of those operators have state, we run into a silent data loss problem.

      Currently, the local directory we use looks like this:

      ../local_state_root_1/allocation_id/job_id/vertex_id_subtask_idx/chk_1/(state),

       

      If more than one operator is chained in a single task and all of the operators have state, then all of the operators share the same local directory (because the vertex_id is the same), and this leads to silent data loss.

       

      The path-generation logic is as follows:

      // LocalRecoveryDirectoryProviderImpl.java
      
      @Override
      public File subtaskSpecificCheckpointDirectory(long checkpointId) {
         return new File(subtaskBaseDirectory(checkpointId), checkpointDirString(checkpointId));
      }
      
      
      @VisibleForTesting
      String subtaskDirString() {
         // Note: the path contains only the job id, vertex id and subtask index.
         // There is no per-operator component, so chained operators that run in
         // the same subtask all resolve to the same directory.
         return Paths.get("jid_" + jobID, "vtx_" + jobVertexID + "_sti_" + subtaskIndex).toString();
      }
      
      @VisibleForTesting
      String checkpointDirString(long checkpointId) {
         return "chk_" + checkpointId;
      }
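
      The collision can be sketched with a small self-contained example. This is not the Flink class itself, just a hypothetical re-creation of the path logic above (method and variable names are illustrative): two chained operators in the same task share the job id, vertex id and subtask index, so they compute the identical directory.

```java
import java.nio.file.Paths;

public class ChainedStateDirCollision {

    // Illustrative re-implementation of the path scheme shown above.
    // The inputs are everything that distinguishes the directory -- note
    // that no operator id appears among them.
    static String subtaskCheckpointDir(
            String jobId, String vertexId, int subtaskIndex, long checkpointId) {
        return Paths.get(
                "jid_" + jobId,
                "vtx_" + vertexId + "_sti_" + subtaskIndex,
                "chk_" + checkpointId).toString();
    }

    public static void main(String[] args) {
        // Two chained stateful operators running in the SAME subtask:
        String dirForOperatorA = subtaskCheckpointDir("job1", "vertexX", 0, 1L);
        String dirForOperatorB = subtaskCheckpointDir("job1", "vertexX", 0, 1L);

        // Both operators resolve to the same local-state directory,
        // so one operator's files can clobber the other's.
        System.out.println(dirForOperatorA.equals(dirForOperatorB)); // prints: true
        System.out.println(dirForOperatorA);
    }
}
```

      Because the second operator writes into (and cleans up) the same chk_x directory as the first, state files can be overwritten or deleted without any error being raised, which matches the silent-loss symptom reported on the mailing list.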
      

      [1] http://mail-archives.apache.org/mod_mbox/flink-user/201904.mbox/%3Cm2ef5tpfwy.wl-ningshi2@gmail.com%3E

      People

        Assignee: Congxian Qiu (klion26)
        Reporter: Congxian Qiu (klion26)
        Votes: 3
        Watchers: 15
