FLINK-12296

Silent data loss in RocksDBStateBackend when more than one stateful operator is chained in a single task

      Description

      As reported on the mailing list [1], when more than one operator is chained in a single task and all of those operators have state, data can be lost silently.

      Currently, the local directory we use looks like this:

      ../local_state_root_1/allocation_id/job_id/vertex_id_subtask_idx/chk_1/(state),

       

      If more than one operator is chained in a single task and all of the operators have state, they all share the same local directory (because the vertex_id is the same), which leads to data loss.

       

      The path generation logic is below:

      // LocalRecoveryDirectoryProviderImpl.java
      
      @Override
      public File subtaskSpecificCheckpointDirectory(long checkpointId) {
         // <subtask base directory>/chk_<checkpointId>
         return new File(subtaskBaseDirectory(checkpointId), checkpointDirString(checkpointId));
      }
      
      
      @VisibleForTesting
      String subtaskDirString() {
         // Built only from the job ID, job vertex ID, and subtask index: nothing here
         // distinguishes the individual operators chained into the same task.
         return Paths.get("jid_" + jobID, "vtx_" + jobVertexID + "_sti_" + subtaskIndex).toString();
      }
      
      @VisibleForTesting
      String checkpointDirString(long checkpointId) {
         return "chk_" + checkpointId;
      }
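
      To make the collision concrete, here is a minimal standalone sketch. It copies the string-building logic from the excerpt above but is not Flink code: the class name LocalDirCollisionSketch, the helper checkpointDir, and all argument values are hypothetical, and the fields jobID, jobVertexID, and subtaskIndex are passed as plain parameters. It only illustrates that two operators chained into the same task resolve to the same directory for a given checkpoint.

```java
import java.io.File;
import java.nio.file.Paths;

// Hypothetical standalone sketch; not part of Flink.
public class LocalDirCollisionSketch {

    // Same string logic as subtaskDirString() in the excerpt, with the fields
    // turned into parameters.
    static String subtaskDirString(String jobId, String jobVertexId, int subtaskIndex) {
        return Paths.get("jid_" + jobId, "vtx_" + jobVertexId + "_sti_" + subtaskIndex).toString();
    }

    // Same string logic as checkpointDirString() in the excerpt.
    static String checkpointDirString(long checkpointId) {
        return "chk_" + checkpointId;
    }

    // Hypothetical helper: joins an assumed allocation base directory with the
    // subtask and checkpoint components.
    static File checkpointDir(
            String allocationBase, String jobId, String jobVertexId, int subtaskIndex, long checkpointId) {
        File subtaskBase = new File(allocationBase, subtaskDirString(jobId, jobVertexId, subtaskIndex));
        return new File(subtaskBase, checkpointDirString(checkpointId));
    }

    public static void main(String[] args) {
        String base = "local_state_root_1/allocation_id"; // placeholder values
        // Two stateful operators chained into one task share the job ID,
        // the job vertex ID, and the subtask index.
        File dirOfOperatorA = checkpointDir(base, "job", "vertex", 0, 1L);
        File dirOfOperatorB = checkpointDir(base, "job", "vertex", 0, 1L);

        System.out.println(dirOfOperatorA);
        // The inputs are identical, so both operators get the same directory.
        System.out.println("same directory: " + dirOfOperatorA.equals(dirOfOperatorB));
    }
}
```

      Since no input to the path differs between chained operators, the two File objects are equal, which matches the shared-directory behaviour described above.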
      

      [1] http://mail-archives.apache.org/mod_mbox/flink-user/201904.mbox/%3Cm2ef5tpfwy.wl-ningshi2@gmail.com%3E

              People

              • Assignee: Congxian Qiu (klion26)
              • Reporter: Congxian Qiu (klion26)
              • Votes: 3
              • Watchers: 15

                  Time Tracking

                  • Estimated: Not Specified
                  • Remaining: 0h
                  • Logged: 20m