  1. Flink
  2. FLINK-10644 Batch Job: Speculative execution
  3. FLINK-11309

Make SpillableSubpartition repeatably read to enable

    Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.6.2, 1.7.0, 1.7.1
    • Fix Version/s: None
    • Component/s: Runtime / Task

      Description

      Hi all,

      When running the batch WordCount example, I configured the job execution mode as BATCH_FORCED and the failover-strategy as region, then manually injected errors to make the execution fail in different phases. In some cases the job recovered from the failure and succeeded, but in other cases it retried several times and then failed.

      Example:

      1. If the failure occurred before the task read any data, e.g., before invokable.invoke() in Task.java, failover succeeded.
      2. If the failure occurred after the task had read data, failover did not work.

       

      Problem diagnosis:

      In the example described above, each ExecutionVertex is defined as a restart region, and the ResultPartitionType between executions is BLOCKING. Thus, SpillableSubpartition and SpillableSubpartitionView are used to write and read shuffle data. Each data block is represented as a BufferConsumer stored in a list called buffers. When a task requests input data from the SpillableSubpartitionView, the BufferConsumer is REMOVED from buffers. As a result, when a failure occurs after data has been read, some BufferConsumers have already been released, so the input data is incomplete even though the task is retried.
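
      The effect of removing buffers on read can be reproduced with a small stand-alone model. This is only a sketch under assumptions: ToySubpartition, pollNext and consume are hypothetical names invented for illustration, strings stand in for BufferConsumers, and none of this is Flink's actual code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model of a subpartition whose reads remove buffers (hypothetical, not Flink code). */
public class DestructiveReadDemo {

    /** Stand-in for SpillableSubpartition: holds buffers and hands each one out only once. */
    static final class ToySubpartition {
        private final Deque<String> buffers = new ArrayDeque<>();

        void add(String buffer) {
            buffers.addLast(buffer);
        }

        /** Returns the next buffer and removes it, or null when none is left. */
        String pollNext() {
            return buffers.pollFirst();
        }
    }

    /** Reads at most maxBuffers buffers, simulating a consumer attempt that may stop early. */
    static List<String> consume(ToySubpartition partition, int maxBuffers) {
        List<String> read = new ArrayList<>();
        String next;
        while (read.size() < maxBuffers && (next = partition.pollNext()) != null) {
            read.add(next);
        }
        return read;
    }

    public static void main(String[] args) {
        ToySubpartition partition = new ToySubpartition();
        partition.add("b0");
        partition.add("b1");
        partition.add("b2");

        // The first attempt "fails" after reading two buffers.
        List<String> firstAttempt = consume(partition, 2);
        // The retry only sees what is left, so its input is incomplete.
        List<String> retry = consume(partition, Integer.MAX_VALUE);

        System.out.println("first attempt read: " + firstAttempt);
        System.out.println("retry read: " + retry);
    }
}
```

      In this model the retry never sees b0 and b1, which mirrors the incomplete input observed after a failover.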

       

      Fix Proposal:

      1. BufferConsumer should not be removed from buffers until ExecutionVertex terminates.
      2. SpillableSubpartition should not be released until ExecutionVertex terminates.
      3. SpillableSubpartition could create multiple SpillableSubpartitionViews, each of which corresponds to an Execution.
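
      The three points of the proposal can be sketched in plain Java as follows. This is an illustrative model only, not the actual patch: RepeatableSubpartition, View, createView and onVertexTerminated are hypothetical names, the attempt number is an assumed way to identify an Execution, and strings again stand in for BufferConsumers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a repeatably readable subpartition (hypothetical, not Flink code). */
public class RepeatableReadDemo {

    /** Keeps all buffers until the consuming vertex terminates. */
    static final class RepeatableSubpartition {
        private final List<String> buffers = new ArrayList<>();
        private final Map<Integer, View> viewsByAttempt = new HashMap<>();
        private boolean released;

        void add(String buffer) {
            buffers.add(buffer);
        }

        /** Creates one independent view per execution attempt (proposal item 3). */
        View createView(int attemptNumber) {
            if (released) {
                throw new IllegalStateException("subpartition already released");
            }
            View view = new View(this);
            viewsByAttempt.put(attemptNumber, view);
            return view;
        }

        /** Releases the data only once the vertex has terminated (items 1 and 2). */
        void onVertexTerminated() {
            buffers.clear();
            viewsByAttempt.clear();
            released = true;
        }

        boolean isReleased() {
            return released;
        }
    }

    /** Reads through its own cursor, so reading never removes buffers. */
    static final class View {
        private final RepeatableSubpartition parent;
        private int position;

        View(RepeatableSubpartition parent) {
            this.parent = parent;
        }

        /** Returns the next buffer for this view, or null at the end or after release. */
        String getNext() {
            if (parent.released || position >= parent.buffers.size()) {
                return null;
            }
            return parent.buffers.get(position++);
        }
    }

    /** Drains a view completely and returns everything it delivered. */
    static List<String> readAll(View view) {
        List<String> read = new ArrayList<>();
        String next;
        while ((next = view.getNext()) != null) {
            read.add(next);
        }
        return read;
    }

    public static void main(String[] args) {
        RepeatableSubpartition partition = new RepeatableSubpartition();
        partition.add("b0");
        partition.add("b1");
        partition.add("b2");

        // Attempt 0 reads two buffers and then "fails".
        View attempt0 = partition.createView(0);
        attempt0.getNext();
        attempt0.getNext();

        // Attempt 1 gets a fresh view and still sees the complete input.
        List<String> retry = readAll(partition.createView(1));
        System.out.println("retry read: " + retry);

        partition.onVertexTerminated();
        System.out.println("released: " + partition.isReleased());
    }
}
```

      Giving each view its own cursor keeps reads non-destructive, at the cost of holding every buffer until the vertex terminates.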

       Design doc: https://docs.google.com/document/d/1uXuJFiKODf241CKci3b0JnaF3zQ-Wt0V9wmC7kYwX-M/edit?usp=sharing

              People

              • Assignee:
                eaglewatcher BoWang
                Reporter:
                eaglewatcher BoWang
              • Votes:
                1
                Watchers:
                4

                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h