Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-4256 Fine-grained recovery
  3. FLINK-12131

Resetting ExecutionVertex in region failover may cause inconsistency of IntermediateResult status

    XMLWordPrintableJSON

    Details

      Description

      Two status may not be correct with region failover and current reset logic.

      1. numberOfRunningProducers in IntermediateResult.
      2. hasDataProduced in IntermediateResultPartition.

      This is because currently only when the ExecutionJobVertex is reset will the related IntermediateResult(and the inner IntermediateResultPartition) get reset. But region failover only resets the affected ExecutionVertex(es),  rather than the entire ExecutionJobVertex, leaving the status listed above in an inconsistent state.

      Problems below may occur as a result:

      1. when a FINISHED vertex is restarted and finishes again, the IntermediateResult.numberOfRunningProducers may drop below 0 and throws exception to trigger global failover
      2. the IntermediateResult.numberOfRunningProducers can be smaller than fact, letting the downstream vertices scheduled earlier than expected
      3. the IntermediateResultPartition is reset and not started yet but the hasDataProduced remains true

      That's why I'd propose we add IntermediateResult status adjust logic to ExecutionVertex.resetForNewExecution()**.

      Detailed design: https://docs.google.com/document/d/1YA3k8rwDEv1UdaV9NwoDmwc-XorG__JUXlpyJtDs4Ss/edit?usp=sharing 

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                zhuzh Zhu Zhu
                Reporter:
                zhuzh Zhu Zhu
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 20m
                  20m