Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-4256 Fine-grained recovery
  3. FLINK-12131

Resetting ExecutionVertex in region failover may cause inconsistency of IntermediateResult status

    XMLWordPrintableJSON

Details

    Description

      Two status may not be correct with region failover and current reset logic.

      1. numberOfRunningProducers in IntermediateResult.
      2. hasDataProduced in IntermediateResultPartition.

      This is because currently only when the ExecutionJobVertex is reset will the related IntermediateResult(and the inner IntermediateResultPartition) get reset. But region failover only resets the affected ExecutionVertex(es),  rather than the entire ExecutionJobVertex, leaving the status listed above in an inconsistent state.

      Problems below may occur as a result:

      1. when a FINISHED vertex is restarted and finishes again, the IntermediateResult.numberOfRunningProducers may drop below 0 and throws exception to trigger global failover
      2. the IntermediateResult.numberOfRunningProducers can be smaller than fact, letting the downstream vertices scheduled earlier than expected
      3. the IntermediateResultPartition is reset and not started yet but the hasDataProduced remains true

      That's why I'd propose we add IntermediateResult status adjust logic to ExecutionVertex.resetForNewExecution()**.

      Detailed design: https://docs.google.com/document/d/1YA3k8rwDEv1UdaV9NwoDmwc-XorG__JUXlpyJtDs4Ss/edit?usp=sharing 

      Attachments

        Issue Links

          Activity

            People

              zhuzh Zhu Zhu
              zhuzh Zhu Zhu
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 20m
                  20m