[FLINK-12131] Resetting ExecutionVertex in region failover may cause inconsistency of IntermediateResult status - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.9.0
Fix Version/s: 1.9.0
Component/s: Runtime / Coordination
Labels:
- pull-request-available

Description

Two pieces of state may become incorrect under region failover with the current reset logic:

- numberOfRunningProducers in IntermediateResult
- hasDataProduced in IntermediateResultPartition

This is because the related IntermediateResult (and its inner IntermediateResultPartitions) is currently reset only when the whole ExecutionJobVertex is reset. Region failover, however, resets only the affected ExecutionVertex(es) rather than the entire ExecutionJobVertex, which leaves the state listed above inconsistent.

The following problems may occur as a result:

- When a FINISHED vertex is restarted and finishes again, IntermediateResult.numberOfRunningProducers may drop below 0, which throws an exception and triggers a global failover.
- IntermediateResult.numberOfRunningProducers can be smaller than the actual number of running producers, so downstream vertices may be scheduled earlier than expected.
- An IntermediateResultPartition that has been reset and not yet restarted may still report hasDataProduced as true.
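
The first two problems can be reproduced with a toy model of the counter. This is only a sketch under stated assumptions: `ToyIntermediateResult`, `producerFinished()` and the demo class are hypothetical stand-ins invented for illustration, not Flink's actual classes or signatures.

```java
// Toy model for illustration only; these are NOT Flink's real classes.
final class ToyIntermediateResult {
    private final int numberOfPartitions;
    private int numberOfRunningProducers;

    ToyIntermediateResult(int numberOfPartitions) {
        this.numberOfPartitions = numberOfPartitions;
        this.numberOfRunningProducers = numberOfPartitions;
    }

    /** Records that one producer finished and returns the remaining count. */
    int producerFinished() {
        numberOfRunningProducers--;
        if (numberOfRunningProducers < 0) {
            throw new IllegalStateException("numberOfRunningProducers dropped below 0");
        }
        return numberOfRunningProducers;
    }

    /** Job-vertex-level reset: the only place this model restores the counter. */
    void resetForNewExecution() {
        numberOfRunningProducers = numberOfPartitions;
    }
}

final class RegionFailoverCounterDemo {
    /** A FINISHED producer is restarted without a counter reset and finishes again. */
    static boolean refinishTripsCheck() {
        ToyIntermediateResult result = new ToyIntermediateResult(2);
        result.producerFinished(); // producer 0 finishes, 1 remaining
        result.producerFinished(); // producer 1 finishes, 0 remaining
        // Region failover restarts producer 0 only; nothing restores the counter.
        try {
            result.producerFinished(); // producer 0 finishes again
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    /** The counter reaches 0 although a restarted producer is still running. */
    static boolean consumersReleasedEarly() {
        ToyIntermediateResult result = new ToyIntermediateResult(2);
        result.producerFinished(); // producer 0 finishes, 1 remaining
        // Producer 0 is restarted by region failover; the counter stays at 1
        // even though two producers are now running again.
        int remaining = result.producerFinished(); // producer 1 finishes
        return remaining == 0; // consumers would look schedulable too early
    }

    public static void main(String[] args) {
        System.out.println("below-zero check tripped: " + refinishTripsCheck());
        System.out.println("consumers released early: " + consumersReleasedEarly());
    }
}
```

In both scenarios the counter is wrong only because the reset happens at job-vertex granularity while the restart happens at vertex granularity.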

I therefore propose adding logic to ExecutionVertex.resetForNewExecution() that adjusts the IntermediateResult status.
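
One possible shape of such a vertex-level adjustment is sketched below. It is a minimal illustration, not the change made in the linked pull request: `SketchResult`, `SketchPartition`, the `finished` flag and all method names are assumptions introduced here, and the real design is described in the linked document.

```java
// Hypothetical, simplified stand-ins; not Flink's real classes or signatures.
final class SketchResult {
    private final int numberOfPartitions;
    private int numberOfRunningProducers;

    SketchResult(int numberOfPartitions) {
        this.numberOfPartitions = numberOfPartitions;
        this.numberOfRunningProducers = numberOfPartitions;
    }

    int decrementRunningProducers() {
        if (numberOfRunningProducers == 0) {
            throw new IllegalStateException("counter would drop below 0");
        }
        return --numberOfRunningProducers;
    }

    void incrementRunningProducers() {
        if (numberOfRunningProducers == numberOfPartitions) {
            throw new IllegalStateException("counter would exceed partition count");
        }
        numberOfRunningProducers++;
    }

    int getNumberOfRunningProducers() {
        return numberOfRunningProducers;
    }
}

final class SketchPartition {
    private final SketchResult result;
    private boolean hasDataProduced;
    private boolean finished; // assumption: tracks whether this partition was counted as done

    SketchPartition(SketchResult result) {
        this.result = result;
    }

    boolean hasDataProduced() {
        return hasDataProduced;
    }

    /** Marks the partition finished; returns true if it was the last running producer. */
    boolean markFinished() {
        finished = true;
        hasDataProduced = true;
        return result.decrementRunningProducers() == 0;
    }

    /** Vertex-level reset: undoes this partition's contribution to the shared counter. */
    void resetForNewExecution() {
        if (finished) {
            result.incrementRunningProducers();
            finished = false;
        }
        hasDataProduced = false;
    }
}

final class VertexResetDemo {
    public static void main(String[] args) {
        SketchResult result = new SketchResult(2);
        SketchPartition p0 = new SketchPartition(result);
        SketchPartition p1 = new SketchPartition(result);
        p0.markFinished();
        p1.markFinished();
        p0.resetForNewExecution(); // region failover restarts only p0's producer
        System.out.println("running producers: " + result.getNumberOfRunningProducers());
        System.out.println("p0 hasDataProduced: " + p0.hasDataProduced());
    }
}
```

Because each partition restores only its own contribution, a partial restart keeps the shared counter equal to the number of producers that have not yet finished, and the restarted producer can finish again without tripping the below-zero check.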

Detailed design: https://docs.google.com/document/d/1YA3k8rwDEv1UdaV9NwoDmwc-XorG__JUXlpyJtDs4Ss/edit?usp=sharing

Issue Links

links to

GitHub Pull Request #8158

People

Assignee: Zhu Zhu

Reporter: Zhu Zhu

Votes: 0

Watchers: 3

Dates

Created: 09/Apr/19 04:53

Updated: 27/Apr/19 09:47

Resolved: 27/Apr/19 09:46

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 20m