[FLINK-14439] RestartPipelinedRegionStrategy leverage tracked partition availability for better failover experience in DefaultScheduler - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.10.0
Fix Version/s: 1.10.0
Component/s: Runtime / Coordination
Labels:
- pull-request-available

Description

In the current region failover with DefaultScheduler, the states of most input result partitions are unknown. Even when the failure cause is a PartitionException, only one unhealthy partition can be identified.

This may lead to multiple unsuccessful failovers before all the unhealthy but needed partitions are identified and their producers are included in the failover as well. (An unsuccessful failover here means that the recovered tasks fail again soon afterwards because some input partitions are missing.)

Using the partition states tracked on the JM side lets the region failover identify unhealthy (missing) partitions earlier, which helps in this case.

To achieve this, I propose the following:
1. Change FailoverStrategy.Factory#create(FailoverTopology) to FailoverStrategy.Factory#create(FailoverTopology, ResultPartitionAvailabilityChecker).
2. Add SchedulerBase#getResultPartitionAvailabilityChecker, which returns getExecutionGraph().getResultPartitionAvailabilityChecker().
3. In DefaultScheduler, use the ResultPartitionAvailabilityChecker from SchedulerBase to create the failover strategy from the factory.
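
The three steps above could be wired together roughly as in the following sketch. It is not Flink's actual code: every type below is a simplified stand-in that only borrows the names used in this issue, partitions and tasks are identified by plain strings, and `consumedPartitions`, `findUnavailableInputs` and `demo` are hypothetical helpers added for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FailoverWiringSketch {

    // Stand-in: answers whether a partition is still tracked as available on the JM side.
    interface ResultPartitionAvailabilityChecker {
        boolean isAvailable(String resultPartitionId);
    }

    // Stand-in topology: only exposes the partitions a task consumes (hypothetical method).
    interface FailoverTopology {
        List<String> consumedPartitions(String taskId);
    }

    // Stand-in strategy: reports the unavailable inputs of a failed task (hypothetical method).
    interface FailoverStrategy {
        List<String> findUnavailableInputs(String failedTaskId);

        // Step 1: the factory additionally receives the availability checker.
        interface Factory {
            FailoverStrategy create(FailoverTopology topology,
                                    ResultPartitionAvailabilityChecker checker);
        }
    }

    static class ExecutionGraph {
        private final Set<String> trackedPartitions;

        ExecutionGraph(Set<String> trackedPartitions) {
            this.trackedPartitions = trackedPartitions;
        }

        ResultPartitionAvailabilityChecker getResultPartitionAvailabilityChecker() {
            return trackedPartitions::contains;
        }
    }

    static class SchedulerBase {
        private final ExecutionGraph executionGraph;

        SchedulerBase(ExecutionGraph executionGraph) {
            this.executionGraph = executionGraph;
        }

        ExecutionGraph getExecutionGraph() {
            return executionGraph;
        }

        // Step 2: expose the execution graph's checker through the scheduler base class.
        ResultPartitionAvailabilityChecker getResultPartitionAvailabilityChecker() {
            return getExecutionGraph().getResultPartitionAvailabilityChecker();
        }
    }

    static class DefaultScheduler extends SchedulerBase {
        final FailoverStrategy failoverStrategy;

        DefaultScheduler(ExecutionGraph graph, FailoverTopology topology,
                         FailoverStrategy.Factory factory) {
            super(graph);
            // Step 3: create the strategy with the checker obtained from SchedulerBase.
            this.failoverStrategy =
                    factory.create(topology, getResultPartitionAvailabilityChecker());
        }
    }

    // Hypothetical scenario: "consumer" reads p1, p2 and p3, but only p1 is still tracked.
    static List<String> demo() {
        FailoverTopology topology = taskId -> Arrays.asList("p1", "p2", "p3");
        FailoverStrategy.Factory factory = (topo, checker) -> failedTaskId -> {
            List<String> unavailable = new ArrayList<>();
            for (String partition : topo.consumedPartitions(failedTaskId)) {
                if (!checker.isAvailable(partition)) {
                    unavailable.add(partition);
                }
            }
            return unavailable;
        };
        ExecutionGraph graph = new ExecutionGraph(new HashSet<>(Arrays.asList("p1")));
        DefaultScheduler scheduler = new DefaultScheduler(graph, topology, factory);
        return scheduler.failoverStrategy.findUnavailableInputs("consumer");
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this toy scenario the strategy reports both missing partitions, p2 and p3, in a single pass, instead of learning about them one PartitionException at a time.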

Without this, BatchFineGrainedRecoveryITCase also fails with DefaultScheduler due to unexpected failover counts. This is because the legacy scheduler already has a similar optimization from ~~FLINK-13055~~.

Issue Links

blocks

FLINK-14440 Enable BatchFineGrainedRecoveryITCase to pass with scheduler NG

Closed

links to

GitHub Pull Request #10043

People

Assignee: Zhu Zhu

Reporter: Zhu Zhu

Votes: 0

Watchers: 3

Dates

Created: 17/Oct/19 14:44

Updated: 31/Oct/19 16:28

Resolved: 31/Oct/19 16:28

Time Tracking

Estimated: Not Specified

Remaining:

Logged:
