[FLINK-31144] Slow scheduling on large-scale batch jobs - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.17.0, 1.15.3, 1.16.1
Fix Version/s: 1.17.0
Component/s: Runtime / Coordination
Labels:
- pull-request-available

Description

When executing a complex job graph at high parallelism, `DefaultPreferredLocationsRetriever.getPreferredLocationsBasedOnInputs` can become slow and cause long pauses during which the JobManager is unresponsive and all the TaskManagers just wait. I've attached a VisualVM snapshot (flink-1.17-snapshot-1676473798013.nps) to illustrate the problem.

At Spotify we have complex batch jobs where this issue can cause "pauses" of 40+ minutes and make the overall execution 30% slower or more.
More importantly, this prevents us from running these jobs on larger clusters, as adding resources to the cluster worsens the issue.

We have successfully tested a modified Flink version in which the body of `DefaultPreferredLocationsRetriever.getPreferredLocationsBasedOnInputs` is commented out so that it simply returns an empty collection, and confirmed that this solves the issue.

In the same spirit as a recent change (https://github.com/apache/flink/blob/43f419d0eccba86ecc8040fa6f521148f1e358ff/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultPreferredLocationsRetriever.java#L98-L102), there could be a mechanism to detect when Flink runs into this specific issue and skip the call to `getInputLocationFutures` (https://github.com/apache/flink/blob/43f419d0eccba86ecc8040fa6f521148f1e358ff/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultPreferredLocationsRetriever.java#L105-L108).

I'm not familiar enough with the internals of Flink to propose a more advanced fix. However, it seems that a configurable threshold on the number of consumer vertices, above which the preferred location is not computed, would do. If this solution is good enough, I'd be happy to submit a PR.
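
A minimal sketch of the threshold idea above, written as standalone Java rather than against Flink's actual scheduler classes. The class `ThresholdedLocationRetriever`, the parameter `maxConsumersForLocality`, and the `locationOf` function are hypothetical names used only for illustration; they are not Flink APIs or configuration options.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

/** Hypothetical illustration of a consumer-count threshold; not a Flink class. */
final class ThresholdedLocationRetriever {
    // Assumed config value: above this many consumer vertices, skip locality.
    private final int maxConsumersForLocality;

    ThresholdedLocationRetriever(int maxConsumersForLocality) {
        this.maxConsumersForLocality = maxConsumersForLocality;
    }

    /**
     * Returns the locations of the given producers, or an empty collection
     * (meaning "no preference") when the consumer count exceeds the threshold.
     */
    <P> Collection<String> preferredLocations(
            int consumerCount, List<P> producers, Function<P, String> locationOf) {
        if (consumerCount > maxConsumersForLocality) {
            // Too many consumers: skip the expensive per-producer lookup entirely.
            return Collections.emptyList();
        }
        List<String> locations = new ArrayList<>(producers.size());
        for (P producer : producers) {
            locations.add(locationOf.apply(producer));
        }
        return locations;
    }
}
```

Returning an empty collection mirrors the workaround described above: the scheduler is left free to place the task anywhere, trading input locality for scheduling time.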

Attachments

- flink-1.17-snapshot-1676473798013.nps (20/Feb/23 16:54, 50 kB, Julien Tournay)
- image-2023-02-21-10-29-49-388.png (21/Feb/23 09:29, 149 kB, Julien Tournay)
- Screenshot 2023-03-13 at 14.22.27.png (13/Mar/23 13:22, 235 kB, Martijn Visser)

Issue Links

links to

GitHub Pull Request #22098

Activity

People

Assignee: Junrui Li

Reporter: Julien Tournay

Votes: 0

Watchers: 8

Dates

Created: 20/Feb/23 16:59

Updated: 03/Apr/23 09:02

Resolved: 10/Mar/23 09:11