[FLINK-35926] During rescale, AdaptiveScheduler has incorrect judgment logic for the max parallelism. - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.19.1
Fix Version/s: None
Component/s: Runtime / Task
Labels:
None
Environment:

flink-1.19.1

There is a high probability that 1.18 has the same problem

Description

When I was using the adaptive scheduler and modified the task in parallel through the rest api, an incorrect decision logic occurred, causing the task to fail.

produce:

When I start a simple job with a parallelism of 128, the Max Parallelism of the job will be set to 256 (through flink's internal calculation logic). Then I make a savepoint on the job and modify the parallelism of the job to 1. Restore the job from the savepoint. At this time, the Max Parallelism of the job is still 256:

this is as expected, at this time I call the rest api to increase the parallelism to 129 (which is obviously reasonable, since it is < 256), but the task throws an exception after restarting:

At this time, when viewing the detailed information of the task, it is found that Max Parallelism has changed to 128:

This can be reproduced stably locally

Causes:

In AdaptiveScheduler we recalculate the job `VertexParallelismStore`,
This results in the job after restart having the wrong max parallelism.

, which seems to be related to ~~FLINK-21844~~ and ~~FLINK-22084~~ .

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

image-2024-07-30-14-56-48-931.png
30/Jul/24 06:56
96 kB
yuanfenghu
image-2024-07-30-14-59-26-976.png
30/Jul/24 06:59
89 kB
yuanfenghu
image-2024-07-30-15-00-28-491.png
30/Jul/24 07:00
181 kB
yuanfenghu

Activity

People

Assignee:: Unassigned

Reporter:: yuanfenghu

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 30/Jul/24 08:18

Updated:: 30/Jul/24 08:31