Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Description
Core idea: Make scaling up responsive to prevent lag, and make scaling down conservative to reduce restart frequency.
Background & Motivation
We enabled the autoscaler for a few Flink production jobs. It works with the Adaptive Scheduler and the Rescale API.
Scaling results:
- The recommended parallelism meets expectations most of the time
- When the source traffic increases, the autoscaler scales up the job in time to prevent lags.
- When the source traffic decreases, the autoscaler scales down the job in time to save resources.
- Pain point: Each job rescales more than 20 times a day (job.autoscaler.metrics.window=15 min by default).
The job is unavailable for a while during each restart, for several reasons:
- Cancel job
- Request resources (FLIP-472 is optimizing this)
- Initialize task
- Restore state
- Catch up lag during restart
- etc
Expectations:
- Scaling up in time to prevent lags.
- Lazy scaling down to reduce downtime and ensure resources can be released later.
Solution:
- Introduce job.autoscaler.scale-down.interval, the default value could be 1 hour.
- Replace job.autoscaler.scale-up.grace-period with job.autoscaler.scale-down.interval
Detailed strategies:
- Record the start time of the first scale-down event for each vertex separately. For example:
  - vertex1: 2024-08-09 01:35:02
  - vertex2: 2024-08-09 01:38:02
- Scaling down is triggered in either of these cases:
  - Any vertex needs to scale up.
    - The job restart cannot be avoided, so scale down other vertices as well if needed.
    - After scaling down, clear the recorded scale-down start times.
  - The scale-down lazy period has elapsed for any vertex.
    - current time - min(start time for each vertex) > job.autoscaler.scale-down.interval (the lazy period)
    - This means that no scale-up happened during the scale-down lazy period.
Note1: If the recommended parallelism >= current parallelism, the scale-down start time is cleared for the current vertex.
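The strategies above, together with Note1, can be sketched as a small decision helper. This is only an illustration in Python, not the autoscaler's actual implementation: the class name `DelayedScaleDownTracker`, the method `should_rescale`, and the dictionary-based inputs are all hypothetical, and the sketch assumes a positive scale-down interval.

```python
from datetime import datetime, timedelta


class DelayedScaleDownTracker:
    """Hypothetical helper, not the Flink autoscaler's real class."""

    def __init__(self, scale_down_interval):
        self.interval = scale_down_interval  # assumed to be a positive timedelta
        self.first_request = {}  # vertex id -> time of first scale-down request

    def should_rescale(self, now, current, recommended):
        """Return True if the job should be rescaled at `now`.

        `current` and `recommended` map vertex id -> parallelism.
        """
        any_scale_up = False
        for vertex, cur in current.items():
            rec = recommended[vertex]
            if rec < cur:
                # Record only the first scale-down request; later ones keep it.
                self.first_request.setdefault(vertex, now)
            else:
                # recommended >= current: forget the pending scale-down.
                self.first_request.pop(vertex, None)
                any_scale_up = any_scale_up or rec > cur

        if any_scale_up:
            # A restart is unavoidable, so pending scale-downs ride along.
            trigger = True
        elif self.first_request:
            oldest = min(self.first_request.values())
            trigger = now - oldest > self.interval
        else:
            trigger = False

        if trigger:
            # After rescaling, clear all recorded scale-down start times.
            self.first_request.clear()
        return trigger
```

Calling `should_rescale` once per evaluation cycle keeps the earliest request time per vertex, so a vertex that keeps asking for a lower parallelism does not postpone its own scale-down.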
Note2: The recommended parallelism still comes from the latest 15-minute metrics. For example:
- The current parallelism of vertex1 is 100, and the traffic decreases at night.
- 2024-08-09 01:00:00: the recommended parallelism is 60.
  - The scale-down start time is 2024-08-09 01:00:00.
- 2024-08-09 01:15:00: the recommended parallelism is 50.
  - Still within the scale-down lazy period.
  - Don't update the scale-down start time.
- 2024-08-09 01:31:00: the recommended parallelism is 40.
  - Outside the scale-down lazy period (job.autoscaler.scale-down.interval), so trigger a rescale and use 40 as the recommended parallelism.
  - job.autoscaler.metrics.window is 15 min, so the metrics from 2024-08-09 01:16:00 to 2024-08-09 01:31:00 are used.
Note3: If users set job.autoscaler.scale-down.interval <= 0, the job scales down immediately.
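The Note2 timeline and the Note3 shortcut can be replayed for a single vertex with a short Python sketch. The function `replay_scale_down` and its arguments are hypothetical names used only for illustration. The example timestamps imply a lazy period longer than 15 minutes and shorter than 31 minutes, so 30 minutes is assumed here rather than the proposed 1-hour default. Scale-up handling is left out to keep the focus on the delayed scale-down.

```python
from datetime import datetime, timedelta


def replay_scale_down(current, recommendations, interval):
    """Return (time, parallelism) of the first triggered scale-down, or None.

    `recommendations` is a time-ordered list of (timestamp, recommended
    parallelism) pairs for one vertex whose parallelism is `current`.
    """
    first_request = None
    for now, recommended in recommendations:
        if recommended >= current:
            # Note1: no scale-down wanted any more, forget the start time.
            first_request = None
            continue
        if interval <= timedelta(0):
            # Note3: a non-positive interval means scale down right away.
            return now, recommended
        if first_request is None:
            first_request = now  # start of the lazy period
        elif now - first_request > interval:
            # Lazy period elapsed: use the latest recommendation, not the first.
            return now, recommended
    return None


timeline = [
    (datetime(2024, 8, 9, 1, 0), 60),
    (datetime(2024, 8, 9, 1, 15), 50),
    (datetime(2024, 8, 9, 1, 31), 40),
]

# Assumed 30-minute lazy period: triggers at 01:31 with parallelism 40.
delayed = replay_scale_down(100, timeline, timedelta(minutes=30))
# Non-positive interval: triggers at 01:00 with parallelism 60.
immediate = replay_scale_down(100, timeline, timedelta(0))
```

With the proposed 1-hour default, none of the three evaluations in this timeline would trigger a rescale, because the lazy period would still be running at 01:31.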