Flink / FLINK-36018

Support lazy scale down to avoid frequent rescaling

Details

    Description

      Core idea: Make scaling up responsive to prevent lag, and make scaling down conservative to reduce restart frequency.

      Background & Motivation

      We enabled the autoscaler for a few Flink production jobs. It works with the Adaptive Scheduler and the rescale API.

      Scaling results:

      • The recommended parallelism meets expectations most of the time
      • When the source traffic increases, the autoscaler scales up the job in time to prevent lag.
      • When the source traffic decreases, the autoscaler scales down the job in time to save resources.
      • Pain point: Each job rescales more than 20 times a day (job.autoscaler.metrics.window=15 min by default).

      The job is unavailable for a while during each restart, for several reasons:

      • Cancel job
      • Request resources (FLIP-472 is optimizing this)
      • Initialize task
      • Restore state
      • Catch up lag during restart
      • etc

      Expectations:

      • Scale up promptly to prevent lag.
      • Scale down lazily to reduce downtime, while ensuring that resources are still released later.

      Solution:

      • Introduce job.autoscaler.scale-down.interval; the default value could be 1 hour.
      • Replace job.autoscaler.scale-up.grace-period with job.autoscaler.scale-down.interval.

      Detailed strategies:

      • Record the start time of the first scale-down event for each vertex separately. For example:
        • vertex1: 2024-08-09 01:35:02
        • vertex2: 2024-08-09 01:38:02
      • Scaling down is triggered in either of the following cases:
        • Any vertex needs to scale up
          • The job restart cannot be avoided, so also scale down any other vertex that needs it.
          • After the scale-down, clean up the recorded scale-down start times.
        • The scale-down interval has elapsed for any vertex
          • current time - min(start time for each vertex) > job.autoscaler.scale-down.interval
          • This means that no scale-up happened during the scale-down interval.

      Note1: If the recommended parallelism >= current parallelism, the scale-down start time is cleaned up for that vertex.
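
      The strategy and Note1 above can be sketched as follows. This is only a minimal illustration, not the autoscaler's actual implementation: the class name DelayedScaleDown, the decide method, and the dictionary-based inputs are all hypothetical names chosen for this example.

```python
from datetime import datetime, timedelta


class DelayedScaleDown:
    """Hypothetical tracker for illustration; not the autoscaler's real class."""

    def __init__(self, interval: timedelta):
        self.interval = interval
        # vertex id -> time at which a pending scale-down was first observed
        self.first_scale_down = {}

    def decide(self, now, current, recommended):
        """Return {vertex: new_parallelism} to apply now, or {} to keep waiting.

        `current` and `recommended` map vertex id -> parallelism.
        """
        scale_up = False
        for vertex, cur in current.items():
            rec = recommended[vertex]
            if rec < cur:
                # Record only the first scale-down observation per vertex.
                self.first_scale_down.setdefault(vertex, now)
            else:
                # Note1: the recommendation is no longer below the current value.
                self.first_scale_down.pop(vertex, None)
                scale_up = scale_up or rec > cur

        pending = self.first_scale_down
        interval_elapsed = bool(pending) and (
            now - min(pending.values()) > self.interval
        )
        if not (scale_up or interval_elapsed):
            return {}

        # A restart happens anyway, so apply every pending change together.
        changes = {v: recommended[v] for v, cur in current.items()
                   if recommended[v] != cur}
        self.first_scale_down.clear()
        return changes


tracker = DelayedScaleDown(timedelta(hours=1))
current = {"vertex1": 100, "vertex2": 50}
t0 = datetime(2024, 8, 9, 1, 35, 2)
# Only a scale-down is recommended: nothing is applied yet.
waiting = tracker.decide(t0, current, {"vertex1": 60, "vertex2": 50})
# vertex2 now needs to scale up, so vertex1 is scaled down in the same restart.
applied = tracker.decide(t0 + timedelta(minutes=3), current,
                         {"vertex1": 60, "vertex2": 80})
```

      In this sketch the first call returns an empty dict and the second returns the changes for both vertices, after which the recorded start times are cleared.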

      Note2: The recommended parallelism still comes from the latest 15-minute metrics. For example:

      • The current parallelism of vertex1 is 100, and the traffic decreases at night.
      • 2024-08-09 01:00:00, the recommended parallelism is 60.
        • The start time of scale down is 2024-08-09 01:00:00.
      • 2024-08-09 01:15:00, the recommended parallelism is 50.
        • Still within the scale-down interval (the timestamps in this example imply an interval of about 30 minutes).
        • Don't update the start time of scale down.
      • 2024-08-09 01:31:00, the recommended parallelism is 40.
        • Outside of the scale-down interval, so trigger the rescale and use 40 as the recommended parallelism.
        • The job.autoscaler.metrics.window is 15 min, so the metrics from 2024-08-09 01:16:00 to 2024-08-09 01:31:00 are used.

      Note3: If users set job.autoscaler.scale-down.interval <= 0, the scale-down is applied immediately.
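
      The timeline in Note2 and the shortcut in Note3 can be replayed with a small single-vertex sketch. The replay function is a hypothetical helper that models only the scale-down path, and the 30-minute interval is an assumption inferred from the timestamps in the example (the proposed default is 1 hour).

```python
from datetime import datetime, timedelta


def replay(current, recommendations, interval):
    """Return (time, parallelism) of the first triggered scale-down, else None.

    `recommendations` is a time-ordered list of (timestamp, parallelism).
    """
    start = None  # time at which the pending scale-down was first observed
    for now, recommended in recommendations:
        if recommended >= current:
            start = None  # Note1: nothing to scale down any more
            continue
        if interval <= timedelta(0):
            return now, recommended  # Note3: no delay configured
        if start is None:
            start = now  # set once, not refreshed by later recommendations
        elif now - start > interval:
            return now, recommended  # the latest recommendation is used
    return None


timeline = [
    (datetime(2024, 8, 9, 1, 0), 60),
    (datetime(2024, 8, 9, 1, 15), 50),
    (datetime(2024, 8, 9, 1, 31), 40),
]
# Assumed 30-minute interval: triggers at 01:31:00 with parallelism 40.
delayed = replay(100, timeline, timedelta(minutes=30))
# Interval of zero: triggers at the first recommendation, 01:00:00 with 60.
immediate = replay(100, timeline, timedelta(0))
```

      With a 1-hour interval the same timeline triggers nothing, because the last recommendation arrives only 31 minutes after the recorded start time.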

            People

              fanrui Rui Fan