Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-30773

Add API for rescaling of jobs based on per-vertex parallelism overrides

    XMLWordPrintableJSON

Details

    Description

      FLINK-29501 introduced a way to rescale jobs via a user-provided parallelism overrides map. This feature is already used today by the Autoscaler of the Flink Kubernetes operator. However, it requires a full restart of the Flink job and only supports the application deployment mode.

      In a K8s environment, this is inefficient because all pods for a deployment will be surrendered. Upon restart, they have to be re-acquired. In addition to being slow, this can also lead to situations where resource constraints prevent a restart from executing properly.

      Ideally, we would would want the following to happen on receiving a rescale request:

      1. Rescale API receives a request with a parallelism overrides map (vertexId => parallelism) for a jobId
      2. Compute the number of required task slots using the overrides and the current JobGraph
        1. If the total number of task slots for the cluster is less than the required number of task slots of the rescale, acquire the missing task slots. Otherwise, do nothing
        2. Wait for new task slots to become available
        3. Abort rescale request on timeout
      3. Redeploy the JobGraph / Tasks with the updated parallelisms
      4. Surrender any unused task slots in case of scaling down

       

      There is an existing "Rescaling" API which is currently disabled. We have to evaluate whether reusing it makes sense.

      Attachments

        1. meces.patch
          767 kB
          Pedro Cardoso Silva

        Issue Links

          Activity

            People

              mxm Maximilian Michels
              mxm Maximilian Michels
              Votes:
              0 Vote for this issue
              Watchers:
              14 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: