Apache YuniKorn / YUNIKORN-1185

Small applications starve large ones in the same FIFO queue


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: core - scheduler

    Description

      Even when I set my queue to use a fifo application sort policy, applications that enter the queue later are able to run before applications that are submitted earlier; the queue does not behave like a first-in, first-out queue.

      Specifically, this happens when the later applications are smaller than the earlier ones. If enough small applications are available in the queue to immediately fill any space that opens up, they are scheduled as soon as that space becomes available. YuniKorn never waits for enough space to accumulate to schedule a waiting large application, no matter how much older it is than the applications passing it in the queue.

      The result of this is that a steady supply of small applications can keep a larger application waiting indefinitely, causing starvation.

      The relevant code seems to be in Queue's tryAllocate method. YuniKorn goes through all the applications in the queue in order and greedily schedules work items until no more fit. If no space currently exists that is large enough to fit any work from the first application, it will always fill what space there is with work from applications later in the queue; it will never wait for space on a node to drain out so that work from that first application fits.
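
      As a rough illustration of the effect of that greedy pass (a standalone toy simulation, not YuniKorn code; the 96-core node and the 10-core/50-core job sizes are taken from the replication below), a steady backfill of small applications means the 50-core application never sees 50 free cores:

      # Toy model of greedy backfill on one 96-core node
      total=96
      used=90            # nine 10-core applications already running
      pending_small=20   # small applications queued behind the waiting 50-core application
      for tick in $(seq 1 20); do
          used=$((used - 10))                 # one 10-core application finishes
          free=$((total - used))
          if (( free >= 50 )); then
              echo "tick ${tick}: the 50-core application finally fits (free=${free})"
              break
          fi
          if (( pending_small > 0 )); then
              used=$((used + 10))             # greedy pass: a later small application takes the freed space at once
              pending_small=$((pending_small - 1))
              echo "tick ${tick}: small application backfills; free back to $((total - used)), 50-core application still waits"
          fi
      done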

      How can I configure or modify YuniKorn to prevent starvation, and make the applications in a queue execute in order, or at least not arbitrarily far out of order?

      (I already tried the stateaware application sort policy, but it doesn't seem to work well with applications as small as mine: it appeared to run only one application at a time, because my applications finish so quickly.)
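
      One way to check that the child queue really received the fifo sort policy is to query the scheduler's REST API. The service name, namespace, and port below are assumptions based on a default Helm install, and the endpoint path may differ between YuniKorn versions:

      kubectl -n yunikorn port-forward svc/yunikorn-service 9080:9080 &
      curl -s http://localhost:9080/ws/v1/queues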

      Replication

      First, have a Kubernetes cluster with a node named k1.kube that has 96 cores.
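
      To double-check the node's capacity before the test, you can print its allocatable CPU (the node name is the one from this description; adjust for your cluster):

      kubectl get node k1.kube -o jsonpath='{.status.allocatable.cpu}{"\n"}'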

      Next, set up YuniKorn 0.12.2 with this values.yml for the Helm chart:

       

      embedAdmissionController: false
      configuration: |
        partitions:
          -
            name: default
            placementrules:
              - name: tag
                value: namespace
                create: true
            queues:
              - name: root
                submitacl: '*'
                childtemplate:
                  properties:
                    application.sort.policy: fifo
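
      With that values.yml saved locally, a deployment along these lines should reproduce the setup (the chart repository URL is the standard Apache one; the release and namespace names are arbitrary, and --version pins the chart to the release under test):

      helm repo add yunikorn https://apache.github.io/yunikorn-release
      helm repo update
      helm install yunikorn yunikorn/yunikorn --namespace yunikorn --create-namespace --version 0.12.2 -f values.yml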

      Then, run this script:

      #!/usr/bin/env bash
      # test-yunikorn.sh: Make sure YuniKorn prevents starvation
      set -e

      # Set this to annotate jobs other than the middle job
      OTHER_JOB_ANNOTATIONS=''
      # And similarly for the middle job
      MIDDLE_JOB_ANNOTATIONS=''

      # Where should we run?
      #NODE_SELECTOR='nodeSelector: {"kubernetes.io/hostname": "k1.kube"}'
      NODE_SELECTOR='affinity: {"nodeAffinity": {"requiredDuringSchedulingIgnoredDuringExecution": {"nodeSelectorTerms": [{"matchExpressions": [{"key": "kubernetes.io/hostname", "operator": "In", "values": ["k1.kube", "k2.kube", "k3.kube"]}]}]}}}'

      # How many 10-core jobs do we need to fill everywhere we will run?
      SCALE="30"

      # Clean up
      kubectl delete job -l app=yunikorntest || true

      # Make 10 core jobs that will block out our test job for at least 2 minutes
      # Make sure they don't all finish at once.
      rm -f jobs_before.yml
      for NUM in $(seq 1 ${SCALE}) ; do
      cat >>jobs_before.yml <<EOF
      apiVersion: batch/v1
      kind: Job
      metadata:
        name: presleep${NUM}
        labels:
          app: yunikorntest
        ${OTHER_JOB_ANNOTATIONS}
      spec:
        template:
          metadata:
            labels:
              app: yunikorntest
              applicationId: before-${NUM}
          spec:
            schedulerName: yunikorn
            ${NODE_SELECTOR}
            containers:
            - name: main
              image: ubuntu:20.04
              command: ["sleep",  "$(( $RANDOM % 20 + 120 ))"]
              resources:
                limits:
                  memory: 300M
                  cpu: 10000m
                  ephemeral-storage: 1G
                requests:
                  memory: 300M
                  cpu: 10000m
                  ephemeral-storage: 1G
            restartPolicy: Never
        ttlSecondsAfterFinished: 1000
      ---
      EOF
      done

      # How many jobs do we need to fill the cluster to compete against?
      COMPETING_JOBS="$((SCALE*20))"

      # And 10 core jobs that, if they all pass it, will keep it blocked out for 20 minutes
      # We expect it really to be blocked like 5-7-10 minutes if the SLA plugin is working.
      rm -f jobs_after.yml
      for NUM in $(seq 1 ${COMPETING_JOBS}) ; do
      cat >>jobs_after.yml <<EOF
      apiVersion: batch/v1
      kind: Job
      metadata:
        name: postsleep${NUM}
        labels:
          app: yunikorntest
        ${OTHER_JOB_ANNOTATIONS}
      spec:
        template:
          metadata:
            labels:
              app: yunikorntest
              applicationId: after-${NUM}
          spec:
            schedulerName: yunikorn
            ${NODE_SELECTOR}
            containers:
            - name: main
              image: ubuntu:20.04
              command: ["sleep",  "$(( $RANDOM % 20 + 60 ))"]
              resources:
                limits:
                  memory: 300M
                  cpu: 10000m
                  ephemeral-storage: 1G
                requests:
                  memory: 300M
                  cpu: 10000m
                  ephemeral-storage: 1G
            restartPolicy: Never
        ttlSecondsAfterFinished: 1000
      ---
      EOF
      done

      # And the test job itself between them.
      rm -f job_middle.yml
      cat >job_middle.yml <<EOF
      apiVersion: batch/v1
      kind: Job
      metadata:
        name: middle
        labels:
          app: yunikorntest
        ${MIDDLE_JOB_ANNOTATIONS}
      spec:
        template:
          metadata:
            labels:
              app: yunikorntest
              applicationId: middle
          spec:
            schedulerName: yunikorn
            ${NODE_SELECTOR}
            containers:
            - name: main
              image: ubuntu:20.04
              command: ["sleep", "1"]
              resources:
                limits:
                  memory: 300M
                  cpu: 50000m
                  ephemeral-storage: 1G
                requests:
                  memory: 300M
                  cpu: 50000m
                  ephemeral-storage: 1G
            restartPolicy: Never
        ttlSecondsAfterFinished: 1000
      EOF

      kubectl apply -f jobs_before.yml
      sleep 10
      kubectl apply -f job_middle.yml
      sleep 10
      CREATION_TIME="$(kubectl get job middle -o jsonpath='{.metadata.creationTimestamp}')"
      kubectl apply -f jobs_after.yml
      # Wait for it to finish
      echo "Waiting for middle job to finish..."
      COMPLETION_TIME=""
      while [[ -z "${COMPLETION_TIME}" ]] ; do
          sleep 10
          JOB_STATE="$(kubectl get job middle -o jsonpath='{.status.succeeded}' || true)"
          if [[ "${JOB_STATE}" == "1" ]] ; then
              COMPLETION_TIME="$(kubectl get job middle -o jsonpath='{.status.completionTime}' || true)"
          fi
      done
      echo "Test large job was created at ${CREATION_TIME} and completed at ${COMPLETION_TIME}"
      

      You will see that YuniKorn runs the vast majority of the "postsleep" jobs before allowing the "middle" job to schedule and run, even though the "middle" job entered the queue before any of them. By increasing the number of "postsleep" jobs submitted, you can starve the "middle" job for an arbitrarily long time.
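
      To see the ordering directly once the run finishes, list the start times of the test pods (this inspection step is not part of the script above):

      kubectl get pods -l app=yunikorntest --sort-by=.status.startTime \
          -o custom-columns=NAME:.metadata.name,START:.status.startTime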


People

    • Assignee: Unassigned
    • Reporter: Adam Novak (adamnovak)