Apache YuniKorn / YUNIKORN-2070

E2E tests for gang_scheduling failed because container init was OOM-killed


Details

    Description

      Recently, several gang scheduling failures have occurred in the CI e2e tests. In every case the test was waiting for placeholders (with a 10M memory limit) to be created, but some placeholders failed with the following OOM-killed error:

      “Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: container init was OOM-killed (memory limit too low?): unknown” 

      The root cause might be the varying memory peak when the OCI runtime creates multiple containers. We can try raising the placeholder memory limit from 10M to 20M in the e2e tests. (The sleep jobs use 20M of memory.)
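
The proposed change can be sketched as follows. This is a minimal illustration, not the actual e2e test code: it assumes the placeholder memory is taken from the task group's `minResource` in the `yunikorn.apache.org/task-groups` annotation, and the `TaskGroup` type, the helper functions, and the group name `group-a` are hypothetical names introduced only for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TaskGroup is a hypothetical local type that mirrors only the task-group
// fields relevant here; it is not the type used by the YuniKorn code base.
type TaskGroup struct {
	Name        string            `json:"name"`
	MinMember   int               `json:"minMember"`
	MinResource map[string]string `json:"minResource"`
}

// placeholderTaskGroup builds a task group whose placeholders ask for the
// given amount of memory (for example "20M" instead of "10M").
func placeholderTaskGroup(name string, minMember int, memory string) TaskGroup {
	return TaskGroup{
		Name:        name,
		MinMember:   minMember,
		MinResource: map[string]string{"memory": memory},
	}
}

// taskGroupsAnnotation serializes the task groups to the JSON string that
// would be stored as the annotation value.
func taskGroupsAnnotation(groups []TaskGroup) (string, error) {
	b, err := json.Marshal(groups)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Assumed fix: raise the placeholder memory from 10M to 20M.
	groups := []TaskGroup{placeholderTaskGroup("group-a", 3, "20M")}
	annotation, err := taskGroupsAnnotation(groups)
	if err != nil {
		panic(err)
	}
	fmt.Println(annotation)
}
```

Keeping the memory value as a single parameter makes it easy to try other limits if 20M still turns out to be too low for container init.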

      Some failed e2e test runs from the last 3 weeks:

      1. (Link) Target: 15 placeholders; 14 created, 1 OOM-killed.
      2. (Link) Target: 3 placeholders; 2 created, 1 OOM-killed.
      3. (Link) Target: 3 placeholders; 2 created, 1 OOM-killed.
      4. (Link) Target: 15 placeholders; 11 created, 4 OOM-killed.
         

            People

              Yu-Lin Chen