  1. Hadoop YARN
  2. YARN-9073

GPU/FPGA whitelist configuration in container-executor.cfg won't work when yarn-site.xml's allowed devices don't align with it

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      The current GPU/FPGA handling may have an issue when c-e.cfg (container-executor.cfg) doesn't align with yarn-site.xml. Take GPU as an example:

      One host has GPUs 1,2,3,4,5,6, and "GPU.allowed = 1,2,3" is configured in c-e.cfg. But yarn-site.xml is configured with "auto", which means GPUs 1,2,3,4,5,6 are all allowed.

      Suppose an application requests 4 GPUs and the scheduler allocates 1,2,4,5, so --excluded-gpus is "3". c-e checks that 3 is in its allowed list (1,2,3) and then denies only 3 in cgroups.

      In this case, c-e's allowed list (1,2,3) has no effect, because the application can now access 4,5,6.
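
The scenario above can be reproduced with a small model. This is only a sketch of the behavior as described in this report, not the actual container-executor code; the function name `devices_visible_to_container` and its parameters are hypothetical, and the excluded list is taken directly from the example rather than derived.

```python
def devices_visible_to_container(host_gpus, ce_allowed, excluded):
    """Model of the behavior described in this report (not the real c-e code)."""
    # c-e validates each excluded device against its own allowed list...
    denied = {g for g in excluded if g in ce_allowed}
    # ...and denies only those in cgroups; every other host device stays visible.
    return sorted(set(host_gpus) - denied)


host_gpus = [1, 2, 3, 4, 5, 6]   # devices physically present on the host
ce_allowed = {1, 2, 3}           # "GPU.allowed = 1,2,3" in c-e.cfg
excluded = [3]                   # --excluded-gpus value quoted in the report

visible = devices_visible_to_container(host_gpus, ce_allowed, excluded)
# Devices the container can reach although c-e.cfg never allowed them.
leaked = [g for g in visible if g not in ce_allowed]

print(visible)  # [1, 2, 4, 5, 6]
print(leaked)   # [4, 5, 6]
```

Under these assumptions the allowed list in c-e.cfg only filters which devices get denied; it never restricts which devices remain accessible, which is why 4,5,6 slip through.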

            People

              Assignee: tangzhankun Zhankun Tang
              Reporter: tangzhankun Zhankun Tang