Mesos / MESOS-8038

Launching GPU task sporadically fails.


Details

    • Type: Bug
    • Status: Reviewable
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 1.4.0
    • Fix Version/s: None
    • Component/s: containerization, gpu
    • Labels: None

    Description

      I was running a job that uses GPUs. It runs fine most of the time,
      but occasionally I see the following message in the Mesos log:
      "Collect failed: Requested 1 but only 0 available"
      This is followed by the executor being killed and the tasks being lost. It happens even before the job starts. A quick search of the code base suggests that GPU resource handling is the probable cause.

      There is no deterministic way to reproduce this; it happens only occasionally.
      I have attached the slave log for the issue.

      Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave.
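
      To check whether a given agent log contains this failure, a minimal sketch along the following lines could be used. It relies only on the message text quoted above; the function name `find_gpu_collect_failures` is hypothetical, and nothing is assumed about the rest of the log line format.

```python
import re
from typing import Iterable, List, Tuple

# Pattern built from the message quoted in the description. The prefix of
# each log line (timestamp, source file, etc.) is deliberately not assumed.
PATTERN = re.compile(r"Collect failed: Requested (\d+) but only (\d+) available")


def find_gpu_collect_failures(lines: Iterable[str]) -> List[Tuple[int, int, int]]:
    """Return (line_number, requested, available) for every matching line.

    Hypothetical helper for illustration; it is not part of Mesos.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        match = PATTERN.search(line)
        if match:
            hits.append((number, int(match.group(1)), int(match.group(2))))
    return hits
```

      Passing an open log file (for example one of the attached slave logs) as `lines` yields one tuple per occurrence, which may help correlate the failure with the task launches that precede it.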

      Attachments

        1. mesos_agent.log
          554 kB
          Charles Natali
        2. mesos-master.log
          968 kB
          Sai Teja Ranuva
        3. mesos-slave.INFO.log
          213 kB
          Sai Teja Ranuva
        4. mesos-slave-with-issue-uber.txt
          3.81 MB
          Zhitao Li
        5. start_short_tasks_gpu.py
          2 kB
          Charles Natali


          People

            Zhitao Li (zhitao)
            Sai Teja Ranuva (saitejar)
            Gilbert Song
