Mesos / MESOS-8444

GC failure causes agent miss to detach virtual paths for the executor's sandbox


Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 1.5.0, 1.6.0
    • agent
    • None
    • Mesosphere Sprint 72
    • 2

    Description

      I launched a task via mesos-execute that just ran sleep 10. When the task finished, Slave::removeExecutor() and Slave::removeFramework() were called, and they try to garbage-collect three directories:

      1. /<slave-work-dir>/slaves/<slaveID>/frameworks/<frameworkID>/executors/<executorID>/runs/<containerID>
      2. /<slave-work-dir>/slaves/<slaveID>/frameworks/<frameworkID>/executors/<executorID>
      3. /<slave-work-dir>/slaves/<slaveID>/frameworks/<frameworkID>

      For 1 and 2, the code that garbage-collects them looks like this:

        garbageCollect(path)
          .then(defer(self(), &Self::detachFile, path));
      

      Because then() is used here, the detach happens only when the GC succeeds. The problem is that the order in which the GC deletes 1, 2 and 3 is not guaranteed; in my tests, 3 was deleted first most of the time. Since 3 is the parent directory of 1 and 2, the GC of 1 and 2 then fails:

      I0111 00:19:33.001655 42889 gc.cpp:208] Deleting /home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000
      I0111 00:19:33.002576 42889 gc.cpp:218] Deleted '/home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000'
      I0111 00:19:33.004551 42893 gc.cpp:208] Deleting /home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000/executors/default-executor/runs/b067936a-f4c4-4091-b786-4dd4d4d6da15
      W0111 00:19:33.004622 42893 gc.cpp:212] Failed to delete '/home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000/executors/default-executor/runs/b067936a-f4c4-4091-b786-4dd4d4d6da15': No such file or directory
      I0111 00:19:33.006367 42923 gc.cpp:208] Deleting /home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000/executors/default-executor
      W0111 00:19:33.006466 42923 gc.cpp:212] Failed to delete '/home/qzhang/opt/mesos/slaves/9dea9207-5730-4f7a-b9a5-f772e035253b-S0/frameworks/c6f6659d-a402-41e3-891a-aaaa0c887a3b-0000/executors/default-executor': No such file or directory
      

      As a result, the agent does NOT detach the virtual paths for 1 and 2, which is a leak.

            People

              Qian Zhang (qianzhang)
              Qian Zhang (qianzhang)
              Vinod Kone