Mesos / MESOS-2684

mesos-slave should not abort when a single task has e.g. a 'mkdir' failure


Details

    • Type: Bug
    • Status: Accepted
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.21.1
    • Fix Version/s: None
    • Component/s: agent, docker
    • Labels: None

    Description

      mesos-slave can encounter a variety of problems while attempting to launch a task. If the task fails, that is unfortunate, but not the end of the world. Other tasks should not be affected.

      However, if the task failure happens to trigger an assertion, the entire slave comes crashing down:

      F0501 19:10:46.095464 1705 paths.hpp:342] CHECK_SOME(mkdir): No space left on device Failed to create executor directory '/mnt/mesos/slaves/20150327-194449-419644938-5050-1649-S71/frameworks/Singularity/executors/pp-gc-eventlog-teamcity.2015.03.31T23.55.14-1430507446029-2-10.70.8.160-us_west_2b/runs/95a54aeb-322c-48e9-9f6f-5b359bccbc01'

      Immediately afterwards, all tasks on this slave were declared TASK_KILLED when mesos-slave restarted.

      Something as simple as a 'mkdir' failing is not worthy of an assertion failure.

      Attachments

        1. mesos-slave-restart.txt
          12 kB
          Steven Schlansker

            People

              Assignee: Unassigned
              Reporter: Steven Schlansker (stevenschlansker)
              Votes: 0
              Watchers: 8

              Dates

                Created:
                Updated: