Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.22.0
    • Fix Version/s: 0.22.1
    • Component/s: agent
    • Labels: None

      Description

      Tasks occasionally become stuck in the `TASK_STAGING` state after launching. This appears to affect both Docker and non-Docker tasks, especially those which start up and fail immediately. Attached is a sample of the slave log, as well as screenshots from a testing cluster showing the tasks stuck in staging and then a number of failed tasks which occur after restarting the slave process. Justin Bieber is provided for scale.

      This may be related to MESOS-1837, and quite possibly the same issue, but it remains unclear.

      1. Justin-Bieber_The-Beliebers-Want-to-Believe-2-650x406.jpg
        76 kB
        Brenden Matthews
      2. log.txt
        20 kB
        Brenden Matthews
      3. Screen Shot 2015-03-26 at 11.59.33 AM.png
        293 kB
        Brenden Matthews
      4. Screen Shot 2015-03-30 at 2.04.14 PM.png
        223 kB
        Brenden Matthews


          Activity

          Jie Yu added a comment -

          Might be related to MESOS-998 and this patch:
          https://reviews.apache.org/r/31024

          Jie Yu added a comment -

          Looking at your log, it seems that the container "d4302815-482f-4c26-b2a9-f34b7c032dc9" is being destroyed, and we saw "Running docker stop on container 'd4302815-482f-4c26-b2a9-f34b7c032dc9'". However, the slave never enters 'executorTerminated'.

          Do you have log for Mesos containerizer?

          Jie Yu added a comment -

          I mean non-Docker tasks.

          Brenden Matthews added a comment -

          Not at the moment, but if I manage to witness it, I will be sure to attach the log here.

          Jie Yu added a comment -

          Vinod and I triaged it. From the log you pasted, we don't understand why 'docker stop' does not trigger executorTerminated. Without that, the slave won't send TASK_LOST. cc Timothy Chen

          Joris Van Remoortere added a comment -

          container "d4302815-482f-4c26-b2a9-f34b7c032dc9" is associated with the canary executor, not the hdfs-canary executor though, right?
          Is there a relationship between them that i'm not understanding?

          Brenden Matthews added a comment -

          Those are indeed 2 distinct tasks.

          Jie Yu added a comment -

          d4302815-482f-4c26-b2a9-f34b7c032dc9 is associated with task ct:1427384616000:0:canary.

          The hdfs-canary is irrelevant.

          Joris Van Remoortere added a comment -

          For the hdfs-canary example, it seems we are stopping execution somewhere between:
          START) https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L1516, which dispatches to https://github.com/apache/mesos/blob/master/src/exec/exec.cpp#L280
          and
          STOP) https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L146
          Timothy Chen added a comment -

          It should indeed trigger it, as the task should have been stopped and the executor will have exited on the docker wait.
          I'll look into it more with Brenden to see what's going on.

          Timothy Chen added a comment -

          I think I understand what's going on; this most likely affects only the Docker containerizer.
          When a task is launched detached and fails before we launch the executor, the subsequent call to update the resources fails: the docker container isn't running when we try to find its pid, and the cgroups path can't be found because it was already removed.
          Meanwhile, the executor that was launched to run 'docker wait container-id' is still waiting for a RunTaskMessage before it starts docker-wait, so it just sits there. On the slave side, if we cannot update the containerizer we simply call destroy on it and trust that the executor will clean itself up.
          I think the fix for this is probably twofold:

          • We shouldn't fail update if the docker container has exited, which means we should not just return Failure. What we could do is perform an extra os::exists check when the cgroups update call fails, just to verify whether the pid has exited; if it does not exist, we return Nothing() instead (see the sketch after this list).
          • The executor that the Docker containerizer launched should get removed by containerizer->destroy, to ensure we don't keep idle executors around. This should be fixed in the future when we move docker->run right inside the executor, so that it removes itself when the container dies.
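
          A minimal, standalone C++ sketch of the first bullet (illustrative only, not the actual Mesos patch; pidExists, updateResources, and cgroupsUpdateOk are hypothetical stand-ins for the real update path): after a failed cgroups update, check whether the container's pid is still alive, and if it is gone, treat the update as a successful no-op rather than a failure.

          #include <cerrno>
          #include <csignal>
          #include <sys/types.h>

          // Returns true if a process with the given pid still exists.
          // kill() with signal 0 checks existence without sending a signal;
          // ESRCH means the process is gone (EPERM would mean it exists but
          // belongs to another user).
          static bool pidExists(pid_t pid)
          {
            return ::kill(pid, 0) == 0 || errno != ESRCH;
          }

          enum class UpdateResult { Ok, Failed };

          // Hypothetical update path: 'cgroupsUpdateOk' stands in for the
          // cgroups call that failed in this bug.
          UpdateResult updateResources(pid_t containerPid, bool cgroupsUpdateOk)
          {
            if (!cgroupsUpdateOk) {
              if (!pidExists(containerPid)) {
                // The container exited between launch and update; report
                // success (analogous to returning Nothing() in Mesos) and
                // let destroy() handle cleanup.
                return UpdateResult::Ok;
              }
              return UpdateResult::Failed;
            }
            return UpdateResult::Ok;
          }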
          Jie Yu added a comment -

          the executor that was launched to run 'docker wait container-id' is still waiting for a RunTaskMessage before it starts docker-wait

          That sounds like an implementation limitation.

          Timothy Chen added a comment -

          I'm still thinking about the right fix; it does indeed sound like we shouldn't just let the executor keep running. It feels like we should ask the executor to shut down if it is still present and running, since we're not going to send it anything anymore.

          Jie Yu added a comment -

          I think containerizer->destroy should properly handle all cases, killing both the docker container (docker stop) and the corresponding nanny executor (even if it hasn't received the first task).
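
          As a rough illustration of the destroy semantics described above (a hedged sketch, assuming the executor runs in its own process group; destroyContainer and its parameters are hypothetical, not Mesos APIs): stop the Docker container first, then unconditionally kill the nanny executor, so that one which never received a RunTaskMessage cannot linger.

          #include <csignal>
          #include <cstdlib>
          #include <string>
          #include <sys/types.h>

          void destroyContainer(const std::string& containerId, pid_t executorPid)
          {
            // 1. 'docker stop' sends SIGTERM and escalates to SIGKILL after
            //    a grace period, so the containerized task is guaranteed to
            //    terminate.
            std::string cmd = "docker stop " + containerId;
            std::system(cmd.c_str());

            // 2. Kill the executor's process group (negative pid) so that
            //    any children of the 'docker wait' executor die with it.
            //    Mesos itself uses the stout os::killtree() helper for the
            //    same purpose.
            ::kill(-executorPid, SIGKILL);
          }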

          Timothy Chen added a comment -

          https://reviews.apache.org/r/32796/
          https://reviews.apache.org/r/32797/
          https://reviews.apache.org/r/32798/
          Brenden Matthews added a comment -

          Tim's patches are up in our test cluster for verification.

          Timothy Chen added a comment -

          commit 00318fc1b30fc0961c2dfa4d934c37866577d801
          Author: Timothy Chen <tnachen@apache.org>
          Date: Wed Apr 1 16:56:53 2015 -0700

          Add test to verify executor clean up in docker containerizer.

          Review: https://reviews.apache.org/r/32798

          commit 879044096f80adb1799ce43acc1fc5ae58dfac69
          Author: Timothy Chen <tnachen@apache.org>
          Date: Wed Apr 1 16:29:54 2015 -0700

          Kill the executor when docker container is destroyed.

          Review: https://reviews.apache.org/r/32797

          commit 591d1090f69a7d9197aea355488c973c41881be7
          Author: Timothy Chen <tnachen@apache.org>
          Date: Wed Apr 1 16:03:43 2015 -0700

          Only update docker container when resources differs.

          Review: https://reviews.apache.org/r/32796


            People

            • Assignee:
              Timothy Chen
            • Reporter:
              Brenden Matthews
            • Votes:
              1
            • Watchers:
              8

