  1. Mesos
  2. MESOS-7777

Agent failed to recover due to mount namespace leakage in Docker 1.12/1.13

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.3, 1.2.2, 1.3.1, 1.4.0
    • Component/s: docker
    • Labels:
      None
    • Sprint:
      Mesosphere Sprint 59
    • Story Points:
      3

      Description

      Since version 1.12, Docker has used "shared" as its default mount propagation to enable persistent volume plugins. However, Docker has a known issue (https://github.com/moby/moby/issues/25718) in which it sometimes leaks its mount namespace to other processes, which can make Mesos agents fail to remove Docker containers during recovery. The following log shows such a failure:

      I0615 09:39:11.083787  4573 docker.cpp:1002] Skipping recovery of executor 'kafka__7e49099d-7ab4-4435-a94a-1e849b8f2b70' of framework 44cbe3e9-984d-4073-b523-0023b427f54d-0011 because its executor is not marked as docker and the docker container doesn't exist
      Failed to perform recovery: Collect failed: Collect failed: Failed to run 'docker -H unix:///var/run/docker.sock rm -v 2de71c5383cb887f3ee49de5a517545b0522e1bbcb5df618c7ddb8583fd1d12d': exited with status 1; stderr='Error response from daemon: Driver overlay failed to remove root filesystem 2de71c5383cb887f3ee49de5a517545b0522e1bbcb5df618c7ddb8583fd1d12d: remove /var/lib/docker/overlay/221725ec545d60492b5431bb49380d868f7a949aaa3acff49f7ffb5bddeb3385/merged: device or resource busy
      '
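One way to check whether the "device or resource busy" error comes from a mount that is still visible in another process's mount namespace is to scan `/proc/<pid>/mountinfo`. The following is only a minimal diagnostic sketch, not part of Mesos or Docker; the function names `mount_points` and `pids_holding` are hypothetical, and it assumes a Linux host where the caller may read other processes' `mountinfo` files (usually root).

```python
import glob
import os


def mount_points(mountinfo_text):
    """Return the mount point column from the text of a mountinfo file."""
    points = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        # The fifth field (index 4) of /proc/<pid>/mountinfo is the mount point.
        if len(fields) > 4:
            points.append(fields[4])
    return points


def pids_holding(path_fragment, proc_root="/proc"):
    """List PIDs whose mount table still contains a matching mount point."""
    holders = []
    for info in glob.glob(os.path.join(proc_root, "[0-9]*", "mountinfo")):
        try:
            with open(info) as f:
                text = f.read()
        except OSError:
            # The process exited or we lack permission; skip it.
            continue
        if any(path_fragment in point for point in mount_points(text)):
            holders.append(int(os.path.basename(os.path.dirname(info))))
    return sorted(holders)
```

Calling `pids_holding` with the overlay directory name from the error message lists the processes that still see the `merged` mount; any PID outside the Docker daemon itself is a candidate for the leak.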
      To work around this, do the following:
      Step 1: rm -f /var/lib/mesos/slave/meta/slaves/latest
      This ensures that the agent does not try to recover old live executors.
      Step 2: Restart the agent.
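Step 1 could be scripted roughly as follows. This is a sketch under assumptions: the function name `forget_latest_agent` is hypothetical, the default path is the one quoted in the steps above, and `latest` is assumed to be a symbolic link (which the use of `rm -f` without `-r` suggests). Restarting the agent is left to the operator because the service name depends on how Mesos was installed.

```python
import os


def forget_latest_agent(meta_dir="/var/lib/mesos/slave/meta"):
    """Remove the `slaves/latest` link so the agent skips recovery.

    Returns True if the link was removed, False if it was already absent.
    """
    latest = os.path.join(meta_dir, "slaves", "latest")
    try:
        # Removes the link itself; the directory it points to is left alone.
        os.remove(latest)
    except FileNotFoundError:
        return False
    return True
```

Like `rm -f`, the function treats an already missing link as a non-error, so it is safe to run more than once before restarting the agent.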
      

          Activity

          chhsia0 Chun-Hung Hsiao added a comment -

          Patch here for review: https://reviews.apache.org/r/60846/

          chhsia0 Chun-Hung Hsiao added a comment -

          After discussing with Jie Yu, I created a new patch for backporting first: https://reviews.apache.org/r/60879/,
          and will refactor the above patch and land it on master without backporting.

          jieyu Jie Yu added a comment -

          commit d4d75d32de8280eaa75fecd64b6c67db07299a5a
          Author: Chun-Hung Hsiao <chhsiao@mesosphere.io>
          Date: Fri Jul 14 13:57:24 2017 -0700

          Preventing agent recovery failing from unsuccessful `docker rm`.

          This patch changes the semantics of `Docker::stop()` to do a best-effort
          `docker rm` and log the error instead of returning the failure, as a
          workaround of the mount namespace leakage issue in Docker 1.12/1.13.

          Review: https://reviews.apache.org/r/60879/
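The behavior change described in the commit message can be illustrated outside of Mesos. The sketch below is not the actual C++ `Docker::stop()` code; it is a hypothetical Python helper, `best_effort_rm`, that runs `docker rm -v` and logs a failure instead of propagating it. The injectable `run` parameter is an assumption added only to make the sketch testable without a Docker daemon.

```python
import logging
import subprocess

log = logging.getLogger("docker-stop-sketch")


def best_effort_rm(container_id, run=subprocess.run):
    """Try `docker rm -v`; on failure, log the error and carry on.

    Returns True when the container was removed, False otherwise.
    """
    result = run(
        ["docker", "rm", "-v", container_id],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Returning the failure here is what used to abort agent recovery.
        log.error("docker rm %s failed: %s", container_id, result.stderr.strip())
        return False
    return True
```

With this shape, a container whose root filesystem is still busy stays behind and is only reported in the log, while the caller continues with the rest of recovery.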

          chhsia0 Chun-Hung Hsiao added a comment -

          Patch refactored: https://reviews.apache.org/r/60887/ https://reviews.apache.org/r/60846/

          gilbert Gilbert Song added a comment -

          Chun-Hung Hsiao, Jie Yu: I am closing this JIRA for the workaround. Let's create another JIRA to track the refactoring and retry logic.


            People

            • Assignee:
              chhsia0 Chun-Hung Hsiao
            • Reporter:
              chhsia0 Chun-Hung Hsiao
            • Shepherd:
              Jie Yu