Uploaded image for project: 'Mesos'
  1. Mesos
  2. MESOS-5239

Persistent volume DockerContainerizer support assumes proper mount propagation setup on the host.

    XMLWordPrintableJSON

Details

    Description

      We recently added persistent volume support in DockerContainerizer (MESOS-3413). To understand the problem, we first need to understand how persistent volumes are supported in DockerContainerizer.

      To support persistent volumes in DockerContainerizer, we bind mount persistent volumes under a container's sandbox ('container_path' has to be relative for persistent volumes). When the Docker container is launched, since we always add a volume (-v) for the sandbox, the persistent volumes will be bind mounted into the container as well (since Docker does a 'rbind').

      The assumption that the above works is that the Docker daemon should see those persistent volume mounts that Mesos mounts on the host mount table. It's not a problem if Docker daemon itself is using the host mount namespace. However, on systemd enabled systems, Docker daemon is running in a separate mount namespace and all mounts in that mount namespace will be marked as slave mounts due to this patch.

      So what that means is that: in order for it to work, the parent mount of agent's work_dir should be a shared mount when docker daemon starts. This is typically true on CentOS7, CoreOS as all mounts are shared mounts by default.

      However, this causes an issue with the 'filesystem/linux' isolator. To understand why, first I need to show you a typical problem when dealing with shared mounts. Let me explain that using the following commands on a CentOS7 machine:

      [root@core-dev run]# cat /proc/self/mountinfo
      24 60 0:19 / /run rw,nosuid,nodev shared:22 - tmpfs tmpfs rw,seclabel,mode=755
      [root@core-dev run]# mkdir /run/netns
      [root@core-dev run]# mount --bind /run/netns /run/netns
      [root@core-dev run]# cat /proc/self/mountinfo
      24 60 0:19 / /run rw,nosuid,nodev shared:22 - tmpfs tmpfs rw,seclabel,mode=755
      121 24 0:19 /netns /run/netns rw,nosuid,nodev shared:22 - tmpfs tmpfs rw,seclabel,mode=755
      [root@core-dev run]# ip netns add test
      [root@core-dev run]# cat /proc/self/mountinfo
      24 60 0:19 / /run rw,nosuid,nodev shared:22 - tmpfs tmpfs rw,seclabel,mode=755
      121 24 0:19 /netns /run/netns rw,nosuid,nodev shared:22 - tmpfs tmpfs rw,seclabel,mode=755
      162 121 0:3 / /run/netns/test rw,nosuid,nodev,noexec,relatime shared:5 - proc proc rw
      163 24 0:3 / /run/netns/test rw,nosuid,nodev,noexec,relatime shared:5 - proc proc rw
      

      As you can see above, there're two entries (/run/netns/test) in the mount table (unexpected). This will confuse some systems sometimes. The reason is because when we create a self bind mount (/run/netns -> /run/netns), the mount will be put into the same shared mount peer group (shared:22) as its parent (/run). Then, when you create another mount underneath that (/run/netns/test), that mount operation will be propagated to all mounts in the same peer group (shared:22), resulting an unexpected additional mount being created.

      The reason we need to do a self bind mount in Mesos is that sometimes, we need to make sure some mounts are shared so that it does not get copied when a new mount namespace is created. However, on some systems, mounts are private by default (e.g., Ubuntu 14.04). In those cases, since we cannot change the system mounts, we have to do a self bind mount so that we can set mount propagation to shared. For instance, in filesytem/linux isolator, we do a self bind mount on agent's work_dir.

      To avoid the self bind mount pitfall mentioned above, in filesystem/linux isolator, after we created the mount, we do a make-slave + make-shared so that the mount is its own shared mount peer group. In that way, any mounts underneath it will not be propagated back.

      However, that operation will break the assumption that the persistent volume DockerContainerizer support makes. As a result, we're seeing problem with persistent volumes in DockerContainerizer when filesystem/linux isolator is turned on.

      Attachments

        Activity

          People

            jieyu Jie Yu
            jieyu Jie Yu
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: