AURORA-1781

Sandbox taskfs setup fails (groupadd error)

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.16.0
    • Fix Version/s: None
    • Component/s: Docker, Executor
    • Labels:
      None

      Description

      I hit what smells like a permission issue with `/etc/group` when trying to use a Docker image (unified containerizer setup) with mesos-1.0.0 and aurora-0.16.0-rc2. I cannot reproduce the issue with mesos-0.28.2 and aurora-0.15.0.

      Failed to initialize sandbox: Failed to create group in sandbox for task image: Command '['groupadd', '-R', '/var/lib/mesos/slaves/5d28d0cc-2793-4471-82d5-e67276c53f70-S2/frameworks/20160221-001235-3801519626-5050-1-0000/executors/thermos-nobody-prod-jenkins-0-47cc7824-565b-4265-9ab4-9ba3f364ebed/runs/a3f78288-4865-4166-8685-1ad941562f2f/taskfs', '-g', '99', 'nobody']' returned non-zero exit status 10
      
      [root@mesos-master01of2 taskfs]# pwd
      /var/lib/mesos/slaves/5d28d0cc-2793-4471-82d5-e67276c53f70-S2/frameworks/20160221-001235-3801519626-5050-1-0000/executors/thermos-nobody-prod-jenkins-0-47cc7824-565b-4265-9ab4-9ba3f364ebed/runs/a3f78288-4865-4166-8685-1ad941562f2f/taskfs
      [root@mesos-master01of2 taskfs]# groupadd -R $PWD -g 99 nobody
      groupadd: cannot lock /etc/group; try again later.
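
For triage, the failing command and its exit status can be pulled out of the executor's error message. The following is a minimal sketch, not Aurora code: the helper name `explain_groupadd_failure` is hypothetical, and the exit-code table is copied from the groupadd(8) man page as shipped with shadow-utils. Status 10 is documented there as "can't update group file", which appears to cover both messages seen in this ticket ("cannot lock /etc/group" and "failure while writing changes to /etc/group"). Because of `-R`, the `/etc/group` in those messages should be the one under `taskfs`, not the host's.

```python
import ast
import re

# Exit codes as listed in the groupadd(8) man page (shadow-utils).
GROUPADD_EXIT_CODES = {
    0: "success",
    2: "invalid command syntax",
    3: "invalid argument to option",
    4: "GID not unique (when -o not used)",
    9: "group name not unique",
    10: "can't update group file",
}

# Matches the subprocess-style message quoted in this ticket.
_FAILED_COMMAND = re.compile(
    r"Command '(\[.*?\])' returned non-zero exit status (\d+)"
)


def explain_groupadd_failure(message):
    """Return (argv, exit_status, meaning) parsed from an error message, or None."""
    match = _FAILED_COMMAND.search(message)
    if match is None:
        return None
    # The argument list is printed as a Python list literal.
    argv = ast.literal_eval(match.group(1))
    status = int(match.group(2))
    meaning = GROUPADD_EXIT_CODES.get(status, "unknown exit status")
    return argv, status, meaning
```

Feeding it the "Failed to initialize sandbox" line gives back the `taskfs` path (the argument after `-R`), the GID and the group name, which is enough to rerun the command by hand as shown above.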
      

      Maybe related to AURORA-1761

      I'm running CoreOS with the mesos-agent (and thermos) inside docker. Here is the gist of how it's started.

      /usr/bin/sh -c "exec /usr/bin/docker run \
          --name=mesos_slave \
          --net=host \
          --pid=host \
          --privileged \
          -v /sys:/sys \
          -v /usr/bin/docker:/usr/bin/docker:ro \
          -v /var/lib/docker:/var/lib/docker \
          -v /var/run/docker.sock:/root/docker.sock \
          -v /run/systemd/system:/run/systemd/system \
          -v /lib64/libdevmapper.so.1.02:/lib/libdevmapper.so.1.02:ro \
          -v /sys/fs/cgroup:/sys/fs/cgroup \
          -v /var/lib/mesos:/var/lib/mesos \
          -e MESOS_CONTAINERIZERS=docker,mesos \
          -e MESOS_EXECUTOR_REGISTRATION_TIMEOUT=5mins \
          -e MESOS_WORK_DIR=/var/lib/mesos \
          -e MESOS_LOGGING_LEVEL=INFO \
          -e AMAZON_REGION=us-office-2 \
          -e AVAILABILITY_ZONE=us-office-2b \
          -e MESOS_ATTRIBUTES=\"platform:linux;host:$(hostname);rack:us-office-2b\" \
          -e MESOS_CLUSTER=ZeroZero \
          -e MESOS_DOCKER_SOCKET=/root/docker.sock \
          -e MESOS_MASTER=zk://10.150.150.224:2181,10.150.150.225:2181,10.150.150.226:2181/mesos \
          -e MESOS_LOG_DIR=/var/log/mesos \
          -e MESOS_ISOLATION=\"filesystem/linux,cgroups/cpu,cgroups/mem,docker/runtime\" \
          -e MESOS_IMAGE_PROVIDERS=docker \
          -e MESOS_IMAGE_PROVISIONER_BACKEND=copy \
          -e MESOS_DOCKER_REGISTRY=http://docker-registry:31000 \
          -e MESOS_DOCKER_STORE_DIR=/var/lib/mesos/docker \
          --entrypoint=/usr/sbin/mesos-slave \
          docker-registry.thebrighttag.com:31000/mesos:latest \
              --no-systemd_enable_support \
          || rm -f /var/lib/mesos/meta/slaves/latest"
      

        Activity

        jvenus Justin Venus added a comment -

        I can work around the issue with "--no-create-user" for the time being.

        a-nldisr Rogier Dikkes added a comment -

        Same issue:
        OS: CentOS Linux release 7.2.1511 (Core)
        Aurora: 0.16.0
        Mesos: 1.0.1

        I used hello_docker_image.aurora from https://github.com/apache/aurora/tree/master/examples/jobs as a test.

        I built the Aurora RPM from the aurora-packaging repository, using the 0.16.0 source distribution to create all packages.

        The error:
        8 minutes ago - FAILED : Failed to initialize sandbox: Failed to create group in sandbox for task image: Command '['groupadd', '-R', '/var/lib/mesos/slaves/ab28b3ed-85d1-4bce-898e-e57a5f332762-S2074/frameworks/ab28b3ed-85d1-4bce-898e-e57a5f332762-0000/executors/thermos-blauser-prod-hello_docker_image-0-f8232fb7-be9c-4910-bbb8-136ba369ce3f/runs/8bddc079-9a6d-4047-afe6-d4969dad2d4d/taskfs', '-g', '1000', 'blauser']' returned non-zero exit status 10

        When using the Vagrant image I did not run into this issue.

        What is in the Mesos log:
        I1004 18:07:38.698328 108146 fetcher.cpp:498] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/ab28b3ed-85d1-4bce-898e-e57a5f332762-S2074\/root","items":[{"action":"BYPASS_CACHE","uri":{"executable":true,"extract":true,"value":"\/usr\/bin\/thermos_executor"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/ab28b3ed-85d1-4bce-898e-e57a5f332762-S2074\/frameworks\/ab28b3ed-85d1-4bce-898e-e57a5f332762-0000\/executors\/thermos-blauser-prod-hello_docker_image-0-0639d3f6-5fab-4154-bef6-304d82a26de1\/runs\/831a4a74-6053-42df-b830-77660e5125c5","user":"root"}
        I1004 18:07:38.703634 108146 fetcher.cpp:409] Fetching URI '/usr/bin/thermos_executor'
        I1004 18:07:38.703665 108146 fetcher.cpp:250] Fetching directly into the sandbox directory
        I1004 18:07:38.703697 108146 fetcher.cpp:187] Fetching URI '/usr/bin/thermos_executor'
        I1004 18:07:38.703718 108146 fetcher.cpp:167] Copying resource with command:cp '/usr/bin/thermos_executor' '/var/lib/mesos/slaves/ab28b3ed-85d1-4bce-898e-e57a5f332762-S2074/frameworks/ab28b3ed-85d1-4bce-898e-e57a5f332762-0000/executors/thermos-blauser-prod-hello_docker_image-0-0639d3f6-5fab-4154-bef6-304d82a26de1/runs/831a4a74-6053-42df-b830-77660e5125c5/thermos_executor'
        I1004 18:07:38.718241 108146 fetcher.cpp:547] Fetched '/usr/bin/thermos_executor' to '/var/lib/mesos/slaves/ab28b3ed-85d1-4bce-898e-e57a5f332762-S2074/frameworks/ab28b3ed-85d1-4bce-898e-e57a5f332762-0000/executors/thermos-blauser-prod-hello_docker_image-0-0639d3f6-5fab-4154-bef6-304d82a26de1/runs/831a4a74-6053-42df-b830-77660e5125c5/thermos_executor'
        twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
        Writing log files to disk in /var/lib/mesos/slaves/ab28b3ed-85d1-4bce-898e-e57a5f332762-S2074/frameworks/ab28b3ed-85d1-4bce-898e-e57a5f332762-0000/executors/thermos-blauser-prod-hello_docker_image-0-0639d3f6-5fab-4154-bef6-304d82a26de1/runs/831a4a74-6053-42df-b830-77660e5125c5
        I1004 18:07:39.536164 108143 exec.cpp:161] Version: 1.0.0
        I1004 18:07:39.548815 108199 exec.cpp:236] Executor registered on agent ab28b3ed-85d1-4bce-898e-e57a5f332762-S2074
        groupadd: failure while writing changes to /etc/group
        FATAL] Failed to initialize sandbox: Failed to create group in sandbox for task image: Command '['groupadd', '-R', '/var/lib/mesos/slaves/ab28b3ed-85d1-4bce-898e-e57a5f332762-S2074/frameworks/ab28b3ed-85d1-4bce-898e-e57a5f332762-0000/executors/thermos-blauser-prod-hello_docker_image-0-0639d3f6-5fab-4154-bef6-304d82a26de1/runs/831a4a74-6053-42df-b830-77660e5125c5/taskfs', '-g', '1000', 'blauser']' returned non-zero exit status 10
        twitter.common.app debug: Shutting application down.
        twitter.common.app debug: Running exit function for twitter.common.log (Logging subsystem.)
        twitter.common.app debug: Finishing up module teardown.
        twitter.common.app debug: Active thread: <_MainThread(MainThread, started 140211855935296)>
        twitter.common.app debug: Active thread (daemon): <_DummyThread(Dummy-2, started daemon 140211681986304)>
        twitter.common.app debug: Exiting cleanly.
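
One way to narrow this down is to check whether the filesystem backing `taskfs/etc` supports the basic operations a tool needs to update a group file in place. The sketch below is a hedged probe, not part of Aurora or Mesos: `probe_directory` is a hypothetical helper, it targets Python 3, and the choice of operations (create, hard link, rename) rests on the assumption that the shadow tools lock with a hard link and commit with a rename, which has not been verified against the versions in these images.

```python
import os


def probe_directory(path):
    """Try create, hard-link and rename inside `path`.

    Returns a dict mapping each step to None on success or to the error text.
    """
    src = os.path.join(path, ".probe.%d" % os.getpid())
    lock = src + ".lock"
    new = src + ".new"
    steps = [
        ("create", lambda: open(src, "w").close()),
        ("link", lambda: os.link(src, lock)),
        ("rename", lambda: os.rename(src, new)),
    ]
    results = {}
    for name, step in steps:
        try:
            step()
            results[name] = None
        except OSError as exc:
            results[name] = str(exc)
    # Remove whatever the probe managed to create.
    for leftover in (src, lock, new):
        try:
            os.remove(leftover)
        except OSError:
            pass
    return results
```

Run as root on the agent against the `etc` directory of a failed run's `taskfs`; a step that reports an error there but succeeds in a plain host directory would point at the provisioned filesystem rather than at the executor.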

        a-nldisr Rogier Dikkes added a comment -

        I disabled SELinux on a couple of hosts, and since then the issue has gone away on those hosts. It still persists on one host even with SELinux disabled, so I suspect some other security setting is causing it there.

        joshua.cohen Joshua Cohen added a comment -

        Justin Venus, trying to find a common thread: do you have SELinux enabled on the hosts where you originally saw this problem?

        jvenus Justin Venus added a comment -

        Joshua Cohen, yes, SELinux is enabled.

        CoreOS stable (1068.9.0)
        Last login: Wed Oct  5 14:20:03 2016 from 10.111.254.195
        Update Strategy: No Reboots
        jvenus@mesos-slave03of2 ~ $ sestatus
        SELinux status:                 enabled
        SELinuxfs mount:                /sys/fs/selinux
        SELinux root directory:         /etc/selinux
        Loaded policy name:             mcs
        Current mode:                   permissive
        Mode from config file:          permissive
        Policy MLS status:              enabled
        Policy deny_unknown status:     allowed
        Max kernel policy version:      30
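
Note that this output shows SELinux enabled but in permissive mode, where denials are logged rather than enforced, so on this host SELinux alone would not be expected to block groupadd. When comparing many hosts, the relevant fields can be extracted mechanically. This is a minimal sketch; `parse_sestatus` and `selinux_is_enforcing` are hypothetical helper names, and the parser only assumes the "label: value" layout shown above.

```python
def parse_sestatus(output):
    """Parse `sestatus` output into a dict keyed by the label before each colon."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            fields[key.strip()] = value.strip()
    return fields


def selinux_is_enforcing(fields):
    """True only when SELinux is enabled and the current mode is enforcing."""
    return (
        fields.get("SELinux status") == "enabled"
        and fields.get("Current mode") == "enforcing"
    )
```

Applied to the output above, `selinux_is_enforcing` returns False because the current mode is permissive.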
        
        kr0t Kostiantyn Bokhan added a comment -

        The same issue:
        CentOS Linux release 7.2.1511 (Core)
        Mesos: 1.1.0
        Aurora: 0.16.0
        SELinux: disabled

        pvcnt Vincent Primault added a comment -

        I have the exact same issue.
        Ubuntu: 14.04
        Mesos: 1.1.0
        Aurora: 0.16.0
        SELinux: disabled

        Also used the --no-create-user option as a workaround.

        StephanErb Stephan Erb added a comment -

        Joshua Cohen, or anyone else: any idea what could be the problem here? This is currently marked as a blocker for 0.17, but we don't have a fix in sight at the moment.

        joshua.cohen Joshua Cohen added a comment -

        I don't have any insight into the root cause here. Without being able to reproduce, it's hard to diagnose.

        That said, given that there's a workaround in using the --no-create-user flag to the executor, I don't think this should block the 0.17.0 release.

        StephanErb Stephan Erb added a comment -

        Unfortunately, I have removed the milestone from this one. We don't have a way to reproduce this yet so we cannot fix it in time for 0.17.0.

        ianschenck Ian Schenck added a comment -

        I believe I ran into this error, and it was related to the kernel (not sure you'd call it a bug). I'm betting it is resolved in kernels 3.18 and later. I found that neither /etc/group nor /etc/passwd could be written to on kernels before 3.18. When I switched to a newer kernel, groupadd worked.
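
To test this hypothesis across the affected hosts, the agent's kernel release can be compared against 3.18. This is a minimal sketch under that assumption only: the 3.18 threshold is the guess from this comment, not a confirmed fix boundary, and `kernel_version` and `predates_3_18` are hypothetical helper names.

```python
import platform
import re


def kernel_version(release):
    """Extract (major, minor) from a kernel release string such as '3.16.0-4-amd64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def predates_3_18(release=None):
    """True if the kernel is older than 3.18; defaults to the running kernel."""
    if release is None:
        release = platform.release()  # same string as `uname -r`
    return kernel_version(release) < (3, 18)
```

Reporters could run `predates_3_18()` on a failing agent and on a working one (for example the Vagrant image) and post both results here.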

        StephanErb Stephan Erb added a comment -

        Now that you mention it, I realize that I have observed something similar as well. Using Debian 7 (with backport kernel 3.16), we were not able to launch containers featuring newer versions such as Debian 8. Only after the upgrade to Debian 8 (with backport kernel 4.x) did the problem disappear.


          People

          • Assignee: Unassigned
          • Reporter: jvenus Justin Venus
          • Votes: 0
          • Watchers: 8
