Mesos / MESOS-2668

Slave fails to recover when there are still processes left in its cgroup

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.22.1
    • Fix Version/s: 0.22.1, 0.23.0
    • Component/s: None
    • Labels:
      None
    • Environment:

      0.22.1-rc5
      CoreOS

      Description

      mesos-slave process enters failed state with "A slave (or child process) is still running, please check the process(es) '{ ... }' listed in /sys/fs/cgroup/cpu,cpuacct/mesos/slave/cgroups.proc"

      Shortly after starting a test cluster, we are seeing mesos-slaves enter a permanent failed state within a short period (~10 minutes or so). Here is the failure we are seeing with mesos-slave:

      Apr 27 23:29:33 ip-10-229-44-239.ec2.internal systemd[1]: mesos-slave.service: main process exited, code=exited, status=1/FAILURE
      Apr 27 23:29:33 ip-10-229-44-239.ec2.internal systemd[1]: Unit mesos-slave.service entered failed state.
      Apr 27 23:29:33 ip-10-229-44-239.ec2.internal systemd[1]: mesos-slave.service failed.
      Apr 27 23:29:38 ip-10-229-44-239.ec2.internal systemd[1]: mesos-slave.service holdoff time over, scheduling restart.
      Apr 27 23:29:38 ip-10-229-44-239.ec2.internal systemd[1]: Stopping Mesos Slave...
      Apr 27 23:29:38 ip-10-229-44-239.ec2.internal systemd[1]: Starting Mesos Slave...
      Apr 27 23:29:38 ip-10-229-44-239.ec2.internal ping[9272]: PING leader.mesos (10.155.13.144) 56(84) bytes of data.
      Apr 27 23:29:38 ip-10-229-44-239.ec2.internal ping[9272]: 64 bytes from ip-10-155-13-144.ec2.internal (10.155.13.144): icmp_seq=1 ttl=49 time=0.277 ms
      Apr 27 23:29:38 ip-10-229-44-239.ec2.internal ping[9272]: --- leader.mesos ping statistics ---
      Apr 27 23:29:38 ip-10-229-44-239.ec2.internal ping[9272]: 1 packets transmitted, 1 received, 0% packet loss, time 0ms
      Apr 27 23:29:38 ip-10-229-44-239.ec2.internal ping[9272]: rtt min/avg/max/mdev = 0.277/0.277/0.277/0.000 ms
      Apr 27 23:29:38 ip-10-229-44-239.ec2.internal systemd[1]: Started Mesos Slave.
      Apr 27 23:29:38 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 23:29:38.908716  9274 logging.cpp:172] INFO level logging started!
      Apr 27 23:29:38 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 23:29:38.909025  9274 main.cpp:156] Build: 2015-04-25 01:51:59 by
      Apr 27 23:29:38 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 23:29:38.909044  9274 main.cpp:158] Version: 0.22.1
      Apr 27 23:29:38 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 23:29:38.909054  9274 main.cpp:161] Git tag: 0.22.1-rc5
      Apr 27 23:29:38 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 23:29:38.909065  9274 main.cpp:165] Git SHA: 13e0536a4522c5674abc920ee9b8597d83c5352a
      Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 23:29:39.012763  9274 containerizer.cpp:110] Using isolation: cgroups/cpu,cgroups/mem
      Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 23:29:39.027866  9274 linux_launcher.cpp:94] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
      Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 23:29:39.028177  9274 main.cpp:200] Starting Mesos slave
      Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
      Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@716: Client environment:host.name=ip-10-229-44-239.ec2.internal
      Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
      Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@724: Client environment:os.arch=3.19.0
      Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@725: Client environment:os.version=#2 SMP Fri Mar 6 00:23:51 UTC 2015
      Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@733: Client environment:user.name=(null)
      Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@741: Client environment:user.home=/root
      Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@753: Client environment:user.dir=/
      Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=leader.mesos:2181 sessionTimeout=10000 watcher=0x7f8393fa9ae0 sessionId=0 sessionPasswd=<null> context=0x7f8364000970 flags=0
      Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 23:29:39.028657  9274 slave.cpp:174] Slave started on 1)@10.229.44.239:5051
      Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 23:29:39.028688  9274 slave.cpp:194] Moving slave process into its own cgroup for subsystem: cpu
      Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 23:29:39,030:9274(0x7f83877c8700):ZOO_INFO@check_events@1703: initiated connection to server [10.155.13.144:2181]
      Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: A slave (or child process) is still running, please check the process(es) '{ 2242, 2243, 2244 }' listed in /sys/fs/cgroup/cpu,cpuacct/mesos/slave/cgroups.proc
      Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 23:29:39,032:9274(0x7f83877c8700):ZOO_INFO@check_events@1750: session establishment complete on server [10.155.13.144:2181], sessionId=0x274cfd1112750268, negotiated timeout=10000
      Apr 27 23:29:39 ip-10-229-44-239.ec2.internal systemd[1]: mesos-slave.service: main process exited, code=exited, status=1/FAILURE
      Apr 27 23:29:39 ip-10-229-44-239.ec2.internal systemd[1]: Unit mesos-slave.service entered failed state.
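
      The fatal message above comes from a startup check: the slave inspects its own cgroup's process list and aborts if anything besides itself is still there. As a rough illustration of that check (a Python sketch with a helper name of my own invention, not the actual C++ code in Mesos), reading the cgroup procs file and excluding the slave's own PID yields exactly the leftover PIDs the error reports:

```python
import os

def lingering_pids(procs_file, own_pid):
    """Return the PIDs listed in a cgroup procs file, excluding our own.

    Roughly mirrors the startup check that produced the error above: if
    any other process is still in the slave's cgroup, recovery aborts.
    """
    with open(procs_file) as f:
        pids = {int(line) for line in f if line.strip()}
    return sorted(pids - {own_pid})

# Hypothetical usage against the path from the log:
# leftover = lingering_pids(
#     "/sys/fs/cgroup/cpu,cpuacct/mesos/slave/cgroup.procs", os.getpid())
# if leftover:
#     raise SystemExit(
#         "A slave (or child process) is still running: %s" % leftover)
```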
      

        Issue Links

          Activity

          bmahler Benjamin Mahler added a comment -

          Linked in related tickets. cc Jie Yu / Ian Downes

          A slave (or child process) is still running, please check the process(es) '{ 2242, 2243, 2244 }' listed in /sys/fs/cgroup/cpu,cpuacct/mesos/slave/cgroups.proc
          

          What were the processes?

          jlingmann Jeremy Lingmann added a comment -

          Benjamin Mahler: details about the processes are below:

          ip-10-229-44-239 ~ # systemctl status mesos-slave
          ● mesos-slave.service - Mesos Slave
             Loaded: loaded (/opt/mesosphere/packages/mesos--153429420f929f668f5cc14a444492da96e510c7/dcos.target.wants_slave/mesos-slave.service)
             Active: activating (auto-restart) (Result: exit-code) since Tue 2015-04-28 01:54:22 UTC; 915ms ago
            Process: 13531 ExecStart=/opt/mesosphere/packages/mesos--153429420f929f668f5cc14a444492da96e510c7/sbin/mesos-slave (code=exited, status=1/FAILURE)
            Process: 13529 ExecStartPre=/bin/ping -c1 leader.mesos (code=exited, status=0/SUCCESS)
           Main PID: 13531 (code=exited, status=1/FAILURE)
             CGroup: /system.slice/mesos-slave.service
                     ├─2242 sh -c logs() {   docker logs --follow $1 &   pid=$!   docker wait $1 >/dev/null 2>&1   sleep 10   kill -TERM $pid >/dev/null 2>&1 & } log...
                     ├─2243 docker logs --follow mesos-2952957c-9fe6-4617-9f87-461f20eb2886
                     └─2244 docker wait mesos-2952957c-9fe6-4617-9f87-461f20eb2886
          
          jlingmann Jeremy Lingmann added a comment -

          Mesos slave flags, set through environment variables:

          MESOS_MASTER=zk://leader.mesos:2181/mesos
          MESOS_CONTAINERIZERS=docker,mesos
          MESOS_LOG_DIR=/var/log/mesos
          MESOS_EXECUTOR_REGISTRATION_TIMEOUT=5mins
          MESOS_ISOLATION=cgroups/cpu,cgroups/mem
          MESOS_WORK_DIR=/ephemeral/mesos-slave
          MESOS_RESOURCES=ports:[1025-2180,2182-3887,3889-5049,5052-8079,8082-65535]
          MESOS_SLAVE_SUBSYSTEMS=cpu,memory
          MESOS_HOSTNAME=ec2-54-224-67-160.compute-1.amazonaws.com
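
          The MESOS_RESOURCES value above uses Mesos's ports-range syntax: a comma-separated list of inclusive begin-end ranges. A small illustrative parser (the helper name is mine, not a Mesos API) shows how the value decomposes:

```python
def parse_ports(resource):
    """Parse a Mesos-style ports resource string such as
    'ports:[1025-2180,2182-3887]' into a list of (begin, end)
    inclusive ranges. Illustrative sketch only."""
    assert resource.startswith("ports:[") and resource.endswith("]")
    body = resource[len("ports:["):-1]
    ranges = []
    for part in body.split(","):
        begin, _, end = part.partition("-")
        ranges.append((int(begin), int(end)))
    return ranges
```

          Note the gaps in the configured value (e.g. 2181 and 5050-5051), which presumably keep ports used by cluster services such as ZooKeeper and the Mesos master/slave out of the offerable range.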
          
          bmahler Benjamin Mahler added a comment -

          Timothy Chen is it expected for the slave to leak those child docker commands?

          tnachen Timothy Chen added a comment -

          We do expect docker logs to keep running when the slave is restarted, as it redirects the existing Docker container's logs to the sandbox.
          docker wait is launched by a command executor and ideally should keep running under it so we can reattach and recover.

          jieyu Jie Yu added a comment -

          Timothy Chen I don't see an easy fix for this unless we relax this CHECK during slave startup. Ian Downes, what was the main motivation for adding this CHECK? Is it to make sure the slave's resource accounting (e.g., cpuacct, memory) is accurate, because you don't want to count the resources used by the previous slave?

          jieyu Jie Yu added a comment -

          Also, are you guys planning to cut another rc for 0.22.1 because of this bug?

          idownes Ian Downes added a comment -

          I'm completely okay with removing the CHECK; it's not necessary. I think my original motivation was simply that I didn't expect there to be any processes left after the slave restarted, i.e., I neglected to think about things like perf and du; it should have been just a LOG(INFO) or LOG(WARNING) right from the start.
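
          The change Ian describes, demoting the fatal startup CHECK to a warning so recovery continues past leftover helper processes, can be sketched as follows. This is an illustrative Python analogue with invented names, not the actual C++ patch:

```python
import logging

def check_orphans(leftover_pids, strict=False):
    """Illustrative analogue of the proposed fix.

    leftover_pids: PIDs found in the slave's cgroup besides its own.
    strict=True models the old behavior (CHECK failure: slave exits);
    strict=False models the fix (LOG(WARNING): recovery continues).
    Returns True when the cgroup was clean.
    """
    if not leftover_pids:
        return True
    msg = "A slave (or child process) is still running: %s" % leftover_pids
    if strict:
        # Old behavior: fatal, analogous to a failed CHECK.
        raise RuntimeError(msg)
    # New behavior: warn and let recovery proceed.
    logging.warning(msg)
    return False
```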

          idownes Ian Downes added a comment -

          Review: https://reviews.apache.org/r/33647
          adam-mesos Adam B added a comment -

          I'll cut a new rc6 once this is committed. Seems like a blocker for 0.22.1.

          idownes Ian Downes added a comment -

          Branch: refs/heads/master
          Commit: 56513a130d497f87caad31b38671defe7ee9e7a0
          Parents: 8b365a4
          Author: Ian Downes <idownes@twitter.com>
          Authored: Tue Apr 28 16:30:47 2015 -0700
          Committer: Ian Downes <idownes@twitter.com>
          Committed: Tue Apr 28 16:48:01 2015 -0700

          adam-mesos Adam B added a comment -

          Renamed this issue to make it more concise. Prepended the original title to the Description.


            People

            • Assignee: tnachen Timothy Chen
            • Reporter: jlingmann Jeremy Lingmann
            • Shepherd: Benjamin Hindman
            • Votes: 0
            • Watchers: 7
