  1. Mesos
  2. MESOS-2797

mesos-slave dies when it hits open file descriptor limit

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.22.1
    • None
    • None

    Description

      I'm running mesos-slave under systemd as part of Mesosphere's DCOS. The slave process repeatedly dies when it hits the system's open file descriptor limit of 1024. See the "mesos-slave.log" excerpt below.

      I stopped mesos-slave, removed the path specified in the slave log, and still got the same error. lsof shows that mesos-slave has several hundred pipes open. See the "lsof.log" excerpt below.
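      For anyone trying to confirm the same symptom without lsof, here is a rough sketch that tallies a process's open descriptors by kind and reads its file descriptor limit. It assumes Linux with /proc mounted and enough privilege to read the target process's fd directory; the function names are made up for illustration and are not part of Mesos.

```python
import os
import resource
import stat


def count_fds_by_kind(pid="self"):
    """Tally the open descriptors of a process by kind (Linux /proc only)."""
    fd_dir = f"/proc/{pid}/fd"
    counts = {}
    for name in os.listdir(fd_dir):
        try:
            # stat() follows the /proc link to the object behind the descriptor.
            mode = os.stat(os.path.join(fd_dir, name)).st_mode
        except OSError:
            # The descriptor was closed between listdir() and stat().
            continue
        if stat.S_ISFIFO(mode):
            kind = "pipe"
        elif stat.S_ISSOCK(mode):
            kind = "socket"
        elif stat.S_ISREG(mode):
            kind = "file"
        else:
            kind = "other"
        counts[kind] = counts.get(kind, 0) + 1
    return counts


def nofile_limit(pid=0):
    """Return the (soft, hard) RLIMIT_NOFILE of a process; pid 0 means this one."""
    return resource.prlimit(pid, resource.RLIMIT_NOFILE)
```

      Calling count_fds_by_kind(30306) and nofile_limit(30306) as root against the slave PID from the lsof excerpt would show how close the pipe count is to the soft limit.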

      ====mesos-slave.log====
      Jun 01 23:49:19 dcos-01 systemd[1]: mesos-slave.service holdoff time over, scheduling restart.
      Jun 01 23:49:19 dcos-01 systemd[1]: Stopping Mesos Slave...
      Jun 01 23:49:19 dcos-01 systemd[1]: Starting Mesos Slave...
      Jun 01 23:49:19 dcos-01 ping[14896]: PING leader.mesos (172.17.8.101) 56(84) bytes of data.
      Jun 01 23:49:19 dcos-01 ping[14896]: 64 bytes from dcos-01 (172.17.8.101): icmp_seq=1 ttl=64 time=0.023 ms
      Jun 01 23:49:19 dcos-01 ping[14896]: --- leader.mesos ping statistics ---
      Jun 01 23:49:19 dcos-01 ping[14896]: 1 packets transmitted, 1 received, 0% packet loss, time 0ms
      Jun 01 23:49:19 dcos-01 ping[14896]: rtt min/avg/max/mdev = 0.023/0.023/0.023/0.000 ms
      Jun 01 23:49:19 dcos-01 systemd[1]: Started Mesos Slave.
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.713110 14899 logging.cpp:172] INFO level logging started!
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.715564 14899 main.cpp:156] Build: 2015-05-19 18:43:41 by
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.715600 14899 main.cpp:158] Version: 0.22.1
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.715618 14899 main.cpp:165] Git SHA: dd082c8656eb6e93e091a12fc5cfee3700a61bb1
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.830142 14899 containerizer.cpp:110] Using isolation: cgroups/cpu,cgroups/mem
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.845340 14899 linux_launcher.cpp:94] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.845696 14899 main.cpp:200] Starting Mesos slave
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,845:14899(0x7f111ff43700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,846:14899(0x7f111ff43700):ZOO_INFO@log_env@716: Client environment:host.name=dcos-01
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,846:14899(0x7f111ff43700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,846:14899(0x7f111ff43700):ZOO_INFO@log_env@724: Client environment:os.arch=3.19.0
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,846:14899(0x7f111ff43700):ZOO_INFO@log_env@725: Client environment:os.version=#2 SMP Thu Mar 26 10:44:46 UTC 2015
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,846:14899(0x7f111ff43700):ZOO_INFO@log_env@733: Client environment:user.name=(null)
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,846:14899(0x7f111ff43700):ZOO_INFO@log_env@741: Client environment:user.home=/root
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,846:14899(0x7f111ff43700):ZOO_INFO@log_env@753: Client environment:user.dir=/
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,846:14899(0x7f111ff43700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=leader.mesos:2181 sessionTimeout=10000 watcher=0x7f11246c0140 sessionId=0 sessionPasswd=<null> context=0x7f1114000b40 flags=0
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.846161 14899 slave.cpp:174] Slave started on 1)@172.17.8.101:5051
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.846206 14899 slave.cpp:194] Moving slave process into its own cgroup for subsystem: cpu
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,855:14899(0x7f110bde7700):ZOO_INFO@check_events@1703: initiated connection to server [172.17.8.101:2181]
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,855:14899(0x7f110bde7700):ZOO_INFO@check_events@1750: session establishment complete on server [172.17.8.101:2181], sessionId=0x14d77b31175030e, negotiated timeout=10000
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.856979 14900 group.cpp:313] Group process (group(1)@172.17.8.101:5051) connected to ZooKeeper
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.857028 14900 group.cpp:790] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.857049 14900 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.869518 14900 detector.cpp:138] Detected a new leader: (id='16')
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.869675 14900 group.cpp:659] Trying to get '/mesos/info_0000000016' in ZooKeeper
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.870889 14900 detector.cpp:452] A new leading master (UPID=master@172.17.8.101:5050) is detected
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.875787 14899 slave.cpp:194] Moving slave process into its own cgroup for subsystem: memory
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.880331 14899 slave.cpp:322] Slave resources: ports:[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, 8182-65535]; cpus:4; mem:2933; disk:10823
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.880523 14899 slave.cpp:351] Slave hostname: dcos-01
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.880553 14899 slave.cpp:352] Slave checkpoint: true
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.883630 14903 state.cpp:35] Recovering state from '/var/lib/mesos/slave/meta'
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.883815 14900 status_update_manager.cpp:197] Recovering status update manager
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.883940 14904 containerizer.cpp:307] Recovering containerizer
      Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.883949 14907 docker.cpp:423] Recovering Docker containers
      Jun 01 23:49:24 dcos-01 mesos-slave[14899]: Failed to perform recovery: Collect failed: Collect failed: Failed to create pipe: Too many open files
      Jun 01 23:49:24 dcos-01 mesos-slave[14899]: To remedy this do as follows:
      Jun 01 23:49:24 dcos-01 mesos-slave[14899]: Step 1: rm -f /var/lib/mesos/slave/meta/slaves/latest
      Jun 01 23:49:24 dcos-01 mesos-slave[14899]: This ensures slave doesn't recover old live executors.
      Jun 01 23:49:24 dcos-01 mesos-slave[14899]: Step 2: Restart the slave.
      Jun 01 23:49:24 dcos-01 systemd[1]: mesos-slave.service: main process exited, code=exited, status=1/FAILURE
      Jun 01 23:49:24 dcos-01 systemd[1]: Unit mesos-slave.service entered failed state.
      Jun 01 23:49:24 dcos-01 systemd[1]: mesos-slave.service failed.
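
      The "Failed to create pipe: Too many open files" message in the log above is the text for errno EMFILE. As an illustration only (this is not the Mesos recovery code), the sketch below lowers the soft descriptor limit for the current process, opens pipes until the kernel refuses, and then restores the limit. It assumes a Unix-like system and that the chosen soft limit does not exceed the hard limit; the function name and the default of 64 are arbitrary.

```python
import errno
import os
import resource


def pipes_until_limit(soft_limit=64):
    """Lower RLIMIT_NOFILE, open pipes until the kernel refuses, then clean up.

    Returns (pipes_created, errno_of_failure).
    """
    old_soft, old_hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Assumption: soft_limit is not above the hard limit, or this raises.
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, old_hard))
    fds = []
    failure = None
    try:
        while True:
            # Each pipe consumes two descriptors (read end and write end).
            fds.extend(os.pipe())
    except OSError as exc:
        failure = exc.errno
    finally:
        for fd in fds:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, old_hard))
    return len(fds) // 2, failure
```

      With a limit of 1024, as in this report, a process that holds both ends of each pipe would fail after roughly 500 pipes, which is in line with the several hundred pipe descriptors in the lsof output.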

      ====lsof.log====
      mesos-sla 30306 root 563r FIFO 0,9 0t0 10642859 pipe
      mesos-sla 30306 root 564r FIFO 0,9 0t0 10642862 pipe
      mesos-sla 30306 root 565r FIFO 0,9 0t0 10642861 pipe
      mesos-sla 30306 root 566r FIFO 0,9 0t0 10642864 pipe
      mesos-sla 30306 root 567r FIFO 0,9 0t0 10642863 pipe
      mesos-sla 30306 root 568r FIFO 0,9 0t0 10642866 pipe
      mesos-sla 30306 root 569r FIFO 0,9 0t0 10642865 pipe
      mesos-sla 30306 root 570r FIFO 0,9 0t0 10642868 pipe
      mesos-sla 30306 root 571r FIFO 0,9 0t0 10642867 pipe
      mesos-sla 30306 root 572r FIFO 0,9 0t0 10642879 pipe
      mesos-sla 30306 root 573r FIFO 0,9 0t0 10642869 pipe
      mesos-sla 30306 root 574r FIFO 0,9 0t0 10642881 pipe
      mesos-sla 30306 root 575r FIFO 0,9 0t0 10642880 pipe
      mesos-sla 30306 root 576r FIFO 0,9 0t0 10642883 pipe
      mesos-sla 30306 root 577r FIFO 0,9 0t0 10642882 pipe
      mesos-sla 30306 root 578r FIFO 0,9 0t0 10642891 pipe
      mesos-sla 30306 root 579r FIFO 0,9 0t0 10642884 pipe
      mesos-sla 30306 root 580r FIFO 0,9 0t0 10642893 pipe
      mesos-sla 30306 root 581r FIFO 0,9 0t0 10642892 pipe
      mesos-sla 30306 root 582r FIFO 0,9 0t0 10642895 pipe
      mesos-sla 30306 root 583r FIFO 0,9 0t0 10642894 pipe
      mesos-sla 30306 root 584r FIFO 0,9 0t0 10642899 pipe
      mesos-sla 30306 root 585r FIFO 0,9 0t0 10642896 pipe
      mesos-sla 30306 root 586r FIFO 0,9 0t0 10642901 pipe
      mesos-sla 30306 root 587r FIFO 0,9 0t0 10642900 pipe
      mesos-sla 30306 root 588r FIFO 0,9 0t0 10642904 pipe
      mesos-sla 30306 root 589r FIFO 0,9 0t0 10642902 pipe
      mesos-sla 30306 root 590r FIFO 0,9 0t0 10642906 pipe
      mesos-sla 30306 root 591r FIFO 0,9 0t0 10642905 pipe
      mesos-sla 30306 root 592r FIFO 0,9 0t0 10642908 pipe
      mesos-sla 30306 root 593r FIFO 0,9 0t0 10642907 pipe
      mesos-sla 30306 root 594r FIFO 0,9 0t0 10642910 pipe
      mesos-sla 30306 root 595r FIFO 0,9 0t0 10642909 pipe
      mesos-sla 30306 root 596r FIFO 0,9 0t0 10642918 pipe
      mesos-sla 30306 root 597r FIFO 0,9 0t0 10642911 pipe
      mesos-sla 30306 root 598r FIFO 0,9 0t0 10642920 pipe
      mesos-sla 30306 root 599r FIFO 0,9 0t0 10642919 pipe
      mesos-sla 30306 root 600r FIFO 0,9 0t0 10642922 pipe
      mesos-sla 30306 root 601r FIFO 0,9 0t0 10642921 pipe
      mesos-sla 30306 root 602r FIFO 0,9 0t0 10642924 pipe
      mesos-sla 30306 root 603r FIFO 0,9 0t0 10642923 pipe
      mesos-sla 30306 root 604r FIFO 0,9 0t0 10642926 pipe
      mesos-sla 30306 root 605r FIFO 0,9 0t0 10642925 pipe
      mesos-sla 30306 root 606r FIFO 0,9 0t0 10642928 pipe
      mesos-sla 30306 root 607r FIFO 0,9 0t0 10642927 pipe
      mesos-sla 30306 root 608r FIFO 0,9 0t0 10642933 pipe
      mesos-sla 30306 root 609r FIFO 0,9 0t0 10642929 pipe
      mesos-sla 30306 root 610r FIFO 0,9 0t0 10642935 pipe
      mesos-sla 30306 root 611r FIFO 0,9 0t0 10642934 pipe
      mesos-sla 30306 root 612r FIFO 0,9 0t0 10642937 pipe
      mesos-sla 30306 root 613r FIFO 0,9 0t0 10642936 pipe
      mesos-sla 30306 root 614r FIFO 0,9 0t0 10642939 pipe
      mesos-sla 30306 root 615r FIFO 0,9 0t0 10642938 pipe
      mesos-sla 30306 root 616r FIFO 0,9 0t0 10642949 pipe
      mesos-sla 30306 root 617r FIFO 0,9 0t0 10642940 pipe

          People

            chenlily Lily Chen
            mgummelt Michael Gummelt
            Timothy Chen Timothy Chen