Mesos / MESOS-4795

mesos agent not recovering after ZK init failure


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.24.1
    • Fix Version/s: None
    • Component/s: agent
    • Labels: None

    Description

      Here's the sequence of events that happened:

      -Agent running fine with 0.24.1
      -Transient ZK issues, slave flapping with zookeeper_init failure
      -ZK issue resolved
      -Most agents stop flapping and function correctly
      -Some agents continue flapping, exiting silently after printing the detector.cpp:481 log line.
      -The agents that continued to flap were repaired by manually removing the contents of mesos-slave's working dir
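
      The manual repair in the last step could be scripted roughly as follows. This is only a sketch under assumptions: the function name, its parameters, and the dry-run default are hypothetical, and it moves the working dir contents into a timestamped backup directory instead of deleting them, so the old state stays available for debugging. It assumes the mesos-slave process has already been stopped and that the caller supplies the real working dir path.

```python
import shutil
import time
from pathlib import Path
from typing import List


def move_aside_work_dir_contents(work_dir: str, backup_root: str,
                                 dry_run: bool = True) -> List[str]:
    """Move every entry under work_dir into a timestamped backup directory.

    Hypothetical helper, not part of Mesos. Returns the names of the
    entries that were (or, in dry-run mode, would be) moved.
    """
    src = Path(work_dir).resolve()
    if not src.is_dir():
        raise NotADirectoryError(work_dir)

    stamp = time.strftime("%Y%m%dT%H%M%S")
    dest = (Path(backup_root) / f"{src.name}-{stamp}").resolve()

    # Refuse to place the backup inside the directory being emptied.
    if dest == src or src in dest.parents:
        raise ValueError("backup_root must be outside work_dir")

    moved = []
    for entry in sorted(src.iterdir()):
        moved.append(entry.name)
        if not dry_run:
            dest.mkdir(parents=True, exist_ok=True)
            shutil.move(str(entry), str(dest / entry.name))
    return moved
```

      With the default dry_run=True the function only lists what it would move, which allows a check of the path before anything is touched.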

      Here's the contents of the various log files on the agent:

      The .INFO log file for one of the restarts, written before the mesos-slave process exited with no other error messages:

      Log file created at: 2016/02/09 02:12:48
      Running on machine: titusagent-main-i-7697a9c5
      Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
      I0209 02:12:48.502403 97255 logging.cpp:172] INFO level logging started!
      I0209 02:12:48.502938 97255 main.cpp:185] Build: 2015-09-30 16:12:07 by builds
      I0209 02:12:48.502974 97255 main.cpp:187] Version: 0.24.1
      I0209 02:12:48.503288 97255 containerizer.cpp:143] Using isolation: posix/cpu,posix/mem,filesystem/posix
      I0209 02:12:48.507961 97255 main.cpp:272] Starting Mesos slave
      I0209 02:12:48.509827 97296 slave.cpp:190] Slave started on 1)@10.138.146.230:7101
      I0209 02:12:48.510074 97296 slave.cpp:191] Flags at startup: --appc_store_dir="/tmp/mesos/store/appc" --attributes="region:us-east-1;<snip>" --authenticatee="<snip>" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" <snip>"
      I0209 02:12:48.511706 97296 slave.cpp:354] Slave resources: ports(*):[7150-7200]; mem(*):240135; cpus(*):32; disk(*):586104
      I0209 02:12:48.512320 97296 slave.cpp:384] Slave hostname: <snip>
      I0209 02:12:48.512368 97296 slave.cpp:389] Slave checkpoint: true
      I0209 02:12:48.516139 97299 group.cpp:331] Group process (group(1)@10.138.146.230:7101) connected to ZooKeeper
      I0209 02:12:48.516216 97299 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
      I0209 02:12:48.516253 97299 group.cpp:403] Trying to create path '/titus/main/mesos' in ZooKeeper
      I0209 02:12:48.520268 97275 detector.cpp:156] Detected a new leader: (id='209')
      I0209 02:12:48.520803 97284 group.cpp:674] Trying to get '/titus/main/mesos/json.info_0000000209' in ZooKeeper
      I0209 02:12:48.520874 97278 state.cpp:54] Recovering state from '/mnt/data/mesos/meta'
      I0209 02:12:48.520961 97278 state.cpp:690] Failed to find resources file '/mnt/data/mesos/meta/resources/resources.info'
      I0209 02:12:48.523680 97283 detector.cpp:481] A new leading master (UPID=master@10.230.95.110:7103) is detected
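
      To find affected agents across a fleet, the log above could be checked mechanically. The sketch below parses lines in the format given in the log header ([IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg) and reports whether the last parsed record comes from a given source location, such as detector.cpp:481. The names LogRecord, parse_line, last_record, and ended_at are hypothetical helpers, not part of Mesos or glog.

```python
import re
from typing import Iterable, NamedTuple, Optional

# Pattern for the line format documented in the log header:
#   [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
GLOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<file>[^:\s]+):(?P<line>\d+)\] (?P<msg>.*)$"
)


class LogRecord(NamedTuple):
    level: str
    month: int
    day: int
    time: str
    thread: int
    source: str
    line: int
    msg: str


def parse_line(text: str) -> Optional[LogRecord]:
    """Parse one log line; return None for headers and non-matching lines."""
    m = GLOG_LINE.match(text.strip())
    if m is None:
        return None
    return LogRecord(m["level"], int(m["month"]), int(m["day"]), m["time"],
                     int(m["thread"]), m["file"], int(m["line"]), m["msg"])


def last_record(lines: Iterable[str]) -> Optional[LogRecord]:
    """Return the last line that parses as a log record, if any."""
    last = None
    for text in lines:
        rec = parse_line(text)
        if rec is not None:
            last = rec
    return last


def ended_at(lines: Iterable[str], source: str, line_no: int) -> bool:
    """True if the final record was logged from source:line_no."""
    rec = last_record(lines)
    return rec is not None and rec.source == source and rec.line == line_no


SAMPLE = [
    "Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg",
    "I0209 02:12:48.520874 97278 state.cpp:54] Recovering state from '/mnt/data/mesos/meta'",
    "I0209 02:12:48.523680 97283 detector.cpp:481] A new leading master (UPID=master@10.230.95.110:7103) is detected",
]

if __name__ == "__main__":
    print(ended_at(SAMPLE, "detector.cpp", 481))
```

      A log that ends at detector.cpp:481 only matches the symptom described here; it does not by itself prove the process exited, so the check is best combined with the process status.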
      

      The .FATAL log file when the original transient ZK error occurred:

      Log file created at: 2016/02/05 17:21:37
      Running on machine: titusagent-main-i-7697a9c5
      Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
      F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
      

      The .ERROR log file:

      Log file created at: 2016/02/05 17:21:37
      Running on machine: titusagent-main-i-7697a9c5
      Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
      F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
      

      The .WARNING file had the same content.

          People

            Assignee: Unassigned
            Reporter: Sharma Podila (spodila@netflix.com)