Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version: 0.27.1
Description
I expected slaves to have to be gone for the re-registration timeout before they'd be lost to the cluster, not just to fail 5 health checks (failing the health checks indicates there is a network partition, not that the agent is gone for good and will never come back).
Is there some flag I'm missing here which I should be setting?
From my perspective, I expect frameworks not to get offers for resources on agents which haven't been contacted recently (the framework wouldn't be able to launch anything on such an agent). Once the re-registration period times out, the slave would be assumed completely lost and its tasks assumed terminated / able to be re-launched if desired. If an agent recovers between the health-check timeout and the re-registration timeout, it should be able to re-join the cluster with its running tasks kept running.
Note: some log lines have their start or tail truncated; the critical stuff should all be there.
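For context, a rough back-of-the-envelope of the two windows implied by the master flags below (assuming the agent is removed after max_slave_ping_timeouts consecutive missed pings, which is how I read the flag names):
health-check removal window = slave_ping_timeout * max_slave_ping_timeouts = 15secs * 5 = 75secs
re-registration grace period = slave_reregister_timeout = 10mins
So the agent gets removed roughly a minute after its last successful ping, long before the 10 minute re-registration window would have expired.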
Master flags
Feb 11 00:22:19 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:22:19.690507 1362 master.cpp:369] Flags at startup: --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="false" --authenticate_slaves="false" --authenticators="crammd5" --authorizers="local" --cluster="cody-cm52sd-2" --framework_sorter="drf" --help="false" --hostname_lookup="false" --initialize_driver_logging="true" --ip_discovery_command="/opt/mesosphere/bin/detect_ip" --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --port="5050" --quiet="false" --quorum="1" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="5secs" --registry_strict="false" --roles="slave_public" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/opt/mesosphere/packages/mesos--4dd59ec6bde2052f6f2a0a0da415b6c92c3c418a/share/mesos/webui" --weights="slave_public=1" --work_dir="/var/lib/mesos/master" --zk="zk://127.0.0.1:2181/mesos" --zk_session_timeout="10secs"
Slave flags
Feb 11 00:34:13 ip-10-0-0-52.us-west-2.compute.internal mesos-slave[3914]: I0211 00:34:13.334395 3914 slave.cpp:192] Flags at startup: --appc_store_dir="/tmp/mesos/store/appc" --authenticatee="crammd5" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="docker,mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_auth_server="auth.docker.io" --docker_auth_server_port="443" --docker_kill_orphans="true" --docker_local_archives_dir="/tmp/mesos/images/docker" --docker_puller="local" --docker_puller_timeout="60" --docker_registry="registry-1.docker.io" --docker_registry_port="443" --docker_remove_delay="1hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" --enforce_container_disk_quota="false" --executor_environment_variables="{"LD_LIBRARY_PATH":"\/opt\/mesosphere\/lib","PATH":"\/usr\/bin:\/bin","SASL_PATH":"\/opt\/mesosphere\/lib\/sasl2","SHELL":"\/usr\/bin\/bash"}" --executor_registration_timeout="5mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="2days" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname_lookup="false" --image_provisioner_backend="copy" --initialize_driver_logging="true" --ip_discovery_command="/opt/mesosphere/bin/detect_ip" --isolation="cgroups/cpu,cgroups/mem" --launcher_dir="/opt/mesosphere/packages/mesos--4dd59ec6bde2052f6f2a0a0da415b6c92c3c418a/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://leader.mesos:2181/mesos" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --resources="ports:[1025-2180,2182-3887,3889-5049,5052-8079,8082-8180,8182-32000]" --re Feb 11 00:34:13 ip-10-0-0-52.us-west-2.compute.internal mesos-slave[3914]: vocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" --slave_subsystems="cpu,memory" --strict="true" --switch_user="true" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/var/lib/mesos/slave"
Restarting the slave
Feb 11 00:32:44 ip-10-0-0-52.us-west-2.compute.internal mesos-slave[3261]: W0211 00:32:40.981289 3261 logging.cpp:81] RAW: Received signal SIGTERM from process 1 of user 0; exiting Feb 11 00:32:44 ip-10-0-0-52.us-west-2.compute.internal systemd[1]: Stopping Mesos Slave... Feb 11 00:32:44 ip-10-0-0-52.us-west-2.compute.internal systemd[1]: Stopped Mesos Slave. Feb 11 00:34:02 ip-10-0-0-52.us-west-2.compute.internal systemd[1]: Starting Mesos Slave... Feb 11 00:34:02 ip-10-0-0-52.us-west-2.compute.internal ping[3534]: PING leader.mesos (10.0.4.187) 56(84) bytes of data. Feb 11 00:34:02 ip-10-0-0-52.us-west-2.compute.internal ping[3534]: 64 bytes from ip-10-0-4-187.us-west-2.compute.internal (10.0.4.187): icmp_seq=1 ttl=64 time=0.314 ms Feb 11 00:34:02 ip-10-0-0-52.us-west-2.compute.internal ping[3534]: --- leader.mesos ping statistics --- Feb 11 00:34:02 ip-10-0-0-52.us-west-2.compute.internal ping[3534]: 1 packets transmitted, 1 received, 0% packet loss, time 0ms Feb 11 00:34:02 ip-10-0-0-52.us-west-2.compute.internal ping[3534]: rtt min/avg/max/mdev = 0.314/0.314/0.314/0.000 ms Feb 11 00:34:02 ip-10-0-0-52.us-west-2.compute.internal systemd[1]: Started Mesos Slave. Feb 11 00:34:02 ip-10-0-0-52.us-west-2.compute.internal mesos-slave[3536]: I0211 00:34:02.256242 3536 logging.cpp:172] INFO level logging started!
The slave detects the new master and gets shut down for re-registering after removal
Feb 11 00:34:04 ip-10-0-0-52.us-west-2.compute.internal mesos-slave[3536]: I0211 00:34:04.705356 3546 slave.cpp:729] New master detected at master@10.0.4.187:5050 Feb 11 00:34:04 ip-10-0-0-52.us-west-2.compute.internal mesos-slave[3536]: I0211 00:34:04.705366 3539 status_update_manager.cpp:176] Pausing sending status updates Feb 11 00:34:04 ip-10-0-0-52.us-west-2.compute.internal mesos-slave[3536]: I0211 00:34:04.705550 3546 slave.cpp:754] No credentials provided. Attempting to register without authentication Feb 11 00:34:04 ip-10-0-0-52.us-west-2.compute.internal mesos-slave[3536]: I0211 00:34:04.705597 3546 slave.cpp:765] Detecting new master Feb 11 00:34:05 ip-10-0-0-52.us-west-2.compute.internal mesos-slave[3536]: I0211 00:34:05.624832 3544 slave.cpp:643] Slave asked to shut down by master@10.0.4.187:5050 because 'Slave attempted to re-register after removal' Feb 11 00:34:05 ip-10-0-0-52.us-west-2.compute.internal mesos-slave[3536]: I0211 00:34:05.624908 3544 slave.cpp:2009] Asked to shut down framework 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-0000 by master@10.0.4.187:5050 Feb 11 00:34:05 ip-10-0-0-52.us-west-2.compute.internal mesos-slave[3536]: I0211 00:34:05.624939 3544 slave.cpp:2034] Shutting down framework 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-0000
Master logs (the flags snippet is identical to the one shown above under Master flags)
Master initially registering the slave
Feb 11 00:23:01 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:23:01.968310 1373 master.cpp:3859] Registering slave at slave(1)@10.0.0.52:5051 (10.0.0.52) with id 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0
Feb 11 00:23:01 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:23:01.976769 1374 log.cpp:704] Attempting to truncate the log to 3 Feb 11 00:23:01 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:23:01.976820 1370 coordinator.cpp:350] Coordinator attempting to write TRUNCATE action at position 4 Feb 11 00:23:01 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:23:01.977002 1369 replica.cpp:540] Replica received write request for position 4 from (13)@10.0.4.187:5050 Feb 11 00:23:01 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:23:01.977157 1374 master.cpp:3927] Registered slave 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1)@10.0.0.52:5051 (10.0.0.52) with ports(*):[1025- Feb 11 00:23:01 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:23:01.977207 1368 hierarchical.cpp:344] Added slave 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 (10.0.0.52) with ports(*):[1025-2180, 2182-3887, 3889-5049, Feb 11 00:23:01 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:23:01.977552 1368 master.cpp:4979] Sending 1 offers to framework 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-0000 (marathon) at scheduler-8174298d-3ef3-4683-9 Feb 11 00:23:01 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:23:01.978520 1369 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 1.485099ms Feb 11 00:23:01 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:23:01.978559 1369 replica.cpp:715] Persisted action at 4 Feb 11 00:23:01 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:23:01.978710 1369 replica.cpp:694] Replica received learned notice for position 4 from @0.0.0.0:0 Feb 11 00:23:01 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:23:01.979212 1372 master.cpp:4269] Received update of slave 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1)@10.0.0.52:5051 (10.0.0.52) with total o Feb 11 00:23:01 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:23:01.979322 1372 hierarchical.cpp:400] Slave 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 (10.0.0.52) updated with oversubscribed resources (total: ports( Feb 11 00:23:01 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:23:01.980257 1369 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took 1.514614ms
Lose the slave
Feb 11 00:32:12 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:32:12.578547 1368 master.cpp:1083] Slave 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1)@10.0.0.52:5051 (10.0.0.52) disconnected Feb 11 00:32:12 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:32:12.578627 1368 master.cpp:2531] Disconnecting slave 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1)@10.0.0.52:5051 (10.0.0.52) Feb 11 00:32:12 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:32:12.578673 1368 master.cpp:2550] Deactivating slave 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1)@10.0.0.52:5051 (10.0.0.52) Feb 11 00:32:12 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:32:12.578764 1374 hierarchical.cpp:429] Slave 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 deactivated
Slave came back (earlier restart, only gone for seconds)
Feb 11 00:32:15 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:32:15.965806 1370 master.cpp:4019] Re-registering slave 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1)@10.0.0.52:5051 (10.0.0.52) Feb 11 00:32:15 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:32:15.966354 1373 hierarchical.cpp:417] Slave 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 reactivated Feb 11 00:32:15 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:32:15.966419 1370 master.cpp:4207] Sending updated checkpointed resources to slave 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1)@10.0.0.52:5051 Feb 11 00:32:15 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:32:15.967167 1371 master.cpp:4269] Received update of slave 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1)@10.0.0.52:5051 (10.0.0.52) with total o Feb 11 00:32:15 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:32:15.967296 1371 hierarchical.cpp:400] Slave 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 (10.0.0.52) updated with oversubscribed resources (total: ports(
The shutdown of the slave (as seen by the master)
Feb 11 00:32:44 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:32:44.142541 1371 http.cpp:334] HTTP GET for /master/state-summary from 10.0.4.187:44274 with User-Agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/5 Feb 11 00:32:44 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:32:44.150949 1368 master.cpp:1083] Slave 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1)@10.0.0.52:5051 (10.0.0.52) disconnected Feb 11 00:32:44 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:32:44.151002 1368 master.cpp:2531] Disconnecting slave 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1)@10.0.0.52:5051 (10.0.0.52) Feb 11 00:32:44 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:32:44.151048 1368 master.cpp:2550] Deactivating slave 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1)@10.0.0.52:5051 (10.0.0.52) Feb 11 00:32:44 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:32:44.151113 1368 hierarchical.cpp:429] Slave 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 deactivated
Slave lost (the critical part). The slave should just be marked lost at the health-check timeout, not shut down.
Feb 11 00:33:47 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:33:47.009037 1372 master.cpp:236] Shutting down slave 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 due to health check timeout Feb 11 00:33:47 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: W0211 00:33:47.009124 1372 master.cpp:4581] Shutting down slave 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1)@10.0.0.52:5051 (10.0.0.52) with message 'hea Feb 11 00:33:47 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:33:47.009181 1372 master.cpp:5846] Removing slave 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1)@10.0.0.52:5051 (10.0.0.52): health check timed ou Feb 11 00:33:47 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:33:47.009297 1372 master.cpp:6066] Updating the state of task test-app-2.4057f89f-d056-11e5-8aeb-0242d6f35f4b of framework 0c9ebb3f-23f8-4fce-b276-9ebc Feb 11 00:33:47 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:33:47.009353 1369 hierarchical.cpp:373] Removed slave 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0
Tasks marked as slave-lost
2] Removing task test-app.4076cb59-d056-11e5-8aeb-0242d6f35f4b with resources cpus(*):0.1; mem(*):16; ports(*):[2791-2791] of framework 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-0000 on slave 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1) 6] Updating the state of task test-app.40756bc5-d056-11e5-8aeb-0242d6f35f4b of framework 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-0000 (latest state: TASK_LOST, status update state: TASK_LOST) 2] Removing task test-app.40756bc5-d056-11e5-8aeb-0242d6f35f4b with resources cpus(*):0.1; mem(*):16; ports(*):[6724-6724] of framework 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-0000 on slave 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1) 6] Updating the state of task test-app-2.40765628-d056-11e5-8aeb-0242d6f35f4b of framework 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-0000 (latest state: TASK_LOST, status update state: TASK_LOST)
Slave gone gone
Feb 11 00:33:47 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:33:47.021023 1374 master.cpp:5965] Removed slave 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 (10.0.0.52): health check timed out
Master refuses to accept the slave back
Feb 11 00:34:05 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: W0211 00:34:05.614985 1368 master.cpp:3997] Slave 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1)@10.0.0.52:5051 (10.0.0.52) attempted to re-register after
Slave comes up with a new ID and registers properly
Feb 11 00:34:13 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:34:13.757870 1368 master.cpp:3859] Registering slave at slave(1)@10.0.0.52:5051 (10.0.0.52) with id 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S1 Feb 11 00:34:13 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:34:13.758057 1372 registrar.cpp:441] Applied 1 operations in 23020ns; attempting to update the 'registry' Feb 11 00:34:13 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:34:13.758257 1368 log.cpp:685] Attempting to append 367 bytes to the log Feb 11 00:34:13 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:34:13.758316 1368 coordinator.cpp:350] Coordinator attempting to write APPEND action at position 7 Feb 11 00:34:13 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:34:13.758450 1368 replica.cpp:540] Replica received write request for position 7 from (75)@10.0.4.187:5050 Feb 11 00:34:13 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:34:13.759891 1368 leveldb.cpp:343] Persisting action (386 bytes) to leveldb took 1.411937ms Feb 11 00:34:13 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:34:13.759927 1368 replica.cpp:715] Persisted action at 7 Feb 11 00:34:13 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:34:13.760097 1368 replica.cpp:694] Replica received learned notice for position 7 from @0.0.0.0:0 Feb 11 00:34:13 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:34:13.763203 1368 leveldb.cpp:343] Persisting action (388 bytes) to leveldb took 3.072892ms Feb 11 00:34:13 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:34:13.763236 1368 replica.cpp:715] Persisted action at 7 Feb 11 00:34:13 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:34:13.763250 1368 replica.cpp:700] Replica learned APPEND action at position 7
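As an aside, the agent ID flip from S0 to S1 should also be visible through the state-summary endpoint the UI is polling in the master logs above, e.g. (host/port taken from the logs; I'm assuming the agents appear under a "slaves" key, as I recall from that endpoint's output):
curl -s http://10.0.4.187:5050/master/state-summary | python -m json.tool | grep '"id"'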
Issue Links
- duplicates MESOS-4049: Allow user to control behavior of partitioned agents/tasks (Resolved)