Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
None
-
None
-
None
Description
The following scenario was observed in a production environment.
1. The slave terminated due to a machine reboot. The last message from the slave prior to dying:
I0419 20:04:47.483470 38030 cgroups_isolation_module.cpp:774] Updated 'memory.soft_limit_in_bytes' to 134217728 for <executor> of framework <framework>
2. The master then deactivated the slave due to a health check failure:
W0419 20:06:15.952864 28740 master.cpp:1172] Removing slave 201303271650-1944527370-5050-24955-2642 at <slave> because it has been deactivated
I0419 20:06:15.956354 28740 master.cpp:1181] Master now considering a slave at <slave> as inactive
3. The slave came back up and registered 4 minutes later:
I0419 20:10:19.693114 8181 main.cpp:131] Starting Mesos slave
I0419 20:10:31.314707 8201 slave.cpp:398] Registered with master; given slave ID 201304011727-2230002186-5050-28738-377
4. The master added the slave correctly (the slave had some registration retries):
I0419 20:10:22.171288 28742 master.cpp:924] Attempting to register slave on <slave> at slave(1)@<ip>
I0419 20:10:22.171336 28742 master.cpp:1159] Master now considering a slave at <slave> as active
I0419 20:10:22.171361 28742 master.cpp:1735] Adding slave 201304011727-2230002186-5050-28738-377 at <slave> with cpus=22; mem=21962; ports=[31000-32000]; disk=400000
I0419 20:10:22.171661 28739 hierarchical_allocator_process.hpp:393] Added slave 201304011727-2230002186-5050-28738-377 (<slave>) with cpus=22; mem=21962; ports=[31000-32000]; disk=400000 (and cpus=22; mem=21962; ports=[31000-32000]; disk=400000 available)
...
I0419 20:10:23.171977 28754 master.cpp:913] Slave 201304011727-2230002186-5050-28738-377 (<slave>) already registered, resending acknowledgement
...
I0419 20:10:24.172746 28752 master.cpp:913] Slave 201304011727-2230002186-5050-28738-377 (<slave>) already registered, resending acknowledgement
...
I0419 20:10:25.173225 28747 master.cpp:913] Slave 201304011727-2230002186-5050-28738-377 (<slave>) already registered, resending acknowledgement
...
I0419 20:10:26.174988 28740 master.cpp:913] Slave 201304011727-2230002186-5050-28738-377 (<slave>) already registered, resending acknowledgement
...
I0419 20:10:27.176584 28739 master.cpp:913] Slave 201304011727-2230002186-5050-28738-377 (<slave>) already registered, resending acknowledgement
...
I0419 20:10:28.178421 28754 master.cpp:913] Slave 201304011727-2230002186-5050-28738-377 (<slave>) already registered, resending acknowledgement
...
I0419 20:10:29.179831 28740 master.cpp:913] Slave 201304011727-2230002186-5050-28738-377 (<slave>) already registered, resending acknowledgement
...
I0419 20:10:31.312396 28750 master.cpp:913] Slave 201304011727-2230002186-5050-28738-377 (<slave>) already registered, resending acknowledgement
5. Then, ~6 minutes after the original slave went down, the master received an exited event from the slave PID. It's highly likely this was a delayed signal for the old slave:
I0419 20:10:31.313608 28750 master.cpp:518] Slave 201304011727-2230002186-5050-28738-377(<slave>) disconnected
This stems from the fact that we do not know the SlaveID for the exited event, so we incorrectly remove the slave. One fix for this is to tell the slave to shut down when we remove it.
Without the proposed fix, the slave remained in a state of thinking it is registered, but the master disagrees.