Details
Bug
Status: Accepted
Critical
Resolution: Unresolved
1.9.0
None
Description
We have observed the following in our environment.
F1003 17:35:30.742681 58334 master.cpp:12528] Check failed: slave != nullptr f664c4a9-d1ca-4cd0-88e4-0a6acf20e629-S218
*** Check failure stack trace: ***
    @ 0x7fd36ca9cf4d google::LogMessage::Fail()
    @ 0x7fd36ca9f13d google::LogMessage::SendToLog()
    @ 0x7fd36ca9ca87 google::LogMessage::Flush()
    @ 0x7fd36ca9fbc9 google::LogMessageFatal::~LogMessageFatal()
    @ 0x7fd36b5ae3bc mesos::internal::master::Master::removeOperation()
    @ 0x7fd36b5b3446 mesos::internal::master::Master::updateOperationStatus()
This follows registration of an agent that has changed its agent ID due to losing its local state.
The failing check is in Master::removeOperation.
The masters would enter a crash loop unless the operation checkpoint state (i.e., resources_and_operations.state) on the offending agent is deleted.
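The failure mode described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: ToyMaster, Agent, Operation and their members are hypothetical stand-ins, not Mesos types, and the "return false" branch is just one possible way to tolerate an unknown agent ID, not necessarily the fix that should be applied to Master::removeOperation. In the log above, the corresponding condition (slave != nullptr) is a fatal check, which is why the master aborts.

```cpp
#include <map>
#include <string>

// Hypothetical, simplified stand-ins; these are NOT the Mesos types.
struct Agent {
  std::string id;
  int pendingOperations;
};

struct Operation {
  std::string uuid;
  std::string agentId;  // ID of the agent the operation was checkpointed for.
};

class ToyMaster {
public:
  void registerAgent(const std::string& id) {
    agents_.emplace(id, Agent{id, 0});
  }

  // Returns the registered agent, or nullptr if the ID is unknown.
  Agent* findAgent(const std::string& id) {
    auto it = agents_.find(id);
    return it == agents_.end() ? nullptr : &it->second;
  }

  // Tracks an operation on a registered agent; false if the agent is unknown.
  bool addOperation(const Operation& op) {
    Agent* agent = findAgent(op.agentId);
    if (agent == nullptr) {
      return false;
    }
    ++agent->pendingOperations;
    return true;
  }

  // An operation that still carries the agent's OLD ID (for example, one
  // recovered from a stale checkpoint) finds no registered agent here.
  // This sketch reports that case instead of aborting.
  bool removeOperation(const Operation& op) {
    Agent* agent = findAgent(op.agentId);
    if (agent == nullptr) {
      return false;  // Assumption: tolerate the stale ID rather than crash.
    }
    if (agent->pendingOperations > 0) {
      --agent->pendingOperations;
    }
    return true;
  }

private:
  std::map<std::string, Agent> agents_;
};
```

In this model, an agent that re-registers under a new ID is known to the master only by that new ID, so removing an operation that still references the old ID returns false. With a fatal check in place of that branch, every master that processes the same status update would abort, which matches the crash loop reported here.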
Even though we try to minimize the cases where an agent would lose its state, it can still happen when the "latest" symlink is removed, either by an operator or automatically in certain cases.