Description
The following happened after a master failover.
During slave reregistration, the new leading master added the slave's tasks and notified all of its subscribers about them, which triggered the following check failure:
F0222 15:53:44.440387 2805 master.cpp:11190] Check failed: 'framework' Must be non NULL
*** Check failure stack trace: ***
@ 0x7f1357be521d google::LogMessage::Fail()
@ 0x7f1357be704d google::LogMessage::SendToLog()
@ 0x7f1357be4e0c google::LogMessage::Flush()
@ 0x7f1357be7949 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f1356c80e2d google::CheckNotNull<>()
@ 0x7f1356ce2666 mesos::internal::master::Master::Subscribers::send()
@ 0x7f1356cece83 mesos::internal::master::Slave::addTask()
@ 0x7f1356cf3206 mesos::internal::master::Slave::Slave()
@ 0x7f1356cf5b90 mesos::internal::master::Master::__reregisterSlave()
@ 0x7f1356d02cf8 mesos::internal::master::Master::_reregisterSlave()
@ 0x7f1357b43761 process::ProcessBase::consume()
@ 0x7f1357b5248c process::ProcessManager::resume()
@ 0x7f1357b579f6 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
@ 0x7f1354e6c230 (unknown)
@ 0x7f135468ae25 start_thread
@ 0x7f13543b834d __clone
This happened because the master looks up the framework info when it sends the notification:
https://github.com/apache/mesos/blob/1.5.x/src/master/master.cpp#L11190
But it only adds the framework after that:
https://github.com/apache/mesos/blob/1.5.x/src/master/master.cpp#L6963
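The ordering problem can be illustrated with a small standalone model. This is only a sketch: `ToyMaster`, `Task`, `notifyTaskAdded`, `reregisterTasksFirst` and `reregisterFrameworkFirst` are hypothetical names invented for the example, not Mesos code, and the reordering shown is one possible way to avoid the failed lookup, not necessarily the fix that was applied.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; these are not the real Mesos types.
struct Task {
  std::string taskId;
  std::string frameworkId;
};

struct ToyMaster {
  // Frameworks currently known to the master, keyed by framework ID.
  std::map<std::string, std::string> frameworks;
  // Events delivered to subscribers, recorded here for inspection.
  std::vector<std::string> events;

  // Notifies subscribers that a task was added. Returns false when the
  // task's framework is unknown; in the report, the equivalent lookup
  // failed a non-NULL check and aborted the master.
  bool notifyTaskAdded(const Task& task) {
    auto it = frameworks.find(task.frameworkId);
    if (it == frameworks.end()) {
      return false;
    }
    events.push_back("TASK_ADDED " + task.taskId + " (" + it->second + ")");
    return true;
  }
};

// Order described in the report: the task is added (and subscribers are
// notified) first, and the framework is added only afterwards.
bool reregisterTasksFirst(ToyMaster& master, const Task& task) {
  bool notified = master.notifyTaskAdded(task);
  master.frameworks[task.frameworkId] = "recovered-framework";
  return notified;
}

// One possible reordering (an assumption, not the confirmed fix):
// add the framework before adding tasks.
bool reregisterFrameworkFirst(ToyMaster& master, const Task& task) {
  master.frameworks[task.frameworkId] = "recovered-framework";
  // The lookup inside notifyTaskAdded() can no longer miss.
  assert(master.frameworks.count(task.frameworkId) == 1);
  return master.notifyTaskAdded(task);
}
```

In this model, `reregisterTasksFirst` on a fresh `ToyMaster` returns false and records no event, while `reregisterFrameworkFirst` delivers the notification. Making the notification path tolerate a missing framework would be another option.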
Issue Links
- is duplicated by MESOS-8602 Subscribers::send incorrectly assumes frameworks are registered (Resolved)