Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 1.0.4, 1.1.1
- None
- Sprint: Mesosphere Sprint 59
Description
We just saw the following in a test cluster:
W0707 06:30:10.172188 9413 master.cpp:3939] Ignoring accept of inverse offer abd00119-7353-4990-9cc5-0d6bd69a91e7-O737973 since it is no longer valid
F0707 06:30:10.172236 9413 master.cpp:3943] CHECK_SOME(slaveId): is NONE
*** Check failure stack trace: ***
    @ 0x7f425b1521ed google::LogMessage::Fail()
    @ 0x7f425b15401d google::LogMessage::SendToLog()
    @ 0x7f425b151ddc google::LogMessage::Flush()
    @ 0x7f425b154919 google::LogMessageFatal::~LogMessageFatal()
    @ 0x7f425a564ce9 _CheckFatal::~_CheckFatal()
    @ 0x7f425a76a69d mesos::internal::master::Master::acceptInverseOffers()
    @ 0x7f425a6e360e mesos::internal::master::Master::Http::scheduler()
    @ 0x7f425a737347 _ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestERK6OptionISsEEZN5mesos8internal6master6Master10initializeEvEUlS7_SB_E1_E9_M_invokeERKSt9_Any_dataS7_SB_
    @ 0x7f425b0d7413 _ZZZN7process11ProcessBase5visitERKNS_9HttpEventEENKUlRKNS_6FutureI6OptionINS_4http14authentication20AuthenticationResultEEEEE0_clESC_ENKUlRKNS4_IbEEE1_clESG_
    @ 0x7f425b0e1091 process::ProcessManager::resume()
    @ 0x7f425b0e1397 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
    @ 0x7f4259770d73 (unknown)
    @ 0x7f4258f6d52c (unknown)
    @ 0x7f4258cab1dd (unknown)
This seems to happen when we try to accept an invalid inverse offer and incorrectly assume that we can always extract an agent ID:
Option<SlaveID> slaveId;

// Update each inverse offer in the allocator with the accept and
// filter.
foreach (const OfferID& offerId, accept.inverse_offer_ids()) {
  InverseOffer* inverseOffer = getInverseOffer(offerId);
  if (inverseOffer != nullptr) {
    CHECK(inverseOffer->has_slave_id());
    slaveId = inverseOffer->slave_id();

    mesos::allocator::InverseOfferStatus status;
    status.set_status(mesos::allocator::InverseOfferStatus::ACCEPT);
    status.mutable_framework_id()->CopyFrom(inverseOffer->framework_id());
    status.mutable_timestamp()->CopyFrom(protobuf::getCurrentTime());

    allocator->updateInverseOffer(
        inverseOffer->slave_id(),
        inverseOffer->framework_id(),
        UnavailableResources{
            inverseOffer->resources(),
            inverseOffer->unavailability()},
        status,
        accept.filters());

    removeInverseOffer(inverseOffer);
    continue;
  }

  // If the offer was not in our inverse offer set, then this
  // offer is no longer valid.
  LOG(WARNING) << "Ignoring accept of inverse offer " << offerId
               << " since it is no longer valid";
}

CHECK_SOME(slaveId);
If none of the offer IDs in the accept call is still valid, slaveId is never set to a value, so the CHECK_SOME after the loop fails and aborts the master.
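For illustration, here is a minimal standalone sketch of one way to tolerate the case where every offer ID is stale. It does not use the Mesos code base and is not necessarily how the duplicate issue was fixed: std::optional (C++17) stands in for Option<SlaveID>, and the InverseOffer struct, the outstanding map, and the free function acceptInverseOffers below are hypothetical stand-ins for the master's bookkeeping.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for an outstanding inverse offer; only the
// agent ID matters for this sketch.
struct InverseOffer {
  std::string slaveId;
};

// Processes an accept call. Returns the agent ID of the last
// still-valid inverse offer, or std::nullopt if every offer ID in the
// call was stale. `outstanding` is an assumed map from offer ID to
// the inverse offer the master still tracks.
std::optional<std::string> acceptInverseOffers(
    std::map<std::string, InverseOffer>& outstanding,
    const std::vector<std::string>& offerIds)
{
  std::optional<std::string> slaveId;

  for (const std::string& offerId : offerIds) {
    auto it = outstanding.find(offerId);
    if (it == outstanding.end()) {
      // Stale offer: skip it (the real code logs a warning here).
      continue;
    }

    // Mirrors CHECK(inverseOffer->has_slave_id()) in the original.
    assert(!it->second.slaveId.empty());
    slaveId = it->second.slaveId;
    outstanding.erase(it);
  }

  // Instead of CHECK_SOME(slaveId), hand the "none valid" case back
  // to the caller, which can skip any per-agent follow-up work.
  return slaveId;
}
```

With this shape, a call that contains only stale offer IDs yields an empty optional rather than a fatal check failure, and the caller decides whether anything remains to be done for an agent.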
I see this issue in 1.0.4 and 1.1.1; the problematic code seems to be gone in 1.1.2.
Issue Links
- duplicates MESOS-7119 Mesos master crash while accepting inverse offer. (Resolved)