Mesos / MESOS-7766

Segfault when trying to accept inverse offer with unknown offerId


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.0.4, 1.1.1
    • Fix Version/s: None
    • Component/s: master
    • Sprint: Mesosphere Sprint 59

    Description

      We just saw the following in a test cluster:

      W0707 06:30:10.172188  9413 master.cpp:3939] Ignoring accept of inverse offer abd00119-7353-4990-9cc5-0d6bd69a91e7-O737973 since it is no longer valid
      F0707 06:30:10.172236  9413 master.cpp:3943] CHECK_SOME(slaveId): is NONE
      *** Check failure stack trace: ***
          @     0x7f425b1521ed  google::LogMessage::Fail()
          @     0x7f425b15401d  google::LogMessage::SendToLog()
          @     0x7f425b151ddc  google::LogMessage::Flush()
          @     0x7f425b154919  google::LogMessageFatal::~LogMessageFatal()
          @     0x7f425a564ce9  _CheckFatal::~_CheckFatal()
          @     0x7f425a76a69d  mesos::internal::master::Master::acceptInverseOffers()
          @     0x7f425a6e360e  mesos::internal::master::Master::Http::scheduler()
          @     0x7f425a737347  _ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestERK6OptionISsEEZN5mesos8internal6master6Master10initializeEvEUlS7_SB_E1_E9_M_invokeERKSt9_Any_dataS7_SB_
          @     0x7f425b0d7413  _ZZZN7process11ProcessBase5visitERKNS_9HttpEventEENKUlRKNS_6FutureI6OptionINS_4http14authentication20AuthenticationResultEEEEE0_clESC_ENKUlRKNS4_IbEEE1_clESG_
          @     0x7f425b0e1091  process::ProcessManager::resume()
          @     0x7f425b0e1397  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
          @     0x7f4259770d73  (unknown)
          @     0x7f4258f6d52c  (unknown)
          @     0x7f4258cab1dd  (unknown)
      

      This seems to happen when we try to accept an invalid inverse offer: the code incorrectly assumes that it can always extract an agent ID:

      Option<SlaveID> slaveId;
      
      // Update each inverse offer in the allocator with the accept and
      // filter.
      foreach (const OfferID& offerId, accept.inverse_offer_ids()) {
        InverseOffer* inverseOffer = getInverseOffer(offerId);
        if (inverseOffer != nullptr) {
          CHECK(inverseOffer->has_slave_id());
          slaveId = inverseOffer->slave_id();
      
          mesos::allocator::InverseOfferStatus status;
          status.set_status(mesos::allocator::InverseOfferStatus::ACCEPT);
          status.mutable_framework_id()->CopyFrom(inverseOffer->framework_id());
          status.mutable_timestamp()->CopyFrom(protobuf::getCurrentTime());
      
          allocator->updateInverseOffer(
              inverseOffer->slave_id(),
              inverseOffer->framework_id(),
              UnavailableResources{
                  inverseOffer->resources(),
                  inverseOffer->unavailability()},
              status,
              accept.filters());
      
          removeInverseOffer(inverseOffer);
          continue;
        }
      
        // If the offer was not in our inverse offer set, then this
        // offer is no longer valid.
        LOG(WARNING) << "Ignoring accept of inverse offer " << offerId
                     << " since it is no longer valid";
      }
      
      CHECK_SOME(slaveId);
      

      If none of the offer IDs in the call refers to a known inverse offer, slaveId is never set, so the CHECK_SOME fails and aborts the master.
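
      To illustrate the failure mode and one possible guard, here is a minimal standalone sketch. It is not the Mesos code and not necessarily how the problem was resolved upstream. `std::optional` (C++17) stands in for stout's `Option`, and the `InverseOffers` map and the free function `acceptInverseOffers` are hypothetical simplifications of the master's state. The function returns an empty optional when no offer ID matched, leaving it to the caller to reject the call rather than asserting unconditionally.

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for the master's table of outstanding inverse
// offers (offer ID -> agent ID). Not the real Mesos data structure.
using InverseOffers = std::map<std::string, std::string>;

// Accepts every known inverse offer listed in `offerIds`, removing it
// from `outstanding`, and returns the agent ID of the last one accepted.
// Returns std::nullopt when no ID matched, so the caller can handle the
// "nothing valid" case explicitly instead of tripping a fatal check.
std::optional<std::string> acceptInverseOffers(
    InverseOffers& outstanding,
    const std::vector<std::string>& offerIds)
{
  std::optional<std::string> agentId;

  for (const std::string& offerId : offerIds) {
    auto it = outstanding.find(offerId);
    if (it == outstanding.end()) {
      // Unknown or no longer valid offer: skip it, as the loop in the
      // excerpt above does after logging a warning.
      continue;
    }

    agentId = it->second;
    outstanding.erase(it);
  }

  // May legitimately be empty; an unconditional check here would abort.
  return agentId;
}
```

      With a single unknown ID such as `O737973`, the result is empty and the table is left untouched; a caller would then return an error or ignore the call rather than dereference the value.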

      I see this issue in 1.0.4 and 1.1.1; the problematic code seems to be gone in 1.1.2.

          People

            alexr Alex R
            bbannier Benjamin Bannier
            Votes: 0
            Watchers: 2

              Agile

                Completed Sprint:
                Mesosphere Sprint 59 ended 21/Jul/17