Mesos / MESOS-4831

Master sometimes sends two inverse offers after the agent goes into maintenance.

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.27.0
    • Fix Version/s: 0.28.0
    • None

    Description

      Showed up on ASF CI for MasterMaintenanceTest.PendingUnavailabilityTest

      https://builds.apache.org/job/Mesos/1748/COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu:14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)/consoleFull

      I0229 11:08:57.027559   668 hierarchical.cpp:1437] No resources available to allocate!
      I0229 11:08:57.027745   668 hierarchical.cpp:1150] Performed allocation for slave fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-S0 in 272747ns
      I0229 11:08:57.027757   675 master.cpp:5369] Sending 1 offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-0000 (default)
      I0229 11:08:57.028586   675 master.cpp:5459] Sending 1 inverse offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-0000 (default)
      I0229 11:08:57.029039   675 master.cpp:5459] Sending 1 inverse offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-0000 (default)
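
One way to spot the duplicate in a long CI log is to total the inverse offers the master reports sending to each framework. The following is only a sketch: `countInverseOffers` is a hypothetical helper, not part of Mesos, and it assumes the log line shape shown above ("Sending <N> inverse offers to framework <id>").

```cpp
#include <cstddef>
#include <map>
#include <regex>
#include <sstream>
#include <string>

// Hypothetical helper (not a Mesos API): totals the inverse offers that the
// master logged as sent, keyed by framework ID. Lines reporting regular
// offers ("Sending <N> offers to framework ...") do not match and are skipped.
std::map<std::string, std::size_t> countInverseOffers(const std::string& log)
{
  // Assumed line shape, taken from the excerpt above.
  static const std::regex pattern(
      R"(Sending (\d+) inverse offers to framework (\S+))");

  std::map<std::string, std::size_t> totals;
  std::istringstream lines(log);
  std::string line;

  while (std::getline(lines, line)) {
    std::smatch match;
    if (std::regex_search(line, match, pattern)) {
      totals[match[2].str()] += std::stoul(match[1].str());
    }
  }

  return totals;
}
```

Fed the excerpt above, the total for the framework ending in -0000 comes out as 2, where the test only expects a single inverse offer.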
      

      The expected workflow for this test is:

      • The framework receives offers from the master.
      • The maintenance schedule is updated to include the agent.
      • The current offer is rescinded.
      • A new offer is received from the master with unavailability set.
      • After the agent goes into maintenance, an inverse offer is sent.
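
Seen from the framework's side, the steps above amount to an ordered list of events. The sketch below models that order with a hypothetical `Event` enum (these names are illustrative and do not come from Mesos) and contrasts an exact match with a prefix match. The distinction matters when an extra event arrives at the end of the sequence.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical event kinds, named after the workflow steps that the
// framework can observe. The schedule update itself is not an event the
// framework receives, so it has no entry here.
enum class Event
{
  OFFER,
  RESCIND,
  OFFER_WITH_UNAVAILABILITY,
  INVERSE_OFFER
};

// The framework-visible events, in the order the workflow expects them.
const std::vector<Event>& expectedWorkflow()
{
  static const std::vector<Event> expected = {
      Event::OFFER,
      Event::RESCIND,
      Event::OFFER_WITH_UNAVAILABILITY,
      Event::INVERSE_OFFER};

  return expected;
}

// Strict check: the observed events are exactly the expected ones.
bool matchesExactly(const std::vector<Event>& observed)
{
  return observed == expectedWorkflow();
}

// Lenient check: the observed events begin with the expected ones, and
// anything that arrives afterwards is ignored.
bool startsWithExpected(const std::vector<Event>& observed)
{
  const std::vector<Event>& expected = expectedWorkflow();

  return observed.size() >= expected.size() &&
         std::equal(expected.begin(), expected.end(), observed.begin());
}
```

A sequence that ends with two INVERSE_OFFER events passes the lenient check but fails the strict one.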

      However, the logs show the master sending two inverse offers. The test still passes because it only checks that the first inverse offer is present. The problem can also be reproduced with a modified version of the original test:

      // Test ensures that an offer will have an `unavailability` set if the
      // slave is scheduled to go down for maintenance.
      TEST_F(MasterMaintenanceTest, PendingUnavailabilityTest)
      {
        Try<PID<Master>> master = StartMaster();
        ASSERT_SOME(master);
      
        MockExecutor exec(DEFAULT_EXECUTOR_ID);
      
        Try<PID<Slave>> slave = StartSlave(&exec);
        ASSERT_SOME(slave);
      
        auto scheduler = std::make_shared<MockV1HTTPScheduler>();
      
        EXPECT_CALL(*scheduler, heartbeat(_))
          .WillRepeatedly(Return()); // Ignore heartbeats.
      
        Future<Nothing> connected;
        EXPECT_CALL(*scheduler, connected(_))
          .WillOnce(FutureSatisfy(&connected))
          .WillRepeatedly(Return()); // Ignore future invocations.
      
        scheduler::TestV1Mesos mesos(master.get(), ContentType::PROTOBUF, scheduler);
      
        AWAIT_READY(connected);
      
        Future<Event::Subscribed> subscribed;
        EXPECT_CALL(*scheduler, subscribed(_, _))
          .WillOnce(FutureArg<1>(&subscribed));
      
        Future<Event::Offers> normalOffers;
        Future<Event::Offers> unavailabilityOffers;
        Future<Event::Offers> inverseOffers;
        EXPECT_CALL(*scheduler, offers(_, _))
          .WillOnce(FutureArg<1>(&normalOffers))
          .WillOnce(FutureArg<1>(&unavailabilityOffers))
          .WillOnce(FutureArg<1>(&inverseOffers));
      
        // The original offers should be rescinded when the unavailability is changed.
        Future<Nothing> offerRescinded;
        EXPECT_CALL(*scheduler, rescind(_, _))
          .WillOnce(FutureSatisfy(&offerRescinded));
      
        {
          Call call;
          call.set_type(Call::SUBSCRIBE);
      
          Call::Subscribe* subscribe = call.mutable_subscribe();
          subscribe->mutable_framework_info()->CopyFrom(DEFAULT_V1_FRAMEWORK_INFO);
      
          mesos.send(call);
        }
      
        AWAIT_READY(subscribed);
      
        v1::FrameworkID frameworkId(subscribed->framework_id());
      
        AWAIT_READY(normalOffers);
        EXPECT_NE(0, normalOffers->offers().size());
      
        // Regular offers shouldn't have unavailability.
        foreach (const v1::Offer& offer, normalOffers->offers()) {
          EXPECT_FALSE(offer.has_unavailability());
        }
      
        // Schedule this slave for maintenance.
        MachineID machine;
        machine.set_hostname(maintenanceHostname);
        machine.set_ip(stringify(slave.get().address.ip));
      
        const Time start = Clock::now() + Seconds(60);
        const Duration duration = Seconds(120);
        const Unavailability unavailability = createUnavailability(start, duration);
      
        // Post a valid schedule with one machine.
        maintenance::Schedule schedule = createSchedule(
            {createWindow({machine}, unavailability)});
      
        // We have a few seconds between the first set of offers and the
        // next allocation of offers. This should be enough time to perform
        // a maintenance schedule update. This update will also trigger the
        // rescinding of offers from the scheduled slave.
        Future<Response> response = process::http::post(
            master.get(),
            "maintenance/schedule",
            headers,
            stringify(JSON::protobuf(schedule)));
      
        AWAIT_EXPECT_RESPONSE_STATUS_EQ(OK().status, response);
      
        // The original offers should be rescinded when the unavailability
        // is changed.
        AWAIT_READY(offerRescinded);
      
        AWAIT_READY(unavailabilityOffers);
        EXPECT_NE(0, unavailabilityOffers->offers().size());
      
        // Make sure the new offers have the unavailability set.
        foreach (const v1::Offer& offer, unavailabilityOffers->offers()) {
          EXPECT_TRUE(offer.has_unavailability());
          EXPECT_EQ(
              unavailability.start().nanoseconds(),
              offer.unavailability().start().nanoseconds());
      
          EXPECT_EQ(
              unavailability.duration().nanoseconds(),
              offer.unavailability().duration().nanoseconds());
        }
      
        // We also expect an inverse offer for the slave to go under
        // maintenance.
        AWAIT_READY(inverseOffers);
        EXPECT_NE(0, inverseOffers->inverse_offers().size());
      
        EXPECT_CALL(exec, shutdown(_))
          .Times(AtMost(1));
      
        EXPECT_CALL(*scheduler, disconnected(_))
          .Times(AtMost(1));
      
        Shutdown(); // Must shutdown before 'containerizer' gets deallocated.
      }
      

      Separately, this test should be cleaned up so that it no longer expects multiple offers, i.e. the numberOfOffers constant should be removed.

            People

              Guangya Liu (gyliu)
              Anand Mazumdar (anandmazumdar)
              Joris Van Remoortere
              Votes: 0
              Watchers: 4
