[MESOS-4831] Master sometimes sends two inverse offers after the agent goes into maintenance. - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 0.27.0
Fix Version/s: 0.28.0
Component/s: None
Labels:
- maintenance
- mesosphere

Epic Link: Maintenance
Sprint: Mesosphere Sprint 30

Description

This showed up on ASF CI for MasterMaintenanceTest.PendingUnavailabilityTest:

https://builds.apache.org/job/Mesos/1748/COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu:14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)/consoleFull

I0229 11:08:57.027559   668 hierarchical.cpp:1437] No resources available to allocate!
I0229 11:08:57.027745   668 hierarchical.cpp:1150] Performed allocation for slave fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-S0 in 272747ns
I0229 11:08:57.027757   675 master.cpp:5369] Sending 1 offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-0000 (default)
I0229 11:08:57.028586   675 master.cpp:5459] Sending 1 inverse offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-0000 (default)
I0229 11:08:57.029039   675 master.cpp:5459] Sending 1 inverse offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-0000 (default)
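To check other CI logs for the same symptom, the repeated "Sending N inverse offers" lines can be counted per framework. The following is only a sketch, not part of Mesos: the helper name countInverseOfferSends is hypothetical, and it assumes the master log format shown in the excerpt above.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper (not a Mesos API): counts, per framework ID, how many
// log lines of the form "... Sending N inverse offers to framework <id> ..."
// appear in a master log. Assumes the log format shown in the excerpt.
std::map<std::string, std::size_t> countInverseOfferSends(
    const std::string& log)
{
  static const std::string marker = " inverse offers to framework ";

  std::map<std::string, std::size_t> sends;
  std::istringstream in(log);
  std::string line;

  while (std::getline(in, line)) {
    const std::size_t pos = line.find(marker);
    if (pos == std::string::npos) {
      continue; // Not an inverse offer line (regular offers do not match).
    }

    // The framework ID runs from the end of the marker to the next space,
    // or to the end of the line if no space follows.
    const std::size_t begin = pos + marker.size();
    const std::size_t end = line.find(' ', begin);
    const std::size_t length =
      (end == std::string::npos) ? std::string::npos : end - begin;

    ++sends[line.substr(begin, length)];
  }

  return sends;
}
```

A count greater than one for a single framework within one allocation cycle would match the behavior reported here.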

The expected workflow for this test is roughly:

1. The framework receives offers from the master.
2. The framework updates its maintenance schedule.
3. The current offer is rescinded.
4. A new offer is received from the master with unavailability set.
5. After the agent goes into maintenance, a single inverse offer is sent.
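The difference between the expected and the observed event streams can be modeled in isolation. This is a simplified sketch under stated assumptions: the Event enum and the helper functions are hypothetical names for illustration, they do not use any Mesos types, and only events received by the framework are modeled (the schedule update is an action the framework takes, so it is left out).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical model of the events the framework receives in this test.
enum class Event
{
  OFFER,                     // Regular offer, no unavailability.
  RESCIND,                   // Original offer rescinded after schedule update.
  OFFER_WITH_UNAVAILABILITY, // New offer carrying the unavailability.
  INVERSE_OFFER              // Inverse offer for the scheduled agent.
};

// The event sequence described by the expected workflow.
std::vector<Event> expectedWorkflow()
{
  return {
    Event::OFFER,
    Event::RESCIND,
    Event::OFFER_WITH_UNAVAILABILITY,
    Event::INVERSE_OFFER
  };
}

// Returns true only if the observed events match the expected workflow
// exactly, so a second inverse offer is reported as a mismatch.
bool matchesExpectedWorkflow(const std::vector<Event>& observed)
{
  return observed == expectedWorkflow();
}

// Counts how many inverse offers were observed.
std::ptrdiff_t countInverseOffers(const std::vector<Event>& observed)
{
  return std::count(observed.begin(), observed.end(), Event::INVERSE_OFFER);
}
```

With the sequence seen in the logs, which ends in two INVERSE_OFFER events, matchesExpectedWorkflow returns false and countInverseOffers returns 2. A test that only waits for the first inverse offer never observes this mismatch.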

For some reason, the logs show the master sending two inverse offers. The test still passes because it only checks that the first inverse offer is present. The issue can also be reproduced with the following modified version of the original test:

// Test ensures that an offer will have an `unavailability` set if the
// slave is scheduled to go down for maintenance.
TEST_F(MasterMaintenanceTest, PendingUnavailabilityTest)
{
  Try<PID<Master>> master = StartMaster();
  ASSERT_SOME(master);

  MockExecutor exec(DEFAULT_EXECUTOR_ID);

  Try<PID<Slave>> slave = StartSlave(&exec);
  ASSERT_SOME(slave);

  auto scheduler = std::make_shared<MockV1HTTPScheduler>();

  EXPECT_CALL(*scheduler, heartbeat(_))
    .WillRepeatedly(Return()); // Ignore heartbeats.

  Future<Nothing> connected;
  EXPECT_CALL(*scheduler, connected(_))
    .WillOnce(FutureSatisfy(&connected))
    .WillRepeatedly(Return()); // Ignore future invocations.

  scheduler::TestV1Mesos mesos(master.get(), ContentType::PROTOBUF, scheduler);

  AWAIT_READY(connected);

  Future<Event::Subscribed> subscribed;
  EXPECT_CALL(*scheduler, subscribed(_, _))
    .WillOnce(FutureArg<1>(&subscribed));

  Future<Event::Offers> normalOffers;
  Future<Event::Offers> unavailabilityOffers;
  Future<Event::Offers> inverseOffers;
  EXPECT_CALL(*scheduler, offers(_, _))
    .WillOnce(FutureArg<1>(&normalOffers))
    .WillOnce(FutureArg<1>(&unavailabilityOffers))
    .WillOnce(FutureArg<1>(&inverseOffers));

  // The original offers should be rescinded when the unavailability is changed.
  Future<Nothing> offerRescinded;
  EXPECT_CALL(*scheduler, rescind(_, _))
    .WillOnce(FutureSatisfy(&offerRescinded));

  {
    Call call;
    call.set_type(Call::SUBSCRIBE);

    Call::Subscribe* subscribe = call.mutable_subscribe();
    subscribe->mutable_framework_info()->CopyFrom(DEFAULT_V1_FRAMEWORK_INFO);

    mesos.send(call);
  }

  AWAIT_READY(subscribed);

  v1::FrameworkID frameworkId(subscribed->framework_id());

  AWAIT_READY(normalOffers);
  EXPECT_NE(0, normalOffers->offers().size());

  // Regular offers shouldn't have unavailability.
  foreach (const v1::Offer& offer, normalOffers->offers()) {
    EXPECT_FALSE(offer.has_unavailability());
  }

  // Schedule this slave for maintenance.
  MachineID machine;
  machine.set_hostname(maintenanceHostname);
  machine.set_ip(stringify(slave.get().address.ip));

  const Time start = Clock::now() + Seconds(60);
  const Duration duration = Seconds(120);
  const Unavailability unavailability = createUnavailability(start, duration);

  // Post a valid schedule with one machine.
  maintenance::Schedule schedule = createSchedule(
      {createWindow({machine}, unavailability)});

  // We have a few seconds between the first set of offers and the
  // next allocation of offers. This should be enough time to perform
  // a maintenance schedule update. This update will also trigger the
  // rescinding of offers from the scheduled slave.
  Future<Response> response = process::http::post(
      master.get(),
      "maintenance/schedule",
      headers,
      stringify(JSON::protobuf(schedule)));

  AWAIT_EXPECT_RESPONSE_STATUS_EQ(OK().status, response);

  // The original offers should be rescinded when the unavailability
  // is changed.
  AWAIT_READY(offerRescinded);

  AWAIT_READY(unavailabilityOffers);
  EXPECT_NE(0, unavailabilityOffers->offers().size());

  // Make sure the new offers have the unavailability set.
  foreach (const v1::Offer& offer, unavailabilityOffers->offers()) {
    EXPECT_TRUE(offer.has_unavailability());
    EXPECT_EQ(
        unavailability.start().nanoseconds(),
        offer.unavailability().start().nanoseconds());

    EXPECT_EQ(
        unavailability.duration().nanoseconds(),
        offer.unavailability().duration().nanoseconds());
  }

  // We also expect an inverse offer for the slave to go under
  // maintenance.
  AWAIT_READY(inverseOffers);
  EXPECT_NE(0, inverseOffers->inverse_offers().size());

  EXPECT_CALL(exec, shutdown(_))
    .Times(AtMost(1));

  EXPECT_CALL(*scheduler, disconnected(_))
    .Times(AtMost(1));

  Shutdown(); // Must shutdown before 'containerizer' gets deallocated.
}

Separately, and unrelated to this bug, the test should be cleaned up so that it no longer expects multiple offers, i.e. the numberOfOffers constant should be removed.

Issue Links

blocks: MESOS-4915 Mesos 0.28.0-rc2 cherry-picks (Resolved)
