Mesos / MESOS-9555

Allocator CHECK failure: reservationScalarQuantities.contains(role)

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.7.0, 1.7.1
    • Fix Version/s: 1.5.3, 1.6.2, 1.7.2, 1.8.0
    • Component/s: allocation, master
    • Labels:
      None
    • Environment:
      • Mesos 1.5
      • DISTRIB_ID=Ubuntu
      • DISTRIB_RELEASE=16.04
      • DISTRIB_CODENAME=xenial
      • DISTRIB_DESCRIPTION="Ubuntu 16.04.5 LTS"
    • Target Version/s:
    • Sprint:
      Resource Mgmt RI10 Sp 39, Resource Mgmt RI11 Sp 40
    • Story Points:
      3

      Description

      We recently upgraded our Mesos cluster from version 1.3 to 1.5, and since then have been getting periodic master crashes due to this error:

      Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: F0205 15:53:57.385118 8434 hierarchical.cpp:2630] Check failed: reservationScalarQuantities.contains(role)

      Full stack trace is at the end of this issue description. When the master fails, we automatically restart it and it rejoins the cluster just fine. I did some initial searching and was unable to find any existing bug reports or other people experiencing this issue. We run a cluster of 3 masters, and see crashes on all 3 instances.

Right before the crash, we saw a "Removed agent: ..." log line noting that agent 9b912afa-1ced-49db-9c85-7bc5a22ef072-S6 was the one removed.

Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: I0205 15:53:57.384759 8432 master.cpp:9893] Removed agent 9b912afa-1ced-49db-9c85-7bc5a22ef072-S6 at slave(1)@10.0.18.78:5051 (10.0.18.78): the agent unregistered

      I saved the full log from the master, so happy to provide more info from it, or anything else about our current environment.

      Full stack trace is below.

      Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9170a7d google::LogMessage::Fail()
      Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9172830 google::LogMessage::SendToLog()
      Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9170663 google::LogMessage::Flush()
      Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9173259 google::LogMessageFatal::~LogMessageFatal()
      Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e8443cbd mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::untrackReservations()
      Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e8448fcd mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeSlave()
      Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90c4f11 process::ProcessBase::consume()
      Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90dea4a process::ProcessManager::resume()
      Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90e25d6 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
      Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e6700c80 (unknown)
      Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e5f136ba start_thread
      Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e5c4941d (unknown)

        Attachments

        1. 0001-Fixed-an-allocator-crash-during-reservation-tracking.patch
          2 kB
          Benjamin Mahler
        2. mesos_leader.log
          783 kB
          David Wilemski
        3. mesos.log
          14 kB
          David Wilemski
        4. 0001-Added-additional-logging-to-1.5.2-to-investigate-MES.patch
          5 kB
          Benjamin Mahler

          Activity

            People

            • Assignee:
              bmahler Benjamin Mahler
              Reporter:
              fluxx Jeff Pollard
• Votes:
  0
• Watchers:
  6

              Dates

              • Created:
                Updated:
                Resolved: