Mesos / MESOS-10008

Very large quota values can crash master.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.9.0
    • Fix Version/s: 1.9.1, 1.10.0
    • Component/s: None
    • Target Version/s:
    • Sprint:
      Resource Mgmt: RI-19 56, Resource Mgmt: RI-19 57
    • Story Points:
      3

      Description

      We are observing the following crash on the 1.9.1 master:

      I1008 10:12:15.148486  4687 http.cpp:1115] HTTP POST for /master/api/v1?_ts=1570529541073&UPDATE_QUOTA from 10.0.7.253:35410 with User-Agent='Mozilla/5.0 (Windows NT 6.1; Win64; x64) Ap>
      I1008 10:12:15.148665  4687 http.cpp:263] Processing call UPDATE_QUOTA
      I1008 10:12:15.148756  4687 quota_handler.cpp:1136] Authorizing principal 'bootstrapuser' to update quota config for role 's1'
      I1008 10:12:15.149169  4685 registrar.cpp:487] Applied 1 operations in 56277ns; attempting to update the registry
      I1008 10:12:15.149338  4681 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 13
      I1008 10:12:15.149467  4689 replica.cpp:541] Replica received write request for position 13 from __req_res__(29)@10.0.7.253:5050
      I1008 10:12:15.151820  4683 replica.cpp:695] Replica received learned notice for position 13 from log-network(2)@10.0.7.253:5050
      I1008 10:12:15.153559  4679 registrar.cpp:544] Successfully updated the registry in 4.348928ms
      I1008 10:12:15.153592  4678 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 14
      I1008 10:12:15.153715  4679 hierarchical.cpp:1619] Updated quota for role 's1',  guarantees: {} limits: cpus:2; disk:-9.22337203685478e+15; gpus:3; mem:1000000000000
      I1008 10:12:15.153796  4677 replica.cpp:541] Replica received write request for position 14 from __req_res__(30)@10.0.7.253:5050
      I1008 10:12:15.155380  4691 replica.cpp:695] Replica received learned notice for position 14 from log-network(2)@10.0.7.253:5050
      I1008 10:12:15.249722  4677 authenticator.cpp:324] dstip=10.0.7.253 type=audit timestamp=2019-10-08 10:12:15.249673984+00:00 reason="Valid authentication token" uid="bootstrapuser" obje>
      I1008 10:12:15.249956  4682 http.cpp:1115] HTTP GET for /master/state-summary?_ts=1570529541169 from 10.0.7.253:35414 with User-Agent='Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebK>
      I1008 10:12:15.250633  4691 http.cpp:1132] HTTP GET for /master/state-summary?_ts=1570529541169 from 10.0.7.253:35414: '200 OK' after 1.72621ms
      I1008 10:12:15.570379  4689 hierarchical.cpp:1908] Before allocation, required quota headroom is {} and available quota headroom is cpus:0.9; disk:75853; mem:5507
      F1008 10:12:15.570580  4689 resource_quantities.cpp:330] Check failed: scalar >= Value::Scalar() (-9.22337203685478e+15 vs. 0)
      *** Check failure stack trace: ***
          @     0x7fc786f0148d  google::LogMessage::Fail()
          @     0x7fc786f036e8  google::LogMessage::SendToLog()
          @     0x7fc786f01023  google::LogMessage::Flush()
          @     0x7fc786f04029  google::LogMessageFatal::~LogMessageFatal()
          @     0x7fc785954dfa  mesos::ResourceQuantities::add()
          @     0x7fc785954fb6  mesos::ResourceQuantities::fromScalarResource()
          @     0x7fc78595e135  mesos::shrinkResources()
          @     0x7fc785a874a9  mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::__allocate()
          @     0x7fc785a88089  mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::_allocate()
          @     0x7fc785a93882  _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchI7NothingN5mesos8internal6master9allocator8internal28Hier>
          @     0x7fc786e49e21  process::ProcessBase::consume()
          @     0x7fc786e6141b  process::ProcessManager::resume()
          @     0x7fc786e670b6  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
          @     0x7fc782a28b22  (unknown)
          @     0x7fc7821be94a  (unknown)
          @     0x7fc781eef07f  clone
      

      Note that the disk quota limit is logged as a negative value (disk:-9.22337203685478e+15).

      Update: we found that the quota limit on that master had actually been set to an extremely large value.
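
One way to see how an extremely large limit could end up logged as a negative number: the logged value -9.22337203685478e+15 equals the smallest signed 64-bit integer divided by 1000. That would be consistent with the value overflowing while being converted to a fixed-point representation with three decimal digits, but this mechanism is an assumption and is not confirmed in this report. The Python sketch below only checks the arithmetic; `SCALE` and `fixed_to_float` are hypothetical names.

```python
# Assumption: scalars are stored as int64 fixed-point with 3 decimal digits.
SCALE = 1000
INT64_MIN = -(2 ** 63)


def fixed_to_float(fixed: int) -> float:
    """Convert a hypothetical fixed-point value back to a floating-point scalar."""
    return fixed / SCALE


def format_like_log(value: float) -> str:
    """Format with 15 significant digits, matching the precision seen in the log."""
    return f"{value:.15g}"


if __name__ == "__main__":
    # If an out-of-range conversion produced INT64_MIN, it would read back as:
    print(format_like_log(fixed_to_float(INT64_MIN)))
```

The printed string matches the `disk` limit in the `hierarchical.cpp:1619` log line and the value in the failed check.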

      To make matters worse, the crash is not guaranteed to occur immediately, so such values can be persisted in the registry before the master fails.
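
Because a bad value can reach the registry before any crash, one possible mitigation is to reject out-of-range quota values before they are applied. The following Python sketch illustrates such a check under stated assumptions: a three-decimal fixed-point int64 representation, and the hypothetical names `MAX_SAFE_QUOTA` and `validate_quota_scalar`. It is not the fix that shipped in 1.9.1 or 1.10.0.

```python
import math
from typing import Optional

# Assumption: scalars are stored as int64 fixed-point with 3 decimal digits.
SCALE = 1000
INT64_MAX = 2 ** 63 - 1
# Hypothetical bound: largest value whose scaled form still fits in int64.
MAX_SAFE_QUOTA = INT64_MAX // SCALE


def validate_quota_scalar(name: str, value: float) -> Optional[str]:
    """Return an error message if the quota value is unsafe, otherwise None."""
    if not math.isfinite(value):
        return f"{name}: quota value must be finite"
    if value < 0:
        return f"{name}: quota value must be non-negative"
    if value > MAX_SAFE_QUOTA:
        return f"{name}: quota value {value!r} exceeds {MAX_SAFE_QUOTA}"
    return None


if __name__ == "__main__":
    # The first three limits come from the log above; the last is a made-up huge value.
    for name, value in [("cpus", 2), ("gpus", 3), ("mem", 1000000000000), ("disk", 1e30)]:
        print(name, validate_quota_scalar(name, value) or "ok")
```

Running the check at request-validation time, before the registry update, would keep an unsafe value from being persisted in the first place.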

            People

            • Assignee:
              bmahler Benjamin Mahler
              Reporter:
              asekretenko Andrei Sekretenko
