Details
Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
1.9.0
None
Resource Mgmt: RI-19 56, Resource Mgmt: RI-19 57
3
Description
We are observing the following crash on the 1.9.1 master:
I1008 10:12:15.148486 4687 http.cpp:1115] HTTP POST for /master/api/v1?_ts=1570529541073&UPDATE_QUOTA from 10.0.7.253:35410 with User-Agent='Mozilla/5.0 (Windows NT 6.1; Win64; x64) Ap>
I1008 10:12:15.148665 4687 http.cpp:263] Processing call UPDATE_QUOTA
I1008 10:12:15.148756 4687 quota_handler.cpp:1136] Authorizing principal 'bootstrapuser' to update quota config for role 's1'
I1008 10:12:15.149169 4685 registrar.cpp:487] Applied 1 operations in 56277ns; attempting to update the registry
I1008 10:12:15.149338 4681 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 13
I1008 10:12:15.149467 4689 replica.cpp:541] Replica received write request for position 13 from __req_res__(29)@10.0.7.253:5050
I1008 10:12:15.151820 4683 replica.cpp:695] Replica received learned notice for position 13 from log-network(2)@10.0.7.253:5050
I1008 10:12:15.153559 4679 registrar.cpp:544] Successfully updated the registry in 4.348928ms
I1008 10:12:15.153592 4678 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 14
I1008 10:12:15.153715 4679 hierarchical.cpp:1619] Updated quota for role 's1', guarantees: {} limits: cpus:2; disk:-9.22337203685478e+15; gpus:3; mem:1000000000000
I1008 10:12:15.153796 4677 replica.cpp:541] Replica received write request for position 14 from __req_res__(30)@10.0.7.253:5050
I1008 10:12:15.155380 4691 replica.cpp:695] Replica received learned notice for position 14 from log-network(2)@10.0.7.253:5050
I1008 10:12:15.249722 4677 authenticator.cpp:324] dstip=10.0.7.253 type=audit timestamp=2019-10-08 10:12:15.249673984+00:00 reason="Valid authentication token" uid="bootstrapuser" obje>
I1008 10:12:15.249956 4682 http.cpp:1115] HTTP GET for /master/state-summary?_ts=1570529541169 from 10.0.7.253:35414 with User-Agent='Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebK>
I1008 10:12:15.250633 4691 http.cpp:1132] HTTP GET for /master/state-summary?_ts=1570529541169 from 10.0.7.253:35414: '200 OK' after 1.72621ms
I1008 10:12:15.570379 4689 hierarchical.cpp:1908] Before allocation, required quota headroom is {} and available quota headroom is cpus:0.9; disk:75853; mem:5507
F1008 10:12:15.570580 4689 resource_quantities.cpp:330] Check failed: scalar >= Value::Scalar() (-9.22337203685478e+15 vs. 0)
*** Check failure stack trace: ***
    @ 0x7fc786f0148d google::LogMessage::Fail()
    @ 0x7fc786f036e8 google::LogMessage::SendToLog()
    @ 0x7fc786f01023 google::LogMessage::Flush()
    @ 0x7fc786f04029 google::LogMessageFatal::~LogMessageFatal()
    @ 0x7fc785954dfa mesos::ResourceQuantities::add()
    @ 0x7fc785954fb6 mesos::ResourceQuantities::fromScalarResource()
    @ 0x7fc78595e135 mesos::shrinkResources()
    @ 0x7fc785a874a9 mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::__allocate()
    @ 0x7fc785a88089 mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::_allocate()
    @ 0x7fc785a93882 _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchI7NothingN5mesos8internal6master9allocator8internal28Hier>
    @ 0x7fc786e49e21 process::ProcessBase::consume()
    @ 0x7fc786e6141b process::ProcessManager::resume()
    @ 0x7fc786e670b6 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
    @ 0x7fc782a28b22 (unknown)
    @ 0x7fc7821be94a (unknown)
    @ 0x7fc781eef07f clone
Note that the disk quota limit is logged as a negative value (disk:-9.22337203685478e+15).
Update: we determined that the quota limit on that master had in fact been set to an extremely large value.
The situation is made worse by the fact that the crash is not guaranteed to occur immediately, so these values can be persisted in the registry before the master fails.
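
One way to keep such a value out of the registry is to reject it before the quota update is applied. The sketch below is illustrative only and is not the fix that was committed. It assumes that scalar quantities are kept as fixed-point numbers with three decimal places in a signed 64-bit integer; under that assumption the logged value is consistent with an overflow, since INT64_MIN / 1000 is about -9.22337203685478e+15. The names isValidQuotaLimit and minFixedPointAsDouble are hypothetical helpers, not part of the Mesos API.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper (not a Mesos function): returns true if `value` is a
// finite, non-negative scalar that still fits in a signed 64-bit integer
// after being scaled by 1000, i.e. stored with three decimal places.
// The three-decimal fixed-point representation is an assumption.
bool isValidQuotaLimit(double value)
{
  if (!std::isfinite(value) || value < 0.0) {
    return false;
  }

  // 2^63 is exactly representable as a double. A scaled value at or above
  // this bound cannot be stored in an int64_t.
  const double int64Bound = 9223372036854775808.0;

  // If the multiplication overflows to infinity, the comparison is false.
  return value * 1000.0 < int64Bound;
}

// Hypothetical helper: INT64_MIN read as a count of thousandths and
// converted back to a double. This is the value an overflowed conversion
// could produce under the assumption above.
double minFixedPointAsDouble()
{
  return static_cast<double>(std::numeric_limits<std::int64_t>::min()) /
         1000.0;
}
```

With this check, limits such as cpus:2 or mem:1000000000000 from the log above would be accepted, while a limit of 1e16 or larger would be rejected at validation time instead of tripping the CHECK in the allocator later.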