Details
Description
When a quota was set and the master is restarted, removal of the quota reliably leads to a CHECK failure for me.
Start a master:
$ mesos-master --work_dir=work_dir
Set a quota. This creates an implicit role.
$ cat quota.json { "role": "role2", "force": true, "guarantee": [ { "name": "cpus", "type": "SCALAR", "scalar": { "value": 1 } } ] } $ cat quota.json| http POST :5050/quota HTTP/1.1 200 OK Content-Length: 0 Date: Thu, 26 Jan 2017 12:33:38 GMT $ http GET :5050/quota HTTP/1.1 200 OK Content-Length: 108 Content-Type: application/json Date: Thu, 26 Jan 2017 12:33:56 GMT { "infos": [ { "guarantee": [ { "name": "cpus", "role": "*", "scalar": { "value": 1.0 }, "type": "SCALAR" } ], "role": "role2" } ] } $ http GET :5050/roles HTTP/1.1 200 OK Content-Length: 106 Content-Type: application/json Date: Thu, 26 Jan 2017 12:34:10 GMT { "roles": [ { "frameworks": [], "name": "role2", "resources": { "cpus": 0, "disk": 0, "gpus": 0, "mem": 0 }, "weight": 1.0 } ] }
Restart the master process using the same work_dir and attempt to delete the quota after the master is started. The DELETE succeeds with an OK.
$ http DELETE :5050/quota/role2 HTTP/1.1 200 OK Content-Length: 0 Date: Thu, 26 Jan 2017 12:36:04 GMT
After handling the request, the master hits a CHECK failure and is aborted.
$ mesos-master --work_dir=work_dir WARNING: Logging before InitGoogleLogging() is written to STDERR I0126 13:34:57.528599 3145483200 main.cpp:278] Build: 2017-01-23 07:57:34 by bbannier I0126 13:34:57.529131 3145483200 main.cpp:279] Version: 1.2.0 I0126 13:34:57.529139 3145483200 main.cpp:286] Git SHA: dd07d025d40975ec660ed17031d95ec0dba842d2 [warn] kq_init: detected broken kqueue; not using.: No such process I0126 13:34:57.758896 3145483200 main.cpp:385] Using 'HierarchicalDRF' allocator I0126 13:34:57.764276 3145483200 replica.cpp:778] Replica recovered with log positions 3 -> 4 with 0 holes and 0 unlearned I0126 13:34:57.765278 256114688 recover.cpp:451] Starting replica recovery I0126 13:34:57.765547 256114688 recover.cpp:477] Replica is in VOTING status I0126 13:34:57.795964 257187840 master.cpp:383] Master 569073cc-1195-45e9-b0d4-e2e1bf0d13d5 (172.18.9.56) started on 172.18.9.56:5050 I0126 13:34:57.796023 257187840 master.cpp:385] Flags at startup: --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate_agents="false" --authenticate_frameworks="false" --authenticate_http_frameworks="false" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --quiet="false" --recovery_agent_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="20secs" --registry_strict="false" --root_submissions="true" --user_sorter="drf" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="work_dir" --zk_session_timeout="10secs" I0126 13:34:57.796478 257187840 master.cpp:437] Master allowing unauthenticated frameworks to register I0126 13:34:57.796507 257187840 master.cpp:451] Master allowing unauthenticated agents to register I0126 13:34:57.796517 257187840 master.cpp:465] Master allowing HTTP frameworks to register without authentication I0126 13:34:57.796540 257187840 master.cpp:507] Using default 'crammd5' authenticator W0126 13:34:57.796573 257187840 authenticator.cpp:512] No credentials provided, authentication requests will be refused I0126 13:34:57.796584 257187840 authenticator.cpp:519] Initializing server SASL I0126 13:34:57.825337 255578112 master.cpp:2121] Elected as the leading master! I0126 13:34:57.825362 255578112 master.cpp:1643] Recovering from registrar I0126 13:34:57.825736 255578112 log.cpp:553] Attempting to start the writer I0126 13:34:57.826889 258260992 replica.cpp:495] Replica received implicit promise request from __req_res__(1)@172.18.9.56:5050 with proposal 2 I0126 13:34:57.828855 258260992 replica.cpp:344] Persisted promised to 2 I0126 13:34:57.829273 258260992 coordinator.cpp:238] Coordinator attempting to fill missing positions I0126 13:34:57.829375 259334144 log.cpp:569] Writer started with ending position 4 I0126 13:34:57.830878 257187840 registrar.cpp:362] Successfully fetched the registry (159B) in 5.427968ms I0126 13:34:57.831029 257187840 registrar.cpp:461] Applied 1 operations in 24us; attempting to update the registry I0126 13:34:57.836194 259334144 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 5 I0126 13:34:57.836676 257724416 replica.cpp:539] Replica received write request for position 5 from __req_res__(2)@172.18.9.56:5050 I0126 13:34:57.837102 255578112 replica.cpp:693] Replica received learned notice for position 5 from @0.0.0.0:0 I0126 13:34:57.837745 257187840 registrar.cpp:506] Successfully updated the registry in 6.685184ms I0126 13:34:57.837806 257187840 registrar.cpp:392] Successfully recovered registrar I0126 13:34:57.837924 255578112 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 6 I0126 13:34:57.838132 256651264 master.cpp:1759] Recovered 0 agents from the registry (159B); allowing 10mins for agents to re-register I0126 13:34:57.838312 257187840 replica.cpp:539] Replica received write request for position 6 from __req_res__(3)@172.18.9.56:5050 I0126 13:34:57.838692 256651264 replica.cpp:693] Replica received learned notice for position 6 from @0.0.0.0:0 I0126 13:36:04.887257 256114688 http.cpp:420] HTTP DELETE for /master/quota/role2 from 127.0.0.1:51458 with User-Agent='HTTPie/0.9.8' I0126 13:36:04.887512 255578112 registrar.cpp:461] Applied 1 operations in 42us; attempting to update the registry I0126 13:36:04.892643 255578112 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 7 I0126 13:36:04.893127 258797568 replica.cpp:539] Replica received write request for position 7 from __req_res__(4)@172.18.9.56:5050 I0126 13:36:04.895309 257187840 replica.cpp:693] Replica received learned notice for position 7 from @0.0.0.0:0 I0126 13:36:04.895814 258260992 registrar.cpp:506] Successfully updated the registry in 8.2688ms F0126 13:36:04.895956 256114688 hierarchical.cpp:1180] Check failed: quotas.contains(role) *** Check failure stack trace: *** I0126 13:36:04.895961 255578112 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 8 I0126 13:36:04.896437 257187840 replica.cpp:539] Replica received write request for position 8 from __req_res__(5)@172.18.9.56:5050 I0126 13:36:04.896908 259334144 replica.cpp:693] Replica received learned notice for position 8 from @0.0.0.0:0 @ 0x10b5e52aa google::LogMessage::Fail() E0126 13:36:04.905042 259870720 process.cpp:2419] Failed to shutdown socket with fd 11: Socket is not connected @ 0x10b5e282c google::LogMessage::SendToLog() @ 0x10b5e3959 google::LogMessage::Flush() @ 0x10b5ee159 google::LogMessageFatal::~LogMessageFatal() @ 0x10b5e5795 google::LogMessageFatal::~LogMessageFatal() @ 0x1089e8d17 mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeQuota() @ 0x107ebbc13 _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNSt3__112basic_stringIcNS6_11char_traitsIcEENS6_9allocatorIcEEEESC_EEvRKNS_3PIDIT_EEMSG_FvT0_ET1_ENKUlPNS_11ProcessBaseEE_clESP_ @ 0x107ebbab0 _ZNSt3__128__invoke_void_return_wrapperIvE6__callIJRZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEESF_EEvRKNS3_3PIDIT_EEMSJ_FvT0_ET1_EUlPNS3_11ProcessBaseEE_SS_EEEvDpOT_ @ 0x107ebb7b9 _ZNSt3__110__function6__funcIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEESE_EEvRKNS2_3PIDIT_EEMSI_FvT0_ET1_EUlPNS2_11ProcessBaseEE_NSC_ISS_EEFvSR_EEclEOSR_ @ 0x10b38ba27 std::__1::function<>::operator()() @ 0x10b38b96c process::ProcessBase::visit() @ 0x10b40415e process::DispatchEvent::visit() @ 0x107665171 process::ProcessBase::serve() @ 0x10b385c07 process::ProcessManager::resume() @ 0x10b47db90 process::ProcessManager::init_threads()::$_0::operator()() @ 0x10b47d7e0 _ZNSt3__114__thread_proxyINS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_EEEEZN7process14ProcessManager12init_threadsEvE3$_0EEEEEPvSB_ @ 0x7fffb2b8eaab _pthread_body @ 0x7fffb2b8e9f7 _pthread_start @ 0x7fffb2b8e1fd thread_start [2] 59343 abort mesos-master --work_dir=work_dir