Uploaded image for project: 'Mesos'
  1. Mesos
  2. MESOS-7008

Quota not recovered from registry in empty cluster.

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • 1.0.2, 1.1.0
    • 1.0.4, 1.1.1, 1.2.0
    • allocation, master
    • OS X, SSL build

    • Mesosphere Sprint 50
    • 3

    Description

      When a quota was set and the master is restarted, removal of the quota reliably leads to a CHECK failure for me.

      Start a master:

      $ mesos-master --work_dir=work_dir
      

      Set a quota. This creates an implicit role.

      $ cat quota.json
      {
          "role": "role2",
          "force": true,
          "guarantee": [
              {
                  "name": "cpus",
                  "type": "SCALAR",
                  "scalar": { "value": 1 }
              }
          ]
      }
      
      $ cat quota.json| http POST :5050/quota
      HTTP/1.1 200 OK
      Content-Length: 0
      Date: Thu, 26 Jan 2017 12:33:38 GMT
      
      $ http GET :5050/quota
      HTTP/1.1 200 OK
      Content-Length: 108
      Content-Type: application/json
      Date: Thu, 26 Jan 2017 12:33:56 GMT
      
      {
          "infos": [
              {
                  "guarantee": [
                      {
                          "name": "cpus",
                          "role": "*",
                          "scalar": {
                              "value": 1.0
                          },
                          "type": "SCALAR"
                      }
                  ],
                  "role": "role2"
              }
          ]
      }
      
      $ http GET :5050/roles
      HTTP/1.1 200 OK
      Content-Length: 106
      Content-Type: application/json
      Date: Thu, 26 Jan 2017 12:34:10 GMT
      
      {
          "roles": [
              {
                  "frameworks": [],
                  "name": "role2",
                  "resources": {
                      "cpus": 0,
                      "disk": 0,
                      "gpus": 0,
                      "mem": 0
                  },
                  "weight": 1.0
              }
          ]
      }
      

      Restart the master process using the same work_dir and attempt to delete the quota after the master is started. The DELETE succeeds with an OK.

      $ http DELETE :5050/quota/role2
      HTTP/1.1 200 OK
      Content-Length: 0
      Date: Thu, 26 Jan 2017 12:36:04 GMT
      

      After handling the request, the master hits a CHECK failure and is aborted.

      $ mesos-master --work_dir=work_dir
      WARNING: Logging before InitGoogleLogging() is written to STDERR
      I0126 13:34:57.528599 3145483200 main.cpp:278] Build: 2017-01-23 07:57:34 by bbannier
      I0126 13:34:57.529131 3145483200 main.cpp:279] Version: 1.2.0
      I0126 13:34:57.529139 3145483200 main.cpp:286] Git SHA: dd07d025d40975ec660ed17031d95ec0dba842d2
      [warn] kq_init: detected broken kqueue; not using.: No such process
      I0126 13:34:57.758896 3145483200 main.cpp:385] Using 'HierarchicalDRF' allocator
      I0126 13:34:57.764276 3145483200 replica.cpp:778] Replica recovered with log positions 3 -> 4 with 0 holes and 0 unlearned
      I0126 13:34:57.765278 256114688 recover.cpp:451] Starting replica recovery
      I0126 13:34:57.765547 256114688 recover.cpp:477] Replica is in VOTING status
      I0126 13:34:57.795964 257187840 master.cpp:383] Master 569073cc-1195-45e9-b0d4-e2e1bf0d13d5 (172.18.9.56) started on 172.18.9.56:5050
      I0126 13:34:57.796023 257187840 master.cpp:385] Flags at startup: --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate_agents="false" --authenticate_frameworks="false" --authenticate_http_frameworks="false" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --quiet="false" --recovery_agent_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="20secs" --registry_strict="false" --root_submissions="true" --user_sorter="drf" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="work_dir" --zk_session_timeout="10secs"
      I0126 13:34:57.796478 257187840 master.cpp:437] Master allowing unauthenticated frameworks to register
      I0126 13:34:57.796507 257187840 master.cpp:451] Master allowing unauthenticated agents to register
      I0126 13:34:57.796517 257187840 master.cpp:465] Master allowing HTTP frameworks to register without authentication
      I0126 13:34:57.796540 257187840 master.cpp:507] Using default 'crammd5' authenticator
      W0126 13:34:57.796573 257187840 authenticator.cpp:512] No credentials provided, authentication requests will be refused
      I0126 13:34:57.796584 257187840 authenticator.cpp:519] Initializing server SASL
      I0126 13:34:57.825337 255578112 master.cpp:2121] Elected as the leading master!
      I0126 13:34:57.825362 255578112 master.cpp:1643] Recovering from registrar
      I0126 13:34:57.825736 255578112 log.cpp:553] Attempting to start the writer
      I0126 13:34:57.826889 258260992 replica.cpp:495] Replica received implicit promise request from __req_res__(1)@172.18.9.56:5050 with proposal 2
      I0126 13:34:57.828855 258260992 replica.cpp:344] Persisted promised to 2
      I0126 13:34:57.829273 258260992 coordinator.cpp:238] Coordinator attempting to fill missing positions
      I0126 13:34:57.829375 259334144 log.cpp:569] Writer started with ending position 4
      I0126 13:34:57.830878 257187840 registrar.cpp:362] Successfully fetched the registry (159B) in 5.427968ms
      I0126 13:34:57.831029 257187840 registrar.cpp:461] Applied 1 operations in 24us; attempting to update the registry
      I0126 13:34:57.836194 259334144 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 5
      I0126 13:34:57.836676 257724416 replica.cpp:539] Replica received write request for position 5 from __req_res__(2)@172.18.9.56:5050
      I0126 13:34:57.837102 255578112 replica.cpp:693] Replica received learned notice for position 5 from @0.0.0.0:0
      I0126 13:34:57.837745 257187840 registrar.cpp:506] Successfully updated the registry in 6.685184ms
      I0126 13:34:57.837806 257187840 registrar.cpp:392] Successfully recovered registrar
      I0126 13:34:57.837924 255578112 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 6
      I0126 13:34:57.838132 256651264 master.cpp:1759] Recovered 0 agents from the registry (159B); allowing 10mins for agents to re-register
      I0126 13:34:57.838312 257187840 replica.cpp:539] Replica received write request for position 6 from __req_res__(3)@172.18.9.56:5050
      I0126 13:34:57.838692 256651264 replica.cpp:693] Replica received learned notice for position 6 from @0.0.0.0:0
      I0126 13:36:04.887257 256114688 http.cpp:420] HTTP DELETE for /master/quota/role2 from 127.0.0.1:51458 with User-Agent='HTTPie/0.9.8'
      I0126 13:36:04.887512 255578112 registrar.cpp:461] Applied 1 operations in 42us; attempting to update the registry
      I0126 13:36:04.892643 255578112 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 7
      I0126 13:36:04.893127 258797568 replica.cpp:539] Replica received write request for position 7 from __req_res__(4)@172.18.9.56:5050
      I0126 13:36:04.895309 257187840 replica.cpp:693] Replica received learned notice for position 7 from @0.0.0.0:0
      I0126 13:36:04.895814 258260992 registrar.cpp:506] Successfully updated the registry in 8.2688ms
      F0126 13:36:04.895956 256114688 hierarchical.cpp:1180] Check failed: quotas.contains(role)
      *** Check failure stack trace: ***
      I0126 13:36:04.895961 255578112 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 8
      I0126 13:36:04.896437 257187840 replica.cpp:539] Replica received write request for position 8 from __req_res__(5)@172.18.9.56:5050
      I0126 13:36:04.896908 259334144 replica.cpp:693] Replica received learned notice for position 8 from @0.0.0.0:0
          @        0x10b5e52aa  google::LogMessage::Fail()
      E0126 13:36:04.905042 259870720 process.cpp:2419] Failed to shutdown socket with fd 11: Socket is not connected
          @        0x10b5e282c  google::LogMessage::SendToLog()
          @        0x10b5e3959  google::LogMessage::Flush()
          @        0x10b5ee159  google::LogMessageFatal::~LogMessageFatal()
          @        0x10b5e5795  google::LogMessageFatal::~LogMessageFatal()
          @        0x1089e8d17  mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeQuota()
          @        0x107ebbc13  _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNSt3__112basic_stringIcNS6_11char_traitsIcEENS6_9allocatorIcEEEESC_EEvRKNS_3PIDIT_EEMSG_FvT0_ET1_ENKUlPNS_11ProcessBaseEE_clESP_
          @        0x107ebbab0  _ZNSt3__128__invoke_void_return_wrapperIvE6__callIJRZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEESF_EEvRKNS3_3PIDIT_EEMSJ_FvT0_ET1_EUlPNS3_11ProcessBaseEE_SS_EEEvDpOT_
          @        0x107ebb7b9  _ZNSt3__110__function6__funcIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEESE_EEvRKNS2_3PIDIT_EEMSI_FvT0_ET1_EUlPNS2_11ProcessBaseEE_NSC_ISS_EEFvSR_EEclEOSR_
          @        0x10b38ba27  std::__1::function<>::operator()()
          @        0x10b38b96c  process::ProcessBase::visit()
          @        0x10b40415e  process::DispatchEvent::visit()
          @        0x107665171  process::ProcessBase::serve()
          @        0x10b385c07  process::ProcessManager::resume()
          @        0x10b47db90  process::ProcessManager::init_threads()::$_0::operator()()
          @        0x10b47d7e0  _ZNSt3__114__thread_proxyINS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_EEEEZN7process14ProcessManager12init_threadsEvE3$_0EEEEEPvSB_
          @     0x7fffb2b8eaab  _pthread_body
          @     0x7fffb2b8e9f7  _pthread_start
          @     0x7fffb2b8e1fd  thread_start
      [2]    59343 abort      mesos-master --work_dir=work_dir
      

      Attachments

        Activity

          People

            neilc Neil Conway
            bbannier Benjamin Bannier
            Alex R Alex R
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: