Uploaded image for project: 'Mesos'
  1. Mesos
  2. MESOS-1482

SlaveRecoveryTest/0.RecoverTerminatedExecutor fails on registry store

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Cannot Reproduce
    • None
    • None
    • test
    • None
    • Q2 Sprint 4

    Description

      [ RUN      ] SlaveRecoveryTest/0.RecoverTerminatedExecutor
      Using temporary directory '/tmp/SlaveRecoveryTest_0_RecoverTerminatedExecutor_YQ24za'
      I0612 14:22:27.358382  5908 leveldb.cpp:176] Opened db in 767016ns
      I0612 14:22:27.359216  5908 leveldb.cpp:183] Compacted db in 222074ns
      I0612 14:22:27.359726  5908 leveldb.cpp:198] Created db iterator in 5441ns
      I0612 14:22:27.360497  5908 leveldb.cpp:204] Seeked to beginning of db in 1303ns
      I0612 14:22:27.360986  5908 leveldb.cpp:273] Iterated through 0 keys in the db in 755ns
      I0612 14:22:27.361481  5908 replica.cpp:741] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
      I0612 14:22:27.362094  5927 recover.cpp:425] Starting replica recovery
      I0612 14:22:27.362175  5927 recover.cpp:451] Replica is in EMPTY status
      I0612 14:22:27.362709  5934 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
      I0612 14:22:27.362819  5927 recover.cpp:188] Received a recover response from a replica in EMPTY status
      I0612 14:22:27.362912  5927 recover.cpp:542] Updating replica status to STARTING
      I0612 14:22:27.363024  5927 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 41670ns
      I0612 14:22:27.363039  5927 replica.cpp:320] Persisted replica status to STARTING
      I0612 14:22:27.363092  5927 recover.cpp:451] Replica is in STARTING status
      I0612 14:22:27.363404  5927 replica.cpp:638] Replica in STARTING status received a broadcasted recover request
      I0612 14:22:27.363462  5927 recover.cpp:188] Received a recover response from a replica in STARTING status
      I0612 14:22:27.363559  5927 recover.cpp:542] Updating replica status to VOTING
      I0612 14:22:27.363622  5927 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 16124ns
      I0612 14:22:27.363634  5927 replica.cpp:320] Persisted replica status to VOTING
      I0612 14:22:27.363663  5927 recover.cpp:556] Successfully joined the Paxos group
      I0612 14:22:27.363708  5927 recover.cpp:440] Recover process terminated
      I0612 14:22:27.368589  5930 master.cpp:280] Master 20140612-142227-2272962752-41488-5908 (fedora-19) started on 192.168.122.135:41488
      I0612 14:22:27.368618  5930 master.cpp:317] Master only allowing authenticated frameworks to register
      I0612 14:22:27.368626  5930 master.cpp:322] Master only allowing authenticated slaves to register
      I0612 14:22:27.368634  5930 credentials.hpp:35] Loading credentials for authentication from '/tmp/SlaveRecoveryTest_0_RecoverTerminatedExecutor_YQ24za/credentials'
      I0612 14:22:27.368711  5930 master.cpp:348] Authorization enabled
      I0612 14:22:27.369092  5930 hierarchical_allocator_process.hpp:301] Initializing hierarchical allocator process with master : master@192.168.122.135:41488
      I0612 14:22:27.369130  5930 master.cpp:114] No whitelist given. Advertising offers for all slaves
      I0612 14:22:27.369279  5930 master.cpp:961] The newly elected leader is master@192.168.122.135:41488 with id 20140612-142227-2272962752-41488-5908
      I0612 14:22:27.369593  5930 master.cpp:974] Elected as the leading master!
      I0612 14:22:27.369607  5930 master.cpp:792] Recovering from registrar
      I0612 14:22:27.369669  5930 registrar.cpp:313] Recovering registrar
      I0612 14:22:27.369881  5930 log.cpp:656] Attempting to start the writer
      I0612 14:22:27.370656  5930 replica.cpp:474] Replica received implicit promise request with proposal 1
      I0612 14:22:27.371258  5930 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 25467ns
      I0612 14:22:27.371793  5930 replica.cpp:342] Persisted promised to 1
      I0612 14:22:27.372181  5930 coordinator.cpp:230] Coordinator attemping to fill missing position
      I0612 14:22:27.373234  5932 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2
      I0612 14:22:27.373282  5932 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 24371ns
      I0612 14:22:27.373404  5932 replica.cpp:676] Persisted action at 0
      I0612 14:22:27.373643  5932 replica.cpp:508] Replica received write request for position 0
      I0612 14:22:27.373673  5932 leveldb.cpp:438] Reading position from leveldb took 13891ns
      I0612 14:22:27.373700  5932 leveldb.cpp:343] Persisting action (14 bytes) to leveldb took 13231ns
      I0612 14:22:27.373710  5932 replica.cpp:676] Persisted action at 0
      I0612 14:22:27.373826  5932 replica.cpp:655] Replica received learned notice for position 0
      I0612 14:22:27.373852  5932 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 13032ns
      I0612 14:22:27.373862  5932 replica.cpp:676] Persisted action at 0
      I0612 14:22:27.373869  5932 replica.cpp:661] Replica learned NOP action at position 0
      I0612 14:22:27.374548  5930 log.cpp:672] Writer started with ending position 0
      I0612 14:22:27.375668  5928 leveldb.cpp:438] Reading position from leveldb took 12278ns
      I0612 14:22:27.381489  5933 registrar.cpp:346] Successfully fetched the registry (0B)
      I0612 14:22:27.381528  5933 registrar.cpp:422] Attempting to update the 'registry'
      I0612 14:22:27.383139  5933 log.cpp:680] Attempting to append 130 bytes to the log
      I0612 14:22:27.383199  5933 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 1
      I0612 14:22:27.383883  5931 replica.cpp:508] Replica received write request for position 1
      I0612 14:22:27.383942  5931 leveldb.cpp:343] Persisting action (149 bytes) to leveldb took 33381ns
      I0612 14:22:27.383955  5931 replica.cpp:676] Persisted action at 1
      I0612 14:22:27.384624  5927 replica.cpp:655] Replica received learned notice for position 1
      I0612 14:22:27.384665  5927 leveldb.cpp:343] Persisting action (151 bytes) to leveldb took 18894ns
      I0612 14:22:27.384675  5927 replica.cpp:676] Persisted action at 1
      I0612 14:22:27.384683  5927 replica.cpp:661] Replica learned APPEND action at position 1
      I0612 14:22:27.387426  5931 registrar.cpp:479] Successfully updated 'registry'
      I0612 14:22:27.387465  5931 registrar.cpp:372] Successfully recovered registrar
      I0612 14:22:27.387532  5931 master.cpp:819] Recovered 0 slaves from the Registry (94B) ; allowing 10mins for slaves to re-register
      I0612 14:22:27.388017  5932 log.cpp:699] Attempting to truncate the log to 1
      I0612 14:22:27.388064  5932 coordinator.cpp:340] Coordinator attempting to write TRUNCATE action at position 2
      I0612 14:22:27.388219  5932 replica.cpp:508] Replica received write request for position 2
      I0612 14:22:27.388254  5932 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 17772ns
      I0612 14:22:27.388265  5932 replica.cpp:676] Persisted action at 2
      I0612 14:22:27.388435  5932 replica.cpp:655] Replica received learned notice for position 2
      I0612 14:22:27.388466  5932 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took 14911ns
      I0612 14:22:27.388483  5932 leveldb.cpp:401] Deleting ~1 keys from leveldb took 5312ns
      I0612 14:22:27.388490  5932 replica.cpp:676] Persisted action at 2
      I0612 14:22:27.388499  5932 replica.cpp:661] Replica learned TRUNCATE action at position 2
      I0612 14:22:27.391860  5908 mesos_containerizer.cpp:124] Using isolation: posix/cpu,posix/mem
      I0612 14:22:27.395467  5934 slave.cpp:143] Slave started on 36)@192.168.122.135:41488
      I0612 14:22:27.395489  5934 credentials.hpp:35] Loading credentials for authentication from '/tmp/SlaveRecoveryTest_0_RecoverTerminatedExecutor_yetcHO/credential'
      I0612 14:22:27.395561  5934 slave.cpp:243] Slave using credential for: test-principal
      I0612 14:22:27.395660  5934 slave.cpp:256] Slave resources: cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000]
      I0612 14:22:27.396206  5934 slave.cpp:284] Slave hostname: fedora-19
      I0612 14:22:27.396225  5934 slave.cpp:285] Slave checkpoint: true
      I0612 14:22:27.396888  5934 state.cpp:33] Recovering state from '/tmp/SlaveRecoveryTest_0_RecoverTerminatedExecutor_yetcHO/meta'
      I0612 14:22:27.396986  5934 status_update_manager.cpp:193] Recovering status update manager
      I0612 14:22:27.397034  5934 mesos_containerizer.cpp:281] Recovering containerizer
      I0612 14:22:27.397214  5934 slave.cpp:3024] Finished recovery
      I0612 14:22:27.397889  5934 slave.cpp:537] New master detected at master@192.168.122.135:41488
      I0612 14:22:27.398550  5934 slave.cpp:613] Authenticating with master master@192.168.122.135:41488
      I0612 14:22:27.398824  5930 status_update_manager.cpp:167] New master detected at master@192.168.122.135:41488
      I0612 14:22:27.399489  5932 authenticatee.hpp:128] Creating new client SASL connection
      I0612 14:22:27.400174  5932 master.cpp:3341] Authenticating slave(36)@192.168.122.135:41488
      I0612 14:22:27.400259  5932 authenticator.hpp:156] Creating new server SASL connection
      I0612 14:22:27.400812  5932 authenticatee.hpp:219] Received SASL authentication mechanisms: CRAM-MD5
      I0612 14:22:27.400836  5932 authenticatee.hpp:245] Attempting to authenticate with mechanism 'CRAM-MD5'
      I0612 14:22:27.400861  5932 authenticator.hpp:262] Received SASL authentication start
      I0612 14:22:27.400899  5932 authenticator.hpp:384] Authentication requires more steps
      I0612 14:22:27.400926  5932 authenticatee.hpp:265] Received SASL authentication step
      I0612 14:22:27.400961  5932 authenticator.hpp:290] Received SASL authentication step
      I0612 14:22:27.400979  5932 auxprop.cpp:81] Request to lookup properties for user: 'test-principal' realm: 'fedora-19' server FQDN: 'fedora-19' SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false SASL_AUXPROP_AUTHZID: false 
      I0612 14:22:27.400986  5932 auxprop.cpp:153] Looking up auxiliary property '*userPassword'
      I0612 14:22:27.400998  5932 auxprop.cpp:153] Looking up auxiliary property '*cmusaslsecretCRAM-MD5'
      I0612 14:22:27.401007  5932 auxprop.cpp:81] Request to lookup properties for user: 'test-principal' realm: 'fedora-19' server FQDN: 'fedora-19' SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false SASL_AUXPROP_AUTHZID: true 
      I0612 14:22:27.401015  5932 auxprop.cpp:103] Skipping auxiliary property '*userPassword' since SASL_AUXPROP_AUTHZID == true
      I0612 14:22:27.401021  5932 auxprop.cpp:103] Skipping auxiliary property '*cmusaslsecretCRAM-MD5' since SASL_AUXPROP_AUTHZID == true
      I0612 14:22:27.401031  5932 authenticator.hpp:376] Authentication success
      I0612 14:22:27.401057  5932 authenticatee.hpp:305] Authentication success
      I0612 14:22:27.401077  5932 master.cpp:3381] Successfully authenticated principal 'test-principal' at slave(36)@192.168.122.135:41488
      I0612 14:22:27.401557  5934 slave.cpp:586] Detecting new master
      I0612 14:22:27.401631  5934 slave.cpp:670] Successfully authenticated with master master@192.168.122.135:41488
      I0612 14:22:27.401676  5934 slave.cpp:908] Will retry registration in 8.328149ms if necessary
      I0612 14:22:27.402134  5932 master.cpp:2620] Registering slave at slave(36)@192.168.122.135:41488 (fedora-19) with id 20140612-142227-2272962752-41488-5908-0
      I0612 14:22:27.402240  5932 registrar.cpp:422] Attempting to update the 'registry'
      I0612 14:22:27.410532  5908 sched.cpp:128] Version: 0.20.0
      I0612 14:22:27.411260  5929 slave.cpp:908] Will retry registration in 8.25517ms if necessary
      I0612 14:22:27.411367  5928 master.cpp:2608] Ignoring register slave message from slave(36)@192.168.122.135:41488 (fedora-19) as admission is already in progress
      I0612 14:22:27.412156  5932 sched.cpp:224] New master detected at master@192.168.122.135:41488
      I0612 14:22:27.412184  5932 sched.cpp:274] Authenticating with master master@192.168.122.135:41488
      I0612 14:22:27.412269  5932 authenticatee.hpp:128] Creating new client SASL connection
      I0612 14:22:27.413010  5931 master.cpp:3341] Authenticating scheduler(35)@192.168.122.135:41488
      I0612 14:22:27.413103  5931 authenticator.hpp:156] Creating new server SASL connection
      I0612 14:22:27.413966  5932 authenticatee.hpp:219] Received SASL authentication mechanisms: CRAM-MD5
      I0612 14:22:27.413993  5932 authenticatee.hpp:245] Attempting to authenticate with mechanism 'CRAM-MD5'
      I0612 14:22:27.414026  5932 authenticator.hpp:262] Received SASL authentication start
      I0612 14:22:27.414067  5932 authenticator.hpp:384] Authentication requires more steps
      I0612 14:22:27.414105  5932 authenticatee.hpp:265] Received SASL authentication step
      I0612 14:22:27.414170  5932 authenticator.hpp:290] Received SASL authentication step
      I0612 14:22:27.414202  5932 auxprop.cpp:81] Request to lookup properties for user: 'test-principal' realm: 'fedora-19' server FQDN: 'fedora-19' SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false SASL_AUXPROP_AUTHZID: false 
      I0612 14:22:27.414222  5932 auxprop.cpp:153] Looking up auxiliary property '*userPassword'
      I0612 14:22:27.414242  5932 auxprop.cpp:153] Looking up auxiliary property '*cmusaslsecretCRAM-MD5'
      I0612 14:22:27.414263  5932 auxprop.cpp:81] Request to lookup properties for user: 'test-principal' realm: 'fedora-19' server FQDN: 'fedora-19' SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false SASL_AUXPROP_AUTHZID: true 
      I0612 14:22:27.414278  5932 auxprop.cpp:103] Skipping auxiliary property '*userPassword' since SASL_AUXPROP_AUTHZID == true
      I0612 14:22:27.414778  5932 auxprop.cpp:103] Skipping auxiliary property '*cmusaslsecretCRAM-MD5' since SASL_AUXPROP_AUTHZID == true
      I0612 14:22:27.414808  5932 authenticator.hpp:376] Authentication success
      I0612 14:22:27.415191  5929 authenticatee.hpp:305] Authentication success
      I0612 14:22:27.415879  5927 sched.cpp:348] Successfully authenticated with master master@192.168.122.135:41488
      I0612 14:22:27.415909  5927 sched.cpp:467] Sending registration request to master@192.168.122.135:41488
      I0612 14:22:27.416354  5930 master.cpp:3381] Successfully authenticated principal 'test-principal' at scheduler(35)@192.168.122.135:41488
      I0612 14:22:27.416837  5930 master.cpp:1080] Received registration request from scheduler(35)@192.168.122.135:41488
      I0612 14:22:27.416878  5930 master.cpp:1040] Authorizing framework principal 'test-principal' to receive offers for role '*'
      I0612 14:22:27.417507  5930 master.cpp:1139] Registering framework 20140612-142227-2272962752-41488-5908-0000 at scheduler(35)@192.168.122.135:41488
      I0612 14:22:27.417613  5930 sched.cpp:398] Framework registered with 20140612-142227-2272962752-41488-5908-0000
      I0612 14:22:27.417650  5930 sched.cpp:412] Scheduler::registered took 14831ns
      I0612 14:22:27.417704  5930 hierarchical_allocator_process.hpp:331] Added framework 20140612-142227-2272962752-41488-5908-0000
      I0612 14:22:27.417721  5930 hierarchical_allocator_process.hpp:724] No resources available to allocate!
      I0612 14:22:27.417733  5930 hierarchical_allocator_process.hpp:686] Performed allocation for 0 slaves in 13083ns
      I0612 14:22:27.421093  5929 slave.cpp:908] Will retry registration in 75.800166ms if necessary
      I0612 14:22:27.421195  5929 master.cpp:2608] Ignoring register slave message from slave(36)@192.168.122.135:41488 (fedora-19) as admission is already in progress
      I0612 14:22:27.497810  5929 slave.cpp:908] Will retry registration in 158.537849ms if necessary
      I0612 14:22:27.497961  5929 master.cpp:2608] Ignoring register slave message from slave(36)@192.168.122.135:41488 (fedora-19) as admission is already in progress
      I0612 14:22:27.657577  5929 slave.cpp:908] Will retry registration in 309.183804ms if necessary
      I0612 14:22:27.657702  5929 master.cpp:2608] Ignoring register slave message from slave(36)@192.168.122.135:41488 (fedora-19) as admission is already in progress
      I0612 14:22:27.967582  5929 slave.cpp:908] Will retry registration in 332.710217ms if necessary
      I0612 14:22:27.967701  5929 master.cpp:2608] Ignoring register slave message from slave(36)@192.168.122.135:41488 (fedora-19) as admission is already in progress
      I0612 14:22:28.301319  5929 slave.cpp:908] Will retry registration in 332.949975ms if necessary
      I0612 14:22:28.301476  5929 master.cpp:2608] Ignoring register slave message from slave(36)@192.168.122.135:41488 (fedora-19) as admission is already in progress
      I0612 14:22:28.369415  5929 hierarchical_allocator_process.hpp:724] No resources available to allocate!
      I0612 14:22:28.369457  5929 hierarchical_allocator_process.hpp:686] Performed allocation for 0 slaves in 55655ns
      I0612 14:22:28.634922  5929 slave.cpp:908] Will retry registration in 1.909975292secs if necessary
      I0612 14:22:28.635051  5929 master.cpp:2608] Ignoring register slave message from slave(36)@192.168.122.135:41488 (fedora-19) as admission is already in progress
      I0612 14:22:29.369622  5929 hierarchical_allocator_process.hpp:724] No resources available to allocate!
      I0612 14:22:29.369673  5929 hierarchical_allocator_process.hpp:686] Performed allocation for 0 slaves in 63149ns
      2014-06-12 14:22:29,391:5908(0x7fc2e44b7700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:48451] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      I0612 14:22:30.370820  5929 hierarchical_allocator_process.hpp:724] No resources available to allocate!
      I0612 14:22:30.370887  5929 hierarchical_allocator_process.hpp:686] Performed allocation for 0 slaves in 75508ns
      I0612 14:22:30.545195  5929 slave.cpp:908] Will retry registration in 4.490573386secs if necessary
      I0612 14:22:30.545390  5929 master.cpp:2608] Ignoring register slave message from slave(36)@192.168.122.135:41488 (fedora-19) as admission is already in progress
      I0612 14:22:31.371768  5929 hierarchical_allocator_process.hpp:724] No resources available to allocate!
      I0612 14:22:31.371814  5929 hierarchical_allocator_process.hpp:686] Performed allocation for 0 slaves in 57909ns
      I0612 14:22:32.370002  5929 master.cpp:114] No whitelist given. Advertising offers for all slaves
      I0612 14:22:32.372148  5929 hierarchical_allocator_process.hpp:724] No resources available to allocate!
      I0612 14:22:32.372166  5929 hierarchical_allocator_process.hpp:686] Performed allocation for 0 slaves in 22692ns
      2014-06-12 14:22:32,727:5908(0x7fc2e44b7700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:48451] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      I0612 14:22:33.372736  5929 hierarchical_allocator_process.hpp:724] No resources available to allocate!
      I0612 14:22:33.372779  5929 hierarchical_allocator_process.hpp:686] Performed allocation for 0 slaves in 52078ns
      I0612 14:22:34.373733  5929 hierarchical_allocator_process.hpp:724] No resources available to allocate!
      I0612 14:22:34.373778  5929 hierarchical_allocator_process.hpp:686] Performed allocation for 0 slaves in 54020ns
      I0612 14:22:35.036500  5929 slave.cpp:908] Will retry registration in 4.152129547secs if necessary
      I0612 14:22:35.036608  5929 master.cpp:2608] Ignoring register slave message from slave(36)@192.168.122.135:41488 (fedora-19) as admission is already in progress
      I0612 14:22:35.373988  5929 hierarchical_allocator_process.hpp:724] No resources available to allocate!
      I0612 14:22:35.374033  5929 hierarchical_allocator_process.hpp:686] Performed allocation for 0 slaves in 59902ns
      2014-06-12 14:22:36,063:5908(0x7fc2e44b7700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:48451] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      I0612 14:22:36.374763  5929 hierarchical_allocator_process.hpp:724] No resources available to allocate!
      I0612 14:22:36.374807  5929 hierarchical_allocator_process.hpp:686] Performed allocation for 0 slaves in 53729ns
      I0612 14:22:37.370941  5929 master.cpp:114] No whitelist given. Advertising offers for all slaves
      I0612 14:22:37.375161  5929 hierarchical_allocator_process.hpp:724] No resources available to allocate!
      I0612 14:22:37.375181  5929 hierarchical_allocator_process.hpp:686] Performed allocation for 0 slaves in 23517ns
      E0612 14:22:37.404731  5929 registrar.cpp:500] Registrar aborting: Failed to update 'registry': Failed to perform store within 10secs
      F0612 14:22:37.404844  5929 master.cpp:2642] Failed to admit slave 20140612-142227-2272962752-41488-5908-0 at slave(36)@192.168.122.135:41488 (fedora-19): Failed to update 'registry': Failed to perform store within 10secs
      *** Check failure stack trace: ***
          @     0x7fc31930fb9d  google::LogMessage::Fail()
          @     0x7fc3193118dc  google::LogMessage::SendToLog()
          @     0x7fc31930f78c  google::LogMessage::Flush()
      tests/slave_recovery_tests.cpp:636: Failure
      Failed to wait 10secs for offers1
          @     0x7fc3193121d9  google::LogMessageFatal::~LogMessageFatal()
          @     0x7fc318e41285  mesos::internal::master::Master::_registerSlave()
          @     0x7fc319244092  process::ProcessManager::resume()
          @     0x7fc31924438c  process::schedule()
          @       0x3284c07c53  (unknown)
          @       0x32844f5dbd  (unknown)
      

      Attachments

        Activity

          People

            bmahler Benjamin Mahler
            vinodkone Vinod Kone
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: