Uploaded image for project: 'Mesos'
  1. Mesos
  2. MESOS-1548

FaultToleranceTest.MasterFailover is flaky

Attach filesAttach ScreenshotVotersWatch issueWatchersLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Cannot Reproduce
    • None
    • None
    • test

    Description

      [ RUN      ] FaultToleranceTest.MasterFailover
      Using temporary directory '/tmp/FaultToleranceTest_MasterFailover_An91kb'
      I0626 15:21:11.941083 13589 leveldb.cpp:176] Opened db in 684083ns
      I0626 15:21:11.941844 13589 leveldb.cpp:183] Compacted db in 195468ns
      I0626 15:21:11.942226 13589 leveldb.cpp:198] Created db iterator in 6853ns
      I0626 15:21:11.942625 13589 leveldb.cpp:204] Seeked to beginning of db in 1662ns
      I0626 15:21:11.942981 13589 leveldb.cpp:273] Iterated through 0 keys in the db in 781ns
      I0626 15:21:11.943406 13589 replica.cpp:741] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
      I0626 15:21:11.944072 13608 recover.cpp:425] Starting replica recovery
      I0626 15:21:11.944135 13608 recover.cpp:451] Replica is in EMPTY status
      I0626 15:21:11.944373 13608 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
      I0626 15:21:11.944427 13608 recover.cpp:188] Received a recover response from a replica in EMPTY status
      I0626 15:21:11.944505 13608 recover.cpp:542] Updating replica status to STARTING
      I0626 15:21:11.944641 13608 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 58924ns
      I0626 15:21:11.944658 13608 replica.cpp:320] Persisted replica status to STARTING
      I0626 15:21:11.944699 13608 recover.cpp:451] Replica is in STARTING status
      I0626 15:21:11.945467 13608 replica.cpp:638] Replica in STARTING status received a broadcasted recover request
      I0626 15:21:11.945942 13608 recover.cpp:188] Received a recover response from a replica in STARTING status
      I0626 15:21:11.946418 13608 recover.cpp:542] Updating replica status to VOTING
      I0626 15:21:11.946910 13608 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 19364ns
      I0626 15:21:11.947319 13608 replica.cpp:320] Persisted replica status to VOTING
      I0626 15:21:11.947773 13612 recover.cpp:556] Successfully joined the Paxos group
      I0626 15:21:11.948055 13610 recover.cpp:440] Recover process terminated
      I0626 15:21:11.953730 13612 master.cpp:288] Master 20140626-152111-2759502016-54657-13589 (fedora-20) started on 192.168.122.164:54657
      I0626 15:21:11.953763 13612 master.cpp:325] Master only allowing authenticated frameworks to register
      I0626 15:21:11.953770 13612 master.cpp:330] Master only allowing authenticated slaves to register
      I0626 15:21:11.953779 13612 credentials.hpp:35] Loading credentials for authentication from '/tmp/FaultToleranceTest_MasterFailover_An91kb/credentials'
      I0626 15:21:11.953865 13612 master.cpp:356] Authorization enabled
      I0626 15:21:11.954516 13615 hierarchical_allocator_process.hpp:301] Initializing hierarchical allocator process with master : master@192.168.122.164:54657
      I0626 15:21:11.954876 13613 master.cpp:122] No whitelist given. Advertising offers for all slaves
      I0626 15:21:11.955338 13612 master.cpp:1122] The newly elected leader is master@192.168.122.164:54657 with id 20140626-152111-2759502016-54657-13589
      I0626 15:21:11.955361 13612 master.cpp:1135] Elected as the leading master!
      I0626 15:21:11.955369 13612 master.cpp:953] Recovering from registrar
      I0626 15:21:11.955417 13612 registrar.cpp:313] Recovering registrar
      I0626 15:21:11.955669 13609 log.cpp:656] Attempting to start the writer
      I0626 15:21:11.956382 13609 replica.cpp:474] Replica received implicit promise request with proposal 1
      I0626 15:21:11.956444 13609 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 37618ns
      I0626 15:21:11.956456 13609 replica.cpp:342] Persisted promised to 1
      I0626 15:21:11.956651 13609 coordinator.cpp:230] Coordinator attemping to fill missing position
      I0626 15:21:11.957474 13609 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2
      I0626 15:21:11.957521 13609 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 21096ns
      I0626 15:21:11.957532 13609 replica.cpp:676] Persisted action at 0
      I0626 15:21:11.958163 13609 replica.cpp:508] Replica received write request for position 0
      I0626 15:21:11.958684 13609 leveldb.cpp:438] Reading position from leveldb took 20198ns
      I0626 15:21:11.958922 13609 leveldb.cpp:343] Persisting action (14 bytes) to leveldb took 44830ns
      I0626 15:21:11.958938 13609 replica.cpp:676] Persisted action at 0
      I0626 15:21:11.959285 13609 replica.cpp:655] Replica received learned notice for position 0
      I0626 15:21:11.959832 13609 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 47773ns
      I0626 15:21:11.960201 13609 replica.cpp:676] Persisted action at 0
      I0626 15:21:11.960614 13609 replica.cpp:661] Replica learned NOP action at position 0
      I0626 15:21:11.960783 13609 log.cpp:672] Writer started with ending position 0
      I0626 15:21:11.961611 13614 leveldb.cpp:438] Reading position from leveldb took 12133ns
      I0626 15:21:11.968036 13613 registrar.cpp:346] Successfully fetched the registry (0B)
      I0626 15:21:11.968072 13613 registrar.cpp:422] Attempting to update the 'registry'
      I0626 15:21:11.969460 13609 log.cpp:680] Attempting to append 131 bytes to the log
      I0626 15:21:11.969980 13609 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 1
      I0626 15:21:11.970623 13613 replica.cpp:508] Replica received write request for position 1
      I0626 15:21:11.970674 13613 leveldb.cpp:343] Persisting action (150 bytes) to leveldb took 28212ns
      I0626 15:21:11.970687 13613 replica.cpp:676] Persisted action at 1
      I0626 15:21:11.971130 13610 replica.cpp:655] Replica received learned notice for position 1
      I0626 15:21:11.971504 13610 leveldb.cpp:343] Persisting action (152 bytes) to leveldb took 349741ns
      I0626 15:21:11.971525 13610 replica.cpp:676] Persisted action at 1
      I0626 15:21:11.971534 13610 replica.cpp:661] Replica learned APPEND action at position 1
      I0626 15:21:11.974668 13614 registrar.cpp:479] Successfully updated 'registry'
      I0626 15:21:11.974843 13614 registrar.cpp:372] Successfully recovered registrar
      I0626 15:21:11.974921 13614 master.cpp:980] Recovered 0 slaves from the Registry (95B) ; allowing 10mins for slaves to re-register
      I0626 15:21:11.976786 13612 log.cpp:699] Attempting to truncate the log to 1
      I0626 15:21:11.976835 13612 coordinator.cpp:340] Coordinator attempting to write TRUNCATE action at position 2
      I0626 15:21:11.977008 13612 replica.cpp:508] Replica received write request for position 2
      I0626 15:21:11.977048 13612 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 20691ns
      I0626 15:21:11.977061 13612 replica.cpp:676] Persisted action at 2
      I0626 15:21:11.977205 13612 replica.cpp:655] Replica received learned notice for position 2
      I0626 15:21:11.977236 13612 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took 15525ns
      I0626 15:21:11.977252 13612 leveldb.cpp:401] Deleting ~1 keys from leveldb took 5528ns
      I0626 15:21:11.977260 13612 replica.cpp:676] Persisted action at 2
      I0626 15:21:11.977284 13612 replica.cpp:661] Replica learned TRUNCATE action at position 2
      I0626 15:21:11.981732 13589 sched.cpp:139] Version: 0.20.0
      I0626 15:21:11.982229 13614 sched.cpp:235] New master detected at master@192.168.122.164:54657
      I0626 15:21:11.982256 13614 sched.cpp:285] Authenticating with master master@192.168.122.164:54657
      I0626 15:21:11.982333 13614 authenticatee.hpp:128] Creating new client SASL connection
      I0626 15:21:11.982904 13614 master.cpp:3499] Authenticating scheduler-13d6b94c-c4b0-4236-b198-5da18a8c7dcc@192.168.122.164:54657
      I0626 15:21:11.982995 13614 authenticator.hpp:156] Creating new server SASL connection
      I0626 15:21:11.983481 13614 authenticatee.hpp:219] Received SASL authentication mechanisms: CRAM-MD5
      I0626 15:21:11.983508 13614 authenticatee.hpp:245] Attempting to authenticate with mechanism 'CRAM-MD5'
      I0626 15:21:11.983536 13614 authenticator.hpp:262] Received SASL authentication start
      I0626 15:21:11.983820 13614 authenticator.hpp:384] Authentication requires more steps
      I0626 15:21:11.983865 13614 authenticatee.hpp:265] Received SASL authentication step
      I0626 15:21:11.983907 13614 authenticator.hpp:290] Received SASL authentication step
      I0626 15:21:11.983924 13614 auxprop.cpp:81] Request to lookup properties for user: 'test-principal' realm: 'fedora-20' server FQDN: 'fedora-20' SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false SASL_AUXPROP_AUTHZID: false 
      I0626 15:21:11.983932 13614 auxprop.cpp:153] Looking up auxiliary property '*userPassword'
      I0626 15:21:11.983944 13614 auxprop.cpp:153] Looking up auxiliary property '*cmusaslsecretCRAM-MD5'
      I0626 15:21:11.983953 13614 auxprop.cpp:81] Request to lookup properties for user: 'test-principal' realm: 'fedora-20' server FQDN: 'fedora-20' SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false SASL_AUXPROP_AUTHZID: true 
      I0626 15:21:11.983960 13614 auxprop.cpp:103] Skipping auxiliary property '*userPassword' since SASL_AUXPROP_AUTHZID == true
      I0626 15:21:11.983966 13614 auxprop.cpp:103] Skipping auxiliary property '*cmusaslsecretCRAM-MD5' since SASL_AUXPROP_AUTHZID == true
      I0626 15:21:11.983978 13614 authenticator.hpp:376] Authentication success
      I0626 15:21:11.984005 13614 authenticatee.hpp:305] Authentication success
      I0626 15:21:11.984029 13614 master.cpp:3539] Successfully authenticated principal 'test-principal' at scheduler-13d6b94c-c4b0-4236-b198-5da18a8c7dcc@192.168.122.164:54657
      I0626 15:21:11.984098 13614 sched.cpp:359] Successfully authenticated with master master@192.168.122.164:54657
      I0626 15:21:11.984113 13614 sched.cpp:478] Sending registration request to master@192.168.122.164:54657
      I0626 15:21:11.984158 13614 master.cpp:1241] Received registration request from scheduler-13d6b94c-c4b0-4236-b198-5da18a8c7dcc@192.168.122.164:54657
      I0626 15:21:11.984179 13614 master.cpp:1201] Authorizing framework principal 'test-principal' to receive offers for role '*'
      I0626 15:21:11.984274 13614 master.cpp:1300] Registering framework 20140626-152111-2759502016-54657-13589-0000 at scheduler-13d6b94c-c4b0-4236-b198-5da18a8c7dcc@192.168.122.164:54657
      I0626 15:21:11.984365 13614 sched.cpp:409] Framework registered with 20140626-152111-2759502016-54657-13589-0000
      I0626 15:21:11.984388 13614 sched.cpp:423] Scheduler::registered took 8620ns
      I0626 15:21:11.984426 13614 hierarchical_allocator_process.hpp:331] Added framework 20140626-152111-2759502016-54657-13589-0000
      I0626 15:21:11.984436 13614 hierarchical_allocator_process.hpp:724] No resources available to allocate!
      I0626 15:21:11.984442 13614 hierarchical_allocator_process.hpp:686] Performed allocation for 0 slaves in 7446ns
      I0626 15:21:11.984794 13612 master.cpp:619] Master terminating
      2014-06-26 15:21:12,255:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      2014-06-26 15:21:15,591:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      2014-06-26 15:21:18,927:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      2014-06-26 15:21:22,263:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      2014-06-26 15:21:25,599:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      2014-06-26 15:21:28,935:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      2014-06-26 15:21:32,271:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      2014-06-26 15:21:35,607:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      2014-06-26 15:21:38,943:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      2014-06-26 15:21:42,282:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      2014-06-26 15:21:45,622:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      2014-06-26 15:21:48,959:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      2014-06-26 15:21:52,295:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      2014-06-26 15:21:55,632:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      2014-06-26 15:21:58,968:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      2014-06-26 15:22:02,304:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      2014-06-26 15:22:05,640:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      2014-06-26 15:22:08,976:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      2014-06-26 15:22:12,313:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      2014-06-26 15:22:15,650:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      2014-06-26 15:22:18,986:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      2014-06-26 15:22:22,323:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      2014-06-26 15:22:25,659:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      2014-06-26 15:22:28,995:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      2014-06-26 15:22:32,331:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      2014-06-26 15:22:35,667:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      2014-06-26 15:22:39,005:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      2014-06-26 15:22:42,343:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      2014-06-26 15:22:45,679:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      2014-06-26 15:22:49,016:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      2014-06-26 15:22:52,352:13589(0x7fc42b2cf700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:51226] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      
      

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            Unassigned Unassigned
            vinodkone Vinod Kone
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment