Uploaded image for project: 'Aurora'
  1. Aurora
  2. AURORA-1661

Scheduler leader failed to re-announce itself after ZK name was changed

    XMLWordPrintableJSON

Details

    • Story
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 0.16.0
    • None
    • None

    Description

      After we renamed our ZK cluster we were in a situation where the Aurora leader did not write to the ZK endpoint.

      Sequence that we performed:

      • change aurora cluster name in clusters.json
      • restart all schedulers (all good)
      • restart zk and move replication log (lose all state) during renaming: intentional
      • current Aurora leader fails to re-announce itself but election isn't retriggered so we lose our leader
      • some time later: restart all schedulers, and a new leader is elected (good again)

      Partial logs from the master (sorry, I'm not sure which parts are super relevant).

      Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 22:59:10,723:30181(0x7f7586244700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [10.162.18.25:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 22:59:10,723:30181(0x7f7586a45700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [10.162.28.31:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
      Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 22:59:10,723:30181(0x7f7586244700):ZOO_INFO@check_events@1703: initiated connection to server [10.162.22.24:2181]
      Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 22:59:10,723:30181(0x7f7586a45700):ZOO_INFO@check_events@1703: initiated connection to server [10.162.2.27:2181]
      Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 22:59:10,733:30181(0x7f7586a45700):ZOO_INFO@check_events@1750: session establishment complete on server [10.162.2.27:2181], sessionId=0x28004d50ccb80001, negotiated timeout=4000
      Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.734082 30370 group.cpp:349] Group process (group(1)@10.162.12.31:8083) connected to ZooKeeper
      Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.734127 30370 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
      Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.734136 30370 group.cpp:427] Trying to create path '/aurora/replicated-log' in ZooKeeper
      Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.755173 30373 network.hpp:413] ZooKeeper group memberships changed
      Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.755249 30367 group.cpp:700] Trying to get '/aurora/replicated-log/0000000000' in ZooKeeper
      Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.756778 30364 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@10.162.14.30:8083 }
      Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.763407 30373 network.hpp:413] ZooKeeper group memberships changed
      Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.763478 30362 group.cpp:700] Trying to get '/aurora/replicated-log/0000000000' in ZooKeeper
      Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.764437 30362 group.cpp:700] Trying to get '/aurora/replicated-log/0000000001' in ZooKeeper
      Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.765494 30376 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@10.160.41.62:8083, log-replica(1)@10.162.14.30:8083 }
      Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 22:59:10,782:30181(0x7f7586244700):ZOO_INFO@check_events@1750: session establishment complete on server [10.162.22.24:2181], sessionId=0x26004d8049520002, negotiated timeout=4000
      Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.782429 30376 group.cpp:349] Group process (group(2)@10.162.12.31:8083) connected to ZooKeeper
      Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.782444 30376 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (1, 0, 0)
      Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.782447 30376 group.cpp:427] Trying to create path '/aurora/replicated-log' in ZooKeeper
      Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.795754 30377 network.hpp:413] ZooKeeper group memberships changed
      Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.795805 30359 group.cpp:700] Trying to get '/aurora/replicated-log/0000000000' in ZooKeeper
      Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.796530 30359 group.cpp:700] Trying to get '/aurora/replicated-log/0000000001' in ZooKeeper
      Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.797600 30359 group.cpp:700] Trying to get '/aurora/replicated-log/0000000002' in ZooKeeper
      Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.798283 30359 group.cpp:700] Trying to get '/aurora/replicated-log/0000000003' in ZooKeeper
      Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.799046 30359 group.cpp:700] Trying to get '/aurora/replicated-log/0000000004' in ZooKeeper
      Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.799967 30372 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@10.160.41.62:8083, log-replica(1)@10.162.9.54:8083, log-replica(1)@10.162.12.31:8083, log-replica(1)@10.162.14.30:8083, log-replica(1)@10.162.29.25:8083 }
      Apr 08 22:59:11 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:11.839 [RedirectMonitor STARTING-SendThread(compute39-sjc1.prod.uber.internal:2181), ClientCnxn$SendThread:1032] Opening socket connection to server compute39-sjc1.prod.uber.internal/10.162.1.28:2181. Will not attempt to authenticate using SA
      Apr 08 22:59:11 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:11.839 [RedirectMonitor STARTING-SendThread(compute39-sjc1.prod.uber.internal:2181), ClientCnxn$SendThread:876] Socket connection established to compute39-sjc1.prod.uber.internal/10.162.1.28:2181, initiating session
      Apr 08 22:59:11 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:11.842 [RedirectMonitor STARTING-SendThread(compute39-sjc1.prod.uber.internal:2181), ClientCnxn$SendThread:1158] Unable to read additional data from server sessionid 0x270049a1666d1483, likely server has closed socket, closing socket connecti
      Apr 08 22:59:12 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:12.810 [RedirectMonitor STARTING-SendThread(compute38-sjc1.prod.uber.internal:2181), ClientCnxn$SendThread:1032] Opening socket connection to server compute38-sjc1.prod.uber.internal/10.162.22.24:2181. Will not attempt to authenticate using S
      Apr 08 22:59:12 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:12.810 [RedirectMonitor STARTING-SendThread(compute38-sjc1.prod.uber.internal:2181), ClientCnxn$SendThread:876] Socket connection established to compute38-sjc1.prod.uber.internal/10.162.22.24:2181, initiating session
      Apr 08 22:59:12 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:12.811 [RedirectMonitor STARTING-SendThread(compute38-sjc1.prod.uber.internal:2181), ClientCnxn$SendThread:1158] Unable to read additional data from server sessionid 0x270049a1666d1483, likely server has closed socket, closing socket connecti
      Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 22:59:13,460:30181(0x7f7577fff700):ZOO_INFO@check_events@1703: initiated connection to server [10.162.2.27:2181]
      Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 22:59:13,462:30181(0x7f7577fff700):ZOO_INFO@check_events@1750: session establishment complete on server [10.162.2.27:2181], sessionId=0x28004d50ccb8001c, negotiated timeout=10000
      Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.463156 30378 group.cpp:349] Group process (group(3)@10.162.12.31:8083) connected to ZooKeeper
      Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.463179 30378 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
      Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.463186 30378 group.cpp:427] Trying to create path '/mesos' in ZooKeeper
      Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.466274 30373 detector.cpp:154] Detected a new leader: (id='1')
      Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.466698 30365 group.cpp:700] Trying to get '/mesos/json.info_0000000001' in ZooKeeper
      Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.467295 30360 detector.cpp:479] A new leading master (UPID=master@10.162.29.25:5050) is detected
      Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.467350 30370 sched.cpp:326] New master detected at master@10.162.29.25:5050
      Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.467427 30370 sched.cpp:382] Authenticating with master master@10.162.29.25:5050
      Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.467437 30370 sched.cpp:389] Using default CRAM-MD5 authenticatee
      Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.467504 30360 authenticatee.cpp:121] Creating new client SASL connection
      Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.753 [RedirectMonitor STARTING-SendThread(compute36-sjc1.prod.uber.internal:2181), ClientCnxn$SendThread:1032] Opening socket connection to server compute36-sjc1.prod.uber.internal/10.162.28.31:2181. Will not attempt to authenticate using S
      Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: W0408 22:59:13.753 [RedirectMonitor STARTING-SendThread(compute36-sjc1.prod.uber.internal:2181), ClientCnxn$SendThread:1162] Session 0x270049a1666d1483 for server null, unexpected error, closing socket connection and attempting reconnect java.net.Connect
      Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[na:1.8.0_72-internal]
      ...
      Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[na:1.8.0_72-internal]
      Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) ~[zookeeper-3.4.8.jar:3.4.8--1]
      Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141) ~[zookeeper-3.4.8.jar:3.4.8--1]
      Apr 08 22:59:14 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:14.851 [RedirectMonitor STARTING-SendThread(compute40-sjc1.prod.uber.internal:2181), ClientCnxn$SendThread:1032] Opening socket connection to server compute40-sjc1.prod.uber.internal/10.162.2.27:2181. Will not attempt to authenticate using SA
      Apr 08 22:59:14 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:14.851 [RedirectMonitor STARTING-SendThread(compute40-sjc1.prod.uber.internal:2181), ClientCnxn$SendThread:876] Socket connection established to compute40-sjc1.prod.uber.internal/10.162.2.27:2181, initiating session
      Apr 08 22:59:14 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:14.852 [RedirectMonitor STARTING-SendThread(compute40-sjc1.prod.uber.internal:2181), ClientCnxn$SendThread:1158] Unable to read additional data from server sessionid 0x270049a1666d1483, likely server has closed socket, closing socket connecti
      ...
      Apr 08 22:59:50 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:50.538 [RedirectMonitor STARTING-SendThread(compute37-sjc1.prod.uber.internal:2181), ClientCnxn$SendThread:1158] Unable to read additional data from server sessionid 0x270049a1666d1483, likely server has closed socket, closing socket connecti
      Apr 08 22:59:52 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:52.273 [RedirectMonitor STARTING-SendThread(compute39-sjc1.prod.uber.internal:2181), ClientCnxn$SendThread:1032] Opening socket connection to server compute39-sjc1.prod.uber.internal/10.162.1.28:2181. Will not attempt to authenticate using SA
      Apr 08 22:59:52 compute33-sjc1 aurora-scheduler[30176]: W0408 22:59:52.273 [RedirectMonitor STARTING-SendThread(compute39-sjc1.prod.uber.internal:2181), ClientCnxn$SendThread:1108] Client session timed out, have not heard from server in 1635ms for sessionid 0x270049a1666d1483
      Apr 08 22:59:52 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:52.273 [RedirectMonitor STARTING-SendThread(compute39-sjc1.prod.uber.internal:2181), ClientCnxn$SendThread:1156] Client session timed out, have not heard from server in 1635ms for sessionid 0x270049a1666d1483, closing socket connection and at
      Apr 08 22:59:52 compute33-sjc1 aurora-scheduler[30176]: W0408 22:59:52.374 [AsyncProcessor-7, Group$ActiveMembership:370] Temporary error cancelling membership:/aurora/scheduler/member_0000000360 org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /aurora/scheduler/member_0000000360
      ...
      

      Attachments

        Issue Links

          Activity

            People

              jsirois John Sirois
              shirchen Dmitriy Shirchenko
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: