Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
2.0.0
-
None
Description
We found an issue that when nimbus restarted, it can only get leadership after a few seconds (15~20s).
2018-02-27 08:18:43.420 main o.a.s.z.LeaderElectorImp [INFO] Queued up for leader lock. 2018-02-27 08:18:43.481 main o.a.s.d.m.MetricsUtils [INFO] Using statistics reporter plugin:org.apache.storm.daemon.metrics.reporters.JmxPreparableReporter 2018-02-27 08:18:43.483 main o.a.s.d.m.r.JmxPreparableReporter [INFO] Preparing... 2018-02-27 08:18:43.499 main o.a.s.m.StormMetricsRegistry [INFO] Started statistics report plugin... 2018-02-27 08:18:43.543 main o.a.s.m.n.Login [INFO] successfully logged in. 2018-02-27 08:18:43.551 main o.a.s.z.ClientZookeeper [INFO] Staring ZK Curator 2018-02-27 08:18:43.551 main o.a.c.f.i.CuratorFrameworkImpl [INFO] Starting 2018-02-27 08:18:43.552 Refresh-TGT o.a.s.m.n.Login [INFO] TGT refresh thread started. 2018-02-27 08:18:43.553 Refresh-TGT o.a.s.m.n.Login [INFO] TGT valid starting at: Tue Feb 27 08:18:43 UTC 2018 2018-02-27 08:18:43.553 Refresh-TGT o.a.s.m.n.Login [INFO] TGT expires: Wed Feb 28 08:18:43 UTC 2018 2018-02-27 08:18:43.553 Refresh-TGT o.a.s.m.n.Login [INFO] TGT refresh sleeping until: Wed Feb 28 04:35:55 UTC 2018 2018-02-27 08:18:43.553 main o.a.z.ZooKeeper [INFO] Initiating client connection, connectString=openqe74blue-gw.blue.ygrid.yahoo.com:2181 sessionTimeout=60000 watcher=org.apache. curator.ConnectionState@2e185cd7 2018-02-27 08:18:43.559 main o.a.c.f.i.CuratorFrameworkImpl [INFO] Default schema 2018-02-27 08:18:43.560 main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) o.a.z.c.ZooKeeperSaslClient [INFO] Client will use GSSAPI as SASL mechanism. 2018-02-27 08:18:43.561 main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) o.a.z.ClientCnxn [INFO] Opening socket connection to server openqe74blue-gw.blue.ygrid.yahoo.co m/10.215.68.156:2181. Will attempt to SASL-authenticate using Login Context section 'Client' 2018-02-27 08:18:43.562 main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) o.a.z.ClientCnxn [INFO] Socket connection established to openqe74blue-gw.blue.ygrid.yahoo.com/1 0.215.68.156:2181, initiating session 2018-02-27 08:18:43.565 main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) o.a.z.ClientCnxn [INFO] Session establishment complete on server openqe74blue-gw.blue.ygrid.yah oo.com/10.215.68.156:2181, sessionid = 0x161d5f1ae970099, negotiated timeout = 40000 2018-02-27 08:18:43.565 main-EventThread o.a.c.f.s.ConnectionStateManager [INFO] State change: CONNECTED 2018-02-27 08:18:43.605 Curator-Framework-0 o.a.c.f.i.CuratorFrameworkImpl [INFO] backgroundOperationsLoop exiting 2018-02-27 08:18:43.625 main o.a.z.ZooKeeper [INFO] Session: 0x161d5f1ae970099 closed 2018-02-27 08:18:43.625 main-EventThread o.a.z.ClientCnxn [INFO] EventThread shut down 2018-02-27 08:18:43.626 main o.a.s.z.ClientZookeeper [INFO] Staring ZK Curator 2018-02-27 08:18:43.626 main o.a.c.f.i.CuratorFrameworkImpl [INFO] Starting 2018-02-27 08:18:43.635 main o.a.z.ZooKeeper [INFO] Initiating client connection, connectString=openqe74blue-gw.blue.ygrid.yahoo.com:2181/storm_ystormQE_CI sessionTimeout=60000 w atcher=org.apache.curator.ConnectionState@46cc127b 2018-02-27 08:18:43.654 main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) o.a.z.c.ZooKeeperSaslClient [INFO] Client will use GSSAPI as SASL mechanism. 2018-02-27 08:18:43.660 main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) o.a.z.ClientCnxn [INFO] Opening socket connection to server openqe74blue-gw.blue.ygrid.yahoo.co m/10.215.68.156:2181. Will attempt to SASL-authenticate using Login Context section 'Client' 2018-02-27 08:18:43.663 main o.a.c.f.i.CuratorFrameworkImpl [INFO] Default schema 2018-02-27 08:18:43.663 main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) o.a.z.ClientCnxn [INFO] Socket connection established to openqe74blue-gw.blue.ygrid.yahoo.com/1 0.215.68.156:2181, initiating session 2018-02-27 08:18:43.666 main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) o.a.z.ClientCnxn [INFO] Session establishment complete on server openqe74blue-gw.blue.ygrid.yah oo.com/10.215.68.156:2181, sessionid = 0x161d5f1ae97009a, negotiated timeout = 40000 2018-02-27 08:18:43.669 main-EventThread o.a.c.f.s.ConnectionStateManager [INFO] State change: CONNECTED 2018-02-27 08:18:43.790 main o.a.s.d.n.Nimbus [INFO] Starting nimbus server for storm version '2.0.0.y' 2018-02-27 08:18:44.274 timer o.a.s.d.n.Nimbus [INFO] not a leader, skipping assignments 2018-02-27 08:18:44.274 timer o.a.s.d.n.Nimbus [INFO] not a leader, skipping cleanup 2018-02-27 08:18:44.300 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download the blob with key: blob-5-1518767144-stormcode.ser 2018-02-27 08:18:44.301 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download the blob with key: TestZkErrorNodesHaveCorrectAcls-3-1519540302-stormcode.ser 2018-02-27 08:18:44.302 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download the blob with key: logviewer-ui-groups-test-1-1518940914-stormcode.ser 2018-02-27 08:18:44.303 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download the blob with key: blob-5-1518800831-stormcode.ser 2018-02-27 08:18:44.304 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download the blob with key: blob-5-1518767144-stormconf.ser 2018-02-27 08:18:44.306 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download the blob with key: logviewer-ui-groups-test-1-1518940914-stormconf.ser 2018-02-27 08:18:44.307 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download the blob with key: TestZkErrorNodesHaveCorrectAcls-3-1519540302-stormconf.ser 2018-02-27 08:18:44.308 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download the blob with key: blob-5-1518800831-stormconf.ser 2018-02-27 08:18:44.367 timer o.a.s.d.n.Nimbus [INFO] not a leader, skipping credential renewal. 2018-02-27 08:18:54.274 timer o.a.s.d.n.Nimbus [INFO] not a leader, skipping assignments 2018-02-27 08:18:54.274 timer o.a.s.d.n.Nimbus [INFO] not a leader, skipping cleanup 2018-02-27 08:18:59.367 timer o.a.s.d.n.Nimbus [INFO] not a leader, skipping credential renewal. 2018-02-27 08:19:02.059 main-EventThread o.a.s.z.Zookeeper [INFO] active-topology-blobs [] local-topology-blobs [] diff-topology-blobs [] 2018-02-27 08:19:02.059 main-EventThread o.a.s.z.Zookeeper [INFO] active-topology-dependencies [] local-blobs [] diff-topology-dependencies [] 2018-02-27 08:19:02.059 main-EventThread o.a.s.z.Zookeeper [INFO] Accepting leadership, all active topologies and corresponding dependencies found locally. 2018-02-27 08:19:04.754 timer o.a.s.d.n.Nimbus [INFO] Scheduling took 442 ms for 0 topologies
This can be re-produced by the following steps:
1. restart pacemaker; 2. before pacemaker is up, restart nimbus
When we restart nimbus process, it runs ShutDownHooks and stuck on timer.close().
The timer is not able to close because it's waiting for doCleanup() to stop. However, the interruptedException is caught and ate in : https://github.com/apache/storm/blob/master/storm-client/src/jvm/org/apache/storm/pacemaker/PacemakerClient.java#L180-L192
Attachments
Issue Links
- links to