Uploaded image for project: 'Apache Storm'
  1. Apache Storm
  2. STORM-2987

PaceMakerStateStorage should deal with InterruptedException correctly

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.0
    • Component/s: None

      Description

      We found an issue that when nimbus restarted, it can only get leadership after a few seconds (15~20s). 

       

      2018-02-27 08:18:43.420 main o.a.s.z.LeaderElectorImp [INFO] Queued up for leader lock.
      2018-02-27 08:18:43.481 main o.a.s.d.m.MetricsUtils [INFO] Using statistics reporter plugin:org.apache.storm.daemon.metrics.reporters.JmxPreparableReporter
      2018-02-27 08:18:43.483 main o.a.s.d.m.r.JmxPreparableReporter [INFO] Preparing...
      2018-02-27 08:18:43.499 main o.a.s.m.StormMetricsRegistry [INFO] Started statistics report plugin...
      2018-02-27 08:18:43.543 main o.a.s.m.n.Login [INFO] successfully logged in.
      2018-02-27 08:18:43.551 main o.a.s.z.ClientZookeeper [INFO] Staring ZK Curator
      2018-02-27 08:18:43.551 main o.a.c.f.i.CuratorFrameworkImpl [INFO] Starting
      2018-02-27 08:18:43.552 Refresh-TGT o.a.s.m.n.Login [INFO] TGT refresh thread started.
      2018-02-27 08:18:43.553 Refresh-TGT o.a.s.m.n.Login [INFO] TGT valid starting at:        Tue Feb 27 08:18:43 UTC 2018
      2018-02-27 08:18:43.553 Refresh-TGT o.a.s.m.n.Login [INFO] TGT expires:                  Wed Feb 28 08:18:43 UTC 2018
      2018-02-27 08:18:43.553 Refresh-TGT o.a.s.m.n.Login [INFO] TGT refresh sleeping until: Wed Feb 28 04:35:55 UTC 2018
      2018-02-27 08:18:43.553 main o.a.z.ZooKeeper [INFO] Initiating client connection, connectString=openqe74blue-gw.blue.ygrid.yahoo.com:2181 sessionTimeout=60000 watcher=org.apache.
      curator.ConnectionState@2e185cd7
      2018-02-27 08:18:43.559 main o.a.c.f.i.CuratorFrameworkImpl [INFO] Default schema
      2018-02-27 08:18:43.560 main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) o.a.z.c.ZooKeeperSaslClient [INFO] Client will use GSSAPI as SASL mechanism.
      2018-02-27 08:18:43.561 main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) o.a.z.ClientCnxn [INFO] Opening socket connection to server openqe74blue-gw.blue.ygrid.yahoo.co
      m/10.215.68.156:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
      2018-02-27 08:18:43.562 main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) o.a.z.ClientCnxn [INFO] Socket connection established to openqe74blue-gw.blue.ygrid.yahoo.com/1
      0.215.68.156:2181, initiating session
      2018-02-27 08:18:43.565 main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) o.a.z.ClientCnxn [INFO] Session establishment complete on server openqe74blue-gw.blue.ygrid.yah
      oo.com/10.215.68.156:2181, sessionid = 0x161d5f1ae970099, negotiated timeout = 40000
      2018-02-27 08:18:43.565 main-EventThread o.a.c.f.s.ConnectionStateManager [INFO] State change: CONNECTED
      2018-02-27 08:18:43.605 Curator-Framework-0 o.a.c.f.i.CuratorFrameworkImpl [INFO] backgroundOperationsLoop exiting
      2018-02-27 08:18:43.625 main o.a.z.ZooKeeper [INFO] Session: 0x161d5f1ae970099 closed
      2018-02-27 08:18:43.625 main-EventThread o.a.z.ClientCnxn [INFO] EventThread shut down
      2018-02-27 08:18:43.626 main o.a.s.z.ClientZookeeper [INFO] Staring ZK Curator
      2018-02-27 08:18:43.626 main o.a.c.f.i.CuratorFrameworkImpl [INFO] Starting
      2018-02-27 08:18:43.635 main o.a.z.ZooKeeper [INFO] Initiating client connection, connectString=openqe74blue-gw.blue.ygrid.yahoo.com:2181/storm_ystormQE_CI sessionTimeout=60000 w
      atcher=org.apache.curator.ConnectionState@46cc127b
      2018-02-27 08:18:43.654 main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) o.a.z.c.ZooKeeperSaslClient [INFO] Client will use GSSAPI as SASL mechanism.
      2018-02-27 08:18:43.660 main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) o.a.z.ClientCnxn [INFO] Opening socket connection to server openqe74blue-gw.blue.ygrid.yahoo.co
      m/10.215.68.156:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
      2018-02-27 08:18:43.663 main o.a.c.f.i.CuratorFrameworkImpl [INFO] Default schema
      2018-02-27 08:18:43.663 main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) o.a.z.ClientCnxn [INFO] Socket connection established to openqe74blue-gw.blue.ygrid.yahoo.com/1
      0.215.68.156:2181, initiating session
      2018-02-27 08:18:43.666 main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) o.a.z.ClientCnxn [INFO] Session establishment complete on server openqe74blue-gw.blue.ygrid.yah
      oo.com/10.215.68.156:2181, sessionid = 0x161d5f1ae97009a, negotiated timeout = 40000
      2018-02-27 08:18:43.669 main-EventThread o.a.c.f.s.ConnectionStateManager [INFO] State change: CONNECTED
      2018-02-27 08:18:43.790 main o.a.s.d.n.Nimbus [INFO] Starting nimbus server for storm version '2.0.0.y'
      2018-02-27 08:18:44.274 timer o.a.s.d.n.Nimbus [INFO] not a leader, skipping assignments
      2018-02-27 08:18:44.274 timer o.a.s.d.n.Nimbus [INFO] not a leader, skipping cleanup
      2018-02-27 08:18:44.300 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download the blob with key: blob-5-1518767144-stormcode.ser
      2018-02-27 08:18:44.301 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download the blob with key: TestZkErrorNodesHaveCorrectAcls-3-1519540302-stormcode.ser
      2018-02-27 08:18:44.302 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download the blob with key: logviewer-ui-groups-test-1-1518940914-stormcode.ser
      2018-02-27 08:18:44.303 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download the blob with key: blob-5-1518800831-stormcode.ser
      2018-02-27 08:18:44.304 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download the blob with key: blob-5-1518767144-stormconf.ser
      2018-02-27 08:18:44.306 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download the blob with key: logviewer-ui-groups-test-1-1518940914-stormconf.ser
      2018-02-27 08:18:44.307 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download the blob with key: TestZkErrorNodesHaveCorrectAcls-3-1519540302-stormconf.ser
      2018-02-27 08:18:44.308 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download the blob with key: blob-5-1518800831-stormconf.ser
      2018-02-27 08:18:44.367 timer o.a.s.d.n.Nimbus [INFO] not a leader, skipping credential renewal.
      2018-02-27 08:18:54.274 timer o.a.s.d.n.Nimbus [INFO] not a leader, skipping assignments
      2018-02-27 08:18:54.274 timer o.a.s.d.n.Nimbus [INFO] not a leader, skipping cleanup
      2018-02-27 08:18:59.367 timer o.a.s.d.n.Nimbus [INFO] not a leader, skipping credential renewal.
      2018-02-27 08:19:02.059 main-EventThread o.a.s.z.Zookeeper [INFO] active-topology-blobs [] local-topology-blobs [] diff-topology-blobs []
      2018-02-27 08:19:02.059 main-EventThread o.a.s.z.Zookeeper [INFO] active-topology-dependencies [] local-blobs [] diff-topology-dependencies []
      2018-02-27 08:19:02.059 main-EventThread o.a.s.z.Zookeeper [INFO] Accepting leadership, all active topologies and corresponding dependencies found locally.
      2018-02-27 08:19:04.754 timer o.a.s.d.n.Nimbus [INFO] Scheduling took 442 ms for 0 topologies
      

       

      This can be re-produced by the following steps:

      1. restart pacemaker;
      2. before pacemaker is up, restart nimbus

      When we restart nimbus process,  it runs ShutDownHooks and stuck on timer.close(). 

      https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java#L4227

      The timer is not able to close because it's waiting for  doCleanup() to stop. However, the interruptedException is caught and ate in : https://github.com/apache/storm/blob/master/storm-client/src/jvm/org/apache/storm/pacemaker/PacemakerClient.java#L180-L192

       

       

       

       

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                ethanli Ethan Li
                Reporter:
                ethanli Ethan Li
              • Votes:
                0 Vote for this issue
                Watchers:
                1 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 3h
                  3h