Uploaded image for project: 'Apache Storm'
  1. Apache Storm
  2. STORM-3107

Nimbus confused about leadership after crash

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Cannot Reproduce
    • 2.0.0
    • None
    • None
    • None

    Description

      Nimbus crashed and restarted without shutting down zookeeper due to a deadlock in the timer shutdown code.  This could however also happen for various other issues.  

       

      The problem is that once Nimbus restarted, it was really confused about who the leader was:

       

      2018-05-24 09:27:21.762 o.a.s.z.LeaderElectorImp main [INFO] Queued up for leader lock.
      2018-05-24 09:27:22.604 o.a.s.d.n.Nimbus timer [INFO] not a leader, skipping assignments
      2018-05-24 09:27:22.604 o.a.s.d.n.Nimbus timer [INFO] not a leader, skipping cleanup
      2018-05-24 09:27:22.633 o.a.s.d.n.Nimbus timer [INFO] not a leader, skipping credential renewal.
      
      2018-05-24 09:27:40.771 o.a.s.d.n.Nimbus pool-37-thread-63 [WARN] Topology submission exception. (topology name='topology-testOverSubscribe-1')
      java.lang.RuntimeException: not a leader, current leader is NimbusInfo{host='openqe82blue-n1.blue.ygrid.yahoo.com', port=50560, isLeader=true}
              at org.apache.storm.daemon.nimbus.Nimbus.assertIsLeader(Nimbus.java:1311) ~[storm-server-2.0.0.y.jar:2.0.0.y]
              at org.apache.storm.daemon.nimbus.Nimbus.submitTopologyWithOpts(Nimbus.java:2807) ~[storm-server-2.0.0.y.jar:2.0.0.y]
              at org.apache.storm.generated.Nimbus$Processor$submitTopologyWithOpts.getResult(Nimbus.java:3454) ~[storm-client-2.0.0.y.jar:2.0.0.y]
              at org.apache.storm.generated.Nimbus$Processor$submitTopologyWithOpts.getResult(Nimbus.java:3438) ~[storm-client-2.0.0.y.jar:2.0.0.y]
              at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[libthrift-0.9.3.jar:0.9.3]
              at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[libthrift-0.9.3.jar:0.9.3]
              at org.apache.storm.security.auth.sasl.SaslTransportPlugin$TUGIWrapProcessor.process(SaslTransportPlugin.java:147) ~[storm-client-2.0.0.y.jar:2.0.0.y]
              at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) ~[libthrift-0.9.3.jar:0.9.3]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
              at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
      2018-05-24 09:27:40.771 o.a.s.b.BlobStoreUtils Timer-1 [ERROR] Could not download the blob with key: topology-testOverCapacityScheduling-2-1519992333-stormcode.ser
      2018-05-24 09:27:40.771 o.a.t.s.TThreadPoolServer pool-37-thread-63 [ERROR] Error occurred during processing of message.
      java.lang.RuntimeException: java.lang.RuntimeException: not a leader, current leader is NimbusInfo{host='openqe82blue-n1.blue.ygrid.yahoo.com', port=50560, isLeader=true}
              at org.apache.storm.daemon.nimbus.Nimbus.submitTopologyWithOpts(Nimbus.java:2961) ~[storm-server-2.0.0.y.jar:2.0.0.y]
              at org.apache.storm.generated.Nimbus$Processor$submitTopologyWithOpts.getResult(Nimbus.java:3454) ~[storm-client-2.0.0.y.jar:2.0.0.y]
              at org.apache.storm.generated.Nimbus$Processor$submitTopologyWithOpts.getResult(Nimbus.java:3438) ~[storm-client-2.0.0.y.jar:2.0.0.y]
              at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[libthrift-0.9.3.jar:0.9.3]
              at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[libthrift-0.9.3.jar:0.9.3]
              at org.apache.storm.security.auth.sasl.SaslTransportPlugin$TUGIWrapProcessor.process(SaslTransportPlugin.java:147) ~[storm-client-2.0.0.y.jar:2.0.0.y]
              at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) ~[libthrift-0.9.3.jar:0.9.3]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
              at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
      Caused by: java.lang.RuntimeException: not a leader, current leader is NimbusInfo{host='openqe82blue-n1.blue.ygrid.yahoo.com', port=50560, isLeader=true}
              at org.apache.storm.daemon.nimbus.Nimbus.assertIsLeader(Nimbus.java:1311) ~[storm-server-2.0.0.y.jar:2.0.0.y]
              at org.apache.storm.daemon.nimbus.Nimbus.submitTopologyWithOpts(Nimbus.java:2807) ~[storm-server-2.0.0.y.jar:2.0.0.y]
              ... 9 more
      

      The session timeout was set to 20 seconds, but we're exceeding this period, and Nimbus did not recover leadership.  It needed to be restarted manually to recover.

       

      Attachments

        Activity

          People

            Unassigned Unassigned
            agresch Aaron Gresch
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: