Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 0.9.2-incubating, 0.9.0.1, 0.9.3
- None
- None
Description
We recently had an issue where a worker process was shut down cleanly on 0.9.0. Why the worker shut down cleanly is not the issue here, but it caused a cascading failure that made a connected worker shut down too. This is going to be even more problematic in newer versions of Storm, where we give the worker time to shut down cleanly instead of just shooting it with a kill -9.
Ideally the client should continue to try to reconnect, because the worker may have exited on its own and will be re-spawned shortly. If it is rescheduled elsewhere, the worker will eventually detect that and reroute things accordingly. This is what already happens when the connection is just closed. There really is no reason for one side to know when the other side is shutting down.
2014-08-11 19:00:17 b.s.util [ERROR] Async loop died!
java.lang.RuntimeException: java.lang.RuntimeException: Client is being closed, and does not take requests any more
    at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:130) ~[storm-core-0.9.0-wip21.jar:na]
    at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:101) ~[storm-core-0.9.0-wip21.jar:na]
    at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:62) ~[storm-core-0.9.0-wip21.jar:na]
    at backtype.storm.disruptor$consume_loop_STAR_$fn__1999.invoke(disruptor.clj:74) ~[storm-core-0.9.0-wip21.jar:na]
    at backtype.storm.util$async_loop$fn__421.invoke(util.clj:400) ~[storm-core-0.9.0-wip21.jar:na]
    at clojure.lang.AFn.run(AFn.java:24) [clojure-1.4.0.jar:na]
    at java.lang.Thread.run(Thread.java:722) [na:1.7.0_17]
Caused by: java.lang.RuntimeException: Client is being closed, and does not take requests any more
    at backtype.storm.messaging.netty.Client.send(Client.java:118) ~[storm-netty-0.9.0-wip21.jar:na]
    at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__4922$fn__4923.invoke(worker.clj:342) ~[storm-core-0.9.0-wip21.jar:na]
    at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__4922.invoke(worker.clj:331) ~[storm-core-0.9.0-wip21.jar:na]
    at backtype.storm.disruptor$clojure_handler$reify__1986.onEvent(disruptor.clj:43) ~[storm-core-0.9.0-wip21.jar:na]
    at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:127) ~[storm-core-0.9.0-wip21.jar:na]
    ... 6 common frames omitted
2014-08-11 19:00:17 b.s.util [INFO] Halting process: ("Async loop died!")
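To make the requested behavior concrete, here is a minimal sketch of a sender that queues messages and keeps trying to reconnect instead of throwing back into the caller's loop. It is not Storm's netty Client and does not reflect how the actual fix was implemented; the names Transport, ReconnectingSender, and maxPending are hypothetical, and dropping the oldest message when the queue is full is an assumption made only to keep the example bounded.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical transport abstraction; not part of Storm's API. */
interface Transport {
    /** Attempts to (re)open the connection; returns true on success. */
    boolean connect();

    /** Sends one message; throws IOException if the connection is down. */
    void write(byte[] message) throws IOException;
}

/** Sketch of a sender that reconnects rather than failing its caller. */
class ReconnectingSender {
    private final Transport transport;
    private final int maxPending;
    private final Deque<byte[]> pending = new ArrayDeque<>();
    private boolean connected = false;

    ReconnectingSender(Transport transport, int maxPending) {
        this.transport = transport;
        this.maxPending = maxPending;
    }

    /** Queues the message and tries to flush; does not throw when the peer is down. */
    void send(byte[] message) {
        if (pending.size() >= maxPending) {
            // Assumption: when the buffer is full, the oldest message is dropped.
            pending.pollFirst();
        }
        pending.addLast(message);
        flush();
    }

    /** Reconnects if needed, then drains the queue. Returns the number of messages sent. */
    int flush() {
        int sent = 0;
        if (!connected) {
            connected = transport.connect();
        }
        while (connected && !pending.isEmpty()) {
            try {
                transport.write(pending.peekFirst());
                pending.pollFirst(); // remove only after a successful write
                sent++;
            } catch (IOException e) {
                // Peer went away: keep the message and retry on the next flush.
                connected = false;
            }
        }
        return sent;
    }

    int pendingCount() {
        return pending.size();
    }
}
```

In this sketch a caller would invoke flush() periodically (for example from a timer) so that queued messages go out once the remote worker is back. The point is only that a closed peer turns into a retry on the sending side rather than a RuntimeException that kills the async loop.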
Issue Links
- duplicates: STORM-404 Worker on one machine crashes due to a failure of another worker on another machine (Resolved)