Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
0.9.1-incubating
-
None
-
Debian Linux Wheezy
Zookeeper 3.3.3
Java 1.7.0_25
Description
Hi,
I've observed multiple times that de-serialisation of the supervisor state can fail after a host crash or reboot. The supervisor is then unable to come up without manual intervention. AFAICT, the serialized supervisor state is invalid and can't be read at the next start.
Observed error in the supervisor log:
2014-04-29 19:38:35 c.n.c.f.i.CuratorFrameworkImpl [INFO] Starting
2014-04-29 19:38:35 o.a.z.ZooKeeper [INFO] Initiating client connection, connectString=127.0.0.1:2181/storm sessionTimeout=20000 watcher=com.netflix.curator.ConnectionState@18d055e0
2014-04-29 19:38:35 o.a.z.ClientCnxn [INFO] Opening socket connection to server /127.0.0.1:2181
2014-04-29 19:38:35 o.a.z.ClientCnxn [INFO] Socket connection established to localhost/127.0.0.1:2181, initiating session
2014-04-29 19:38:35 o.a.z.ClientCnxn [INFO] Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x145a7cc1c7e48b1, negotiated timeout = 20000
2014-04-29 19:38:35 b.s.d.supervisor [INFO] Starting supervisor with id 71b01216-9d00-4fb6-8538-6673058ab5ef at host storm
2014-04-29 19:38:36 b.s.event [ERROR] Error when processing event
java.lang.RuntimeException: java.io.EOFException
    at backtype.storm.utils.Utils.deserialize(Utils.java:86) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
    at backtype.storm.utils.LocalState.snapshot(LocalState.java:45) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
    at backtype.storm.utils.LocalState.get(LocalState.java:56) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
    at backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:207) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
    at clojure.lang.AFn.applyToHelper(AFn.java:161) ~[clojure-1.4.0.jar:na]
    at clojure.lang.AFn.applyTo(AFn.java:151) ~[clojure-1.4.0.jar:na]
    at clojure.core$apply.invoke(core.clj:603) ~[clojure-1.4.0.jar:na]
    at clojure.core$partial$fn__4070.doInvoke(core.clj:2343) ~[clojure-1.4.0.jar:na]
    at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.4.0.jar:na]
    at backtype.storm.event$event_manager$fn__2593.invoke(event.clj:39) ~[na:na]
    at clojure.lang.AFn.run(AFn.java:24) ~[clojure-1.4.0.jar:na]
    at java.lang.Thread.run(Thread.java:724) ~[na:1.7.0_25]
Caused by: java.io.EOFException: null
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323) ~[na:1.7.0_25]
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2792) ~[na:1.7.0_25]
    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:799) ~[na:1.7.0_25]
    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299) ~[na:1.7.0_25]
    at backtype.storm.utils.Utils.deserialize(Utils.java:81) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
    ... 11 common frames omitted
2014-04-29 19:38:36 b.s.util [INFO] Halting process: ("Error when processing an event")
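For what it's worth, the EOFException in the trace is raised in ObjectInputStream.readStreamHeader, i.e. while the constructor reads the 4-byte stream header, which is what plain Java serialization does when the input is empty or shorter than that header. The sketch below reproduces that behaviour outside Storm. It is only an illustration: the class EmptyStateDemo and its helper methods are made up for this example and are not Storm code, and it says nothing about how the state file ended up in that condition.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.HashMap;

// Hypothetical demo class, not part of Storm.
public class EmptyStateDemo {

    /** Serializes an object with plain Java serialization. */
    static byte[] serialize(Object obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Returns "ok" on success, otherwise the simple name of the exception thrown. */
    static String tryDeserialize(byte[] data) {
        // The ObjectInputStream constructor already reads the stream header,
        // so empty or too-short input fails before readObject() is reached.
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            in.readObject();
            return "ok";
        } catch (EOFException e) {
            return "EOFException";
        } catch (IOException | ClassNotFoundException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        byte[] valid = serialize(new HashMap<String, Integer>());
        System.out.println(tryDeserialize(valid));                   // ok
        System.out.println(tryDeserialize(new byte[0]));             // EOFException
        System.out.println(tryDeserialize(Arrays.copyOf(valid, 2))); // EOFException
    }
}
```

If the state file on disk is empty or cut off inside its first four bytes, a plain deserialization attempt would be expected to fail in the same place as in the log above.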
Current workaround: fully stopping the supervisor daemon and deleting everything under Storm's data/supervisor directory helped; after a restart, the supervisor is running smoothly again.
Here are some references to very similar issues:
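Before wiping the directory, it may be worth checking whether it really contains files too short to hold a Java serialization header (4 bytes: a 2-byte magic number and a 2-byte version). Below is a rough sketch of such a check. The class name ShortStateFileScan is made up, the directory is passed as an argument rather than assumed, and it needs Java 8 or later (newer than the 1.7.0_25 used here). Not every file in that directory necessarily holds serialized data, so a hit is only a hint, not proof of corruption.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical diagnostic helper, not part of Storm.
public class ShortStateFileScan {

    // A Java serialization stream starts with a 2-byte magic number and a 2-byte version.
    static final int STREAM_HEADER_BYTES = 4;

    /** Returns all regular files under dir that are shorter than the stream header. */
    static List<Path> findTooShortFiles(Path dir) {
        List<Path> suspects = new ArrayList<>();
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.filter(Files::isRegularFile).forEach(p -> {
                try {
                    if (Files.size(p) < STREAM_HEADER_BYTES) {
                        suspects.add(p);
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return suspects;
    }

    public static void main(String[] args) {
        // Directory to scan; defaults to the current directory if none is given.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : findTooShortFiles(dir)) {
            System.out.println("suspect: " + p);
        }
    }
}
```

Run it with the supervisor's local state directory as the only argument while the daemon is stopped. It only reads file sizes and does not modify anything.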
- http://mail-archives.apache.org/mod_mbox/storm-user/201402.mbox/%3C23100d14e7ac4cef947f7236ef8963e1@BY2PR08MB144.namprd08.prod.outlook.com%3E
- https://groups.google.com/forum/#!topic/storm-user/SL9FK9XeoI8
- https://groups.google.com/forum/#!topic/storm-user/2gapTYTRrX8
Regards,