  1. Apache Storm
  2. STORM-307

After host crash, supervisor is unable to restart itself

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.1-incubating
    • Fix Version/s: 0.9.3-rc2
    • Component/s: storm-core
    • Labels: None
    • Environment: Debian Linux Wheezy
      Zookeeper 3.3.3
      Java 1.7.0_25

    Description

      Hi,

      I've observed multiple times that de-serialisation of the supervisor state can fail after a host crash or reboot. The supervisor is then unable to come up without manual intervention. AFAICT, the serialized supervisor state is invalid and cannot be read at the next start.

      Observed error in the supervisor log:

      2014-04-29 19:38:35 c.n.c.f.i.CuratorFrameworkImpl [INFO] Starting
      2014-04-29 19:38:35 o.a.z.ZooKeeper [INFO] Initiating client connection, connectString=127.0.0.1:2181/storm sessionTimeout=20000 watcher=com.netflix.curator.ConnectionState@18d055e0
      2014-04-29 19:38:35 o.a.z.ClientCnxn [INFO] Opening socket connection to server /127.0.0.1:2181
      2014-04-29 19:38:35 o.a.z.ClientCnxn [INFO] Socket connection established to localhost/127.0.0.1:2181, initiating session
      2014-04-29 19:38:35 o.a.z.ClientCnxn [INFO] Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x145a7cc1c7e48b1, negotiated timeout = 20000
      2014-04-29 19:38:35 b.s.d.supervisor [INFO] Starting supervisor with id 71b01216-9d00-4fb6-8538-6673058ab5ef at host storm
      2014-04-29 19:38:36 b.s.event [ERROR] Error when processing event
      java.lang.RuntimeException: java.io.EOFException
              at backtype.storm.utils.Utils.deserialize(Utils.java:86) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
              at backtype.storm.utils.LocalState.snapshot(LocalState.java:45) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
              at backtype.storm.utils.LocalState.get(LocalState.java:56) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
              at backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:207) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
              at clojure.lang.AFn.applyToHelper(AFn.java:161) ~[clojure-1.4.0.jar:na]
              at clojure.lang.AFn.applyTo(AFn.java:151) ~[clojure-1.4.0.jar:na]
              at clojure.core$apply.invoke(core.clj:603) ~[clojure-1.4.0.jar:na]
              at clojure.core$partial$fn__4070.doInvoke(core.clj:2343) ~[clojure-1.4.0.jar:na]
              at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.4.0.jar:na]
              at backtype.storm.event$event_manager$fn__2593.invoke(event.clj:39) ~[na:na]
              at clojure.lang.AFn.run(AFn.java:24) ~[clojure-1.4.0.jar:na]
              at java.lang.Thread.run(Thread.java:724) ~[na:1.7.0_25]
      Caused by: java.io.EOFException: null
              at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323) ~[na:1.7.0_25]
              at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2792) ~[na:1.7.0_25]
              at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:799) ~[na:1.7.0_25]
              at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299) ~[na:1.7.0_25]
              at backtype.storm.utils.Utils.deserialize(Utils.java:81) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
              ... 11 common frames omitted
      2014-04-29 19:38:36 b.s.util [INFO] Halting process: ("Error when processing an event")
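
      For illustration only: the "Caused by" frames above show the EOFException being thrown from ObjectInputStream.readStreamHeader, which is what plain Java deserialization does when the input is too short to contain the stream header (for example, a zero-length file). The sketch below reproduces that behaviour with an empty byte array. The class and method names (EmptyStateRepro, failsWithEof) are hypothetical and are not part of Storm; whether the state file on the affected host was actually empty is an assumption, not something the log proves.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

// Hypothetical reproduction class; not Storm code.
public class EmptyStateRepro {

    /** Plain Java deserialization of a byte array. */
    static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
        // The ObjectInputStream constructor reads the stream header first,
        // so input that is too short fails here, before readObject() is reached.
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    /** Plain Java serialization of an object to a byte array. */
    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(o);
        }
        return bos.toByteArray();
    }

    /** Returns true when deserializing the bytes ends in an EOFException. */
    static boolean failsWithEof(byte[] data) throws IOException, ClassNotFoundException {
        try {
            deserialize(data);
            return false;
        } catch (EOFException e) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] good = serialize(new HashMap<String, String>());
        System.out.println("valid bytes fail with EOF: " + failsWithEof(good));
        System.out.println("empty bytes fail with EOF: " + failsWithEof(new byte[0]));
    }
}
```

      Running main should report that the valid bytes deserialize and that the empty input fails with EOFException, matching the innermost frames of the trace.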
      

      Current workaround: fully stop the supervisor daemon and delete everything under Storm's data/supervisor directory. After restarting, the supervisor runs smoothly again.
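
      Before deleting anything, it may help to see which files cannot be read. The following is a minimal sketch, assuming the state files are written with plain Java serialization (as the stack trace suggests). It walks a directory given on the command line and lists regular files that do not begin with a valid Java serialization stream header. StateFileCheck and its methods are hypothetical names, not Storm APIs, and the directory may also hold files that are not serialized objects, so a flagged file is only a hint.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical diagnostic helper; not part of Storm.
public class StateFileCheck {

    /** True if the file starts with a readable Java serialization header. */
    static boolean hasValidHeader(Path file) {
        // Constructing ObjectInputStream reads and validates the header;
        // empty or truncated files raise an IOException (EOFException included).
        try (InputStream raw = Files.newInputStream(file);
             ObjectInputStream in = new ObjectInputStream(raw)) {
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    /** Lists regular files under dir whose header cannot be read. */
    static List<Path> suspectFiles(Path dir) throws IOException {
        List<Path> suspects = new ArrayList<>();
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.filter(Files::isRegularFile)
                 .filter(p -> !hasValidHeader(p))
                 .forEach(suspects::add);
        }
        return suspects;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: directory to inspect, e.g. the supervisor's local state directory.
        Path dir = Paths.get(args[0]);
        for (Path p : suspectFiles(dir)) {
            System.out.println("unreadable: " + p);
        }
    }
}
```

      The helper only reads files and never deletes them; removing the directory remains a manual step, done with the supervisor stopped.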

      Here are some references to very similar issues:

      Regards,

      Attachments

        1. supeof.tar.bz2
          0.6 kB
          Simon Cooper

          People

            Assignee: wurstmeister Thomas Becker
            Reporter: drazzib Damien Raude-Morvan
            Votes: 3
            Watchers: 8
