Details
Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Fix Version/s: 0.9.2-incubating, 0.9.3, 0.9.4
Labels: None
Description
https://github.com/nathanmarz/storm/issues/438
Hi developers,
We hit a critical issue deploying a Storm topology to our production cluster.
After deploying the topology, we got the following trace on the workers (Storm 0.8.2 / ZooKeeper 3.3.6):
2013-01-14 10:57:39 ZooKeeper [INFO] Initiating client connection, connectString=zookeeper1.company.com:2181,zookeeper2.company.com:2181,zookeeper3.company.com:2181 sessionTimeout=20000 watcher=com.netflix.curator.ConnectionState@254ba9a2
2013-01-14 10:57:39 ClientCnxn [INFO] Opening socket connection to server zookeeper1.company.com/10.72.209.112:2181
2013-01-14 10:57:39 ClientCnxn [INFO] Socket connection established to zookeeper1.company.com/10.72.209.112:2181, initiating session
2013-01-14 10:57:39 ClientCnxn [INFO] Session establishment complete on server zookeeper1.company.com/10.72.209.112:2181, sessionid = 0x13b3e4b5c780239, negotiated timeout = 20000
2013-01-14 10:57:39 zookeeper [INFO] Zookeeper state update: :connected:none
2013-01-14 10:57:39 ZooKeeper [INFO] Session: 0x13b3e4b5c780239 closed
2013-01-14 10:57:39 ClientCnxn [INFO] EventThread shut down
2013-01-14 10:57:39 CuratorFrameworkImpl [INFO] Starting
2013-01-14 10:57:39 ZooKeeper [INFO] Initiating client connection, connectString=zookeeper1.company.com:2181,zookeeper2.company.com:2181,zookeeper3.company.com:2181/storm sessionTimeout=20000 watcher=com.netflix.curator.ConnectionState@33a998c7
2013-01-14 10:57:39 ClientCnxn [INFO] Opening socket connection to server zookeeper1.company.com/10.72.209.112:2181
2013-01-14 10:57:39 ClientCnxn [INFO] Socket connection established to zookeeper1.company.com/10.72.209.112:2181, initiating session
2013-01-14 10:57:39 ClientCnxn [INFO] Session establishment complete on server zookeeper1.company.com/10.72.209.112:2181, sessionid = 0x13b3e4b5c78023a, negotiated timeout = 20000
2013-01-14 10:57:39 worker [ERROR] Error on initialization of server mk-worker
java.io.FileNotFoundException: File '/tmp/storm/supervisor/stormdist/normalization-prod-1-1358161053/stormconf.ser' does not exist
at org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:137)
at org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1135)
at backtype.storm.config$read_supervisor_storm_conf.invoke(config.clj:138)
at backtype.storm.daemon.worker$worker_data.invoke(worker.clj:146)
at backtype.storm.daemon.worker$fn_4348$exec_fn1228auto___4349.invoke(worker.clj:332)
at clojure.lang.AFn.applyToHelper(AFn.java:185)
at clojure.lang.AFn.applyTo(AFn.java:151)
at clojure.core$apply.invoke(core.clj:601)
at backtype.storm.daemon.worker$fn_4348$mk_worker_4404.doInvoke(worker.clj:323)
at clojure.lang.RestFn.invoke(RestFn.java:512)
at backtype.storm.daemon.worker$_main.invoke(worker.clj:433)
at clojure.lang.AFn.applyToHelper(AFn.java:172)
at clojure.lang.AFn.applyTo(AFn.java:151)
at backtype.storm.daemon.worker.main(Unknown Source)
2013-01-14 10:57:39 util [INFO] Halting process: ("Error on initialization")
Supervisor trace:
2013-01-14 10:59:01 supervisor [INFO] d6735377-f0d6-4247-9f35-c8620e2b0e26 still hasn't started
2013-01-14 10:59:02 supervisor [INFO] d6735377-f0d6-4247-9f35-c8620e2b0e26 still hasn't started
.......
2013-01-14 10:59:34 supervisor [INFO] d6735377-f0d6-4247-9f35-c8620e2b0e26 still hasn't started
2013-01-14 10:59:35 supervisor [INFO] Worker d6735377-f0d6-4247-9f35-c8620e2b0e26 failed to start
2013-01-14 10:59:35 supervisor [INFO] Worker 234264c6-d9d6-4e8a-ab0a-8926bdd6b536 failed to start
2013-01-14 10:59:35 supervisor [INFO] Shutting down and clearing state for id 234264c6-d9d6-4e8a-ab0a-8926bdd6b536. Current supervisor time: 1358161175. State: :disallowed, Heartbeat: nil
2013-01-14 10:59:35 supervisor [INFO] Shutting down d5c3235f-5880-4be8-a759-5654b3df6a27:234264c6-d9d6-4e8a-ab0a-8926bdd6b536
2013-01-14 10:59:35 util [INFO] Error when trying to kill 4819. Process is probably already dead.
2013-01-14 10:59:35 supervisor [INFO] Shut down d5c3235f-5880-4be8-a759-5654b3df6a27:234264c6-d9d6-4e8a-ab0a-8926bdd6b536
2013-01-14 10:59:35 supervisor [INFO] Shutting down and clearing state for id d6735377-f0d6-4247-9f35-c8620e2b0e26. Current supervisor time: 1358161175. State: :disallowed, Heartbeat: nil
2013-01-14 10:59:35 supervisor [INFO] Shutting down d5c3235f-5880-4be8-a759-5654b3df6a27:d6735377-f0d6-4247-9f35-c8620e2b0e26
2013-01-14 10:59:35 util [INFO] Error when trying to kill 4809. Process is probably already dead.
2013-01-14 10:59:35 supervisor [INFO] Shut down d5c3235f-5880-4be8-a759-5654b3df6a27:d6735377-f0d6-4247-9f35-c8620e2b0e26
Thanks!
Oleg M.
----------
xiaokang: We also encountered this problem, and the DEBUG log showed that storm-code-map was null while new-assignment was not null. It may be that new-assignment is read from ZooKeeper after storm-code-map. So we changed the code to read storm-code-map after all-assignment, and the problem disappeared (a sketch of the reordering follows the snippet below).
(defn mk-synchronize-supervisor [supervisor sync-processes event-manager processes-event-manager]
(fn this []
(let [conf (:conf supervisor)
storm-cluster-state (:storm-cluster-state supervisor)
^ISupervisor isupervisor (:isupervisor supervisor)
^LocalState local-state (:local-state supervisor)
sync-callback (fn [& ignored] (.add event-manager this))
storm-code-map (read-storm-code-locations storm-cluster-state sync-callback)
downloaded-storm-ids (set (read-downloaded-storm-ids conf))
all-assignment (read-assignments
storm-cluster-state
(:supervisor-id supervisor)
sync-callback)
new-assignment (->> all-assignment
(filter-key #(.confirmAssigned isupervisor %)))
assigned-storm-ids (assigned-storm-ids-from-port-assignments new-assignment)
existing-assignment (.get local-state LS-LOCAL-ASSIGNMENTS)]
(log-debug "Synchronizing supervisor")
(log-debug "Storm code map: " storm-code-map)
(log-debug "Downloaded storm ids: " downloaded-storm-ids)
(log-debug "All assignment: " all-assignment)
(log-debug "New assignment: " new-assignment)
----------
NJtwentyone: I ran into a similar problem once. Haven't had it lately. I was going to investigate more before I created a post, but... There are two schedule-recurring calls of interest in the mk-supervisor function.
(schedule-recurring (:timer supervisor) 0 10 (fn ))
(schedule-recurring (:timer supervisor)
0
(conf SUPERVISOR-MONITOR-FREQUENCY-SECS)
(fn )))
So function-1 (synchronize-supervisor) will eventually remove that dir [../stormconf.ser],
and function-2 (sync-processes) will eventually hang trying to create workers, saying repeatedly:
2013-01-13 22:13:13 b.s.d.supervisor [INFO] 6f73facd-8722-4b83-959c-a7b396c61224 still hasn't started
I'm using a distributed setup (Storm 0.8.2 / ZooKeeper 3.4.5) on a Mac and an HP machine, running storm.starter.WordCountTopology.
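For context, here is a hedged sketch of how those two recurring timers appear to be wired in mk-supervisor. The (fn ) bodies above were elided; the bodies below are my assumption (based on the 0.8.x supervisor) that each tick simply enqueues the corresponding sync function on its event manager, so the two loops run independently of each other:
;; assumption: each timer tick only enqueues the sync function on an event manager
(schedule-recurring (:timer supervisor) 0 10
                    (fn [] (.add event-manager synchronize-supervisor)))
(schedule-recurring (:timer supervisor) 0 (conf SUPERVISOR-MONITOR-FREQUENCY-SECS)
                    (fn [] (.add processes-event-manager sync-processes)))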
----------
miggi: Well, this problem was resolved only after installing a new ZooKeeper instance (3.3.6). We had other issues deploying topologies on the 0.8.1 version, so it was decided to use a new ZK instance.
Btw: removing the Storm data folders didn't help.
----------
nathanmarz: For those of you hitting the problem, it would be really useful if you could turn on DEBUG logging on the supervisor and show those logs when the problem is hit.
----------
miggi: After a few months of stable operation, we ran into the same issue again.
Logs ("topology.debug: true")
Supervisor logs: http://cl.ly/0p2W3N0A2V3S
Worker logs: http://cl.ly/2w1j1r0a3X46
Cleaning/restarting the ZK/Nimbus/Worker components didn't help.
We're down now: we haven't tried the solution proposed by xiaokang yet, but that will probably be our next step.
Any fresh ideas?
----------
devoncrouse: +1 - ran into this twice now. We've "reset" things by clearing out the Storm state in ZK, wiping out the Storm data directory, and restarting the topology, but it happens again in time.
Storm 0.9.0-wip16
Zk 3.3.6
----------
d2r: +1, seen this happen many times, same work-around.
Most recently Storm 0.8.2-wip22.
I'll try to collect debug logs from the supervisor and have a look.
----------
miggi: Hey guys,
try restarting your cluster and ensure that all previous workers (Java processes) were killed after this operation.
Also, we've updated our cluster to Storm 0.8.3-wip1, which includes a possible fix.
Hope it helps
--
Oleg M.
----------
devoncrouse: A simple restart doesn't cut it; you have to totally clear out all Storm state in ZK and the data directory (otherwise the cluster quickly becomes unbalanced again with the same errors in the logs) - and again, we're on 0.9.0-wip16.
----------
xiaokang: We encountered the same problem in our production cluster. The log showed that a worker died and the supervisor and nimbus both noticed it at the same time. Then nimbus assigned the worker to another host, and the supervisor's sync-supervisor deleted the topology's code while sync-processes was trying to restart the dead, timed-out worker.
Our solution was to change the supervisor's worker timeout from 30 to 5 seconds while nimbus stays at 30. It works.
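Assuming this maps to the standard settings (my assumption; the exact keys aren't named in the thread), the change would look like this in the supervisor's storm.yaml, leaving the nimbus-side timeout at its default:
# assumption: these are the settings meant by "supervisor's worker timeout" and "nimbus remain 30"
supervisor.worker.timeout.secs: 5
nimbus.task.timeout.secs: 30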
----------
d2r: We have tried the same work-around from @xiaokang, but we still see the issue.
----------
nathanmarz: @d2r This is fixed in the 0.8.3 wip, please try that out and let me know if it fixes the issue for you.
----------
d2r: @nathanmarz Yes, pulling in the fix to supervisor.clj from 0.8.3 did resolve the issue for us. Thanks!
----------
devoncrouse: Is this fix going to be applied to the newer builds (e.g. 0.9.0) as well?
d5207b5
----------
viceyang: We encountered the same problem; our version is 0.8.3. I read the supervisor code, and it does not seem to be a bug. The key reason is that "synchronize-supervisor", the thread responsible for downloading and removing files, and "sync-processes", the thread responsible for starting worker processes, run asynchronously.
Consider this case: synchronize-supervisor reads the assignment information from ZK, downloads the necessary files from nimbus, and writes the local state. In another thread, the sync-processes function reads the local state to launch a worker process. If, before that worker process has started, synchronize-supervisor is called again and the topology's assignment has changed (caused by a rebalance, a worker timeout, etc.) so that the worker assigned to this supervisor has moved to another supervisor, then synchronize-supervisor removes the now-unnecessary files (the jar file, the ser file, etc.). After this, when the worker is launched by sync-processes, the ser file no longer exists and this issue occurs.
So although an exception occurs here, the logic seems right; the exception cannot be completely eliminated unless the asynchrony between the two threads is changed.
There are some ways to reduce how often the exception occurs. @xiaokang mentioned "change supervisor's worker timeout from 30 to 5 seconds and nimbus remain 30", which reduces how often nimbus reassigns the topology. Another way is to change the "synchronize-supervisor" loop period to something longer than the default 10 seconds, such as 30 seconds.
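Since that 10-second period appears to be hard-coded in mk-supervisor (see NJtwentyone's snippet earlier in this thread), the second mitigation would presumably mean patching that schedule-recurring call, along these lines:
;; sketch: lengthen the synchronize-supervisor period from the hard-coded 10s to 30s
(schedule-recurring (:timer supervisor) 0 30
                    (fn [] (.add event-manager synchronize-supervisor)))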