Details
Description
We have a microservice-based software stack that starts its core services relatively simultaneously. For this we rely on Docker and Rancher.
We noticed that occasionally the Apache Storm Nimbus, that relies on Apache Curator's LeaderLatch, cannot agree on a leader. This comes due a missing ZNode:
org.apache.storm.shade.org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /storm/leader-lock at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:111) ~[storm-core-1.1.1.jar:1.1.1] at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[storm-core-1.1.1.jar:1.1.1] at org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1590) ~[storm-core-1.1.1.jar:1.1.1] at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:230) ~[storm-core-1.1.1.jar:1.1.1] at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:219) ~[storm-core-1.1.1.jar:1.1.1] at org.apache.storm.shade.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109) ~[storm-core-1.1.1.jar:1.1.1] at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:216) ~[storm-core-1.1.1.jar:1.1.1] at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:207) ~[storm-core-1.1.1.jar:1.1.1] at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:40) ~[storm-core-1.1.1.jar:1.1.1] at org.apache.storm.shade.org.apache.curator.framework.recipes.locks.LockInternals.getSortedChildren(LockInternals.java:151) ~[storm-core-1.1.1.jar:1.1.1] at org.apache.storm.shade.org.apache.curator.framework.recipes.locks.LockInternals.getParticipantNodes(LockInternals.java:133) ~[storm-core-1.1.1.jar:1.1.1] at org.apache.storm.shade.org.apache.curator.framework.recipes.leader.LeaderLatch.getLeader(LeaderLatch.java:453) ~[storm-core-1.1.1.jar:1.1.1] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_131] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_131] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_131] at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_131] at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93) ~[clojure-1.7.0.jar:?] at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:313) ~[clojure-1.7.0.jar:?] at org.apache.storm.zookeeper$zk_leader_elector$reify__1043.getLeader(zookeeper.clj:296) ~[storm-core-1.1.1.jar:1.1.1] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_131] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_131] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_131] at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_131] at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93) ~[clojure-1.7.0.jar:?] at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:313) ~[clojure-1.7.0.jar:?] at org.apache.storm.daemon.nimbus$get_cluster_info.invoke(nimbus.clj:1544) ~[storm-core-1.1.1.jar:1.1.1] at org.apache.storm.daemon.nimbus$mk_reified_nimbus$reify__10780.getClusterInfo(nimbus.clj:2006) ~[storm-core-1.1.1.jar:1.1.1] at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3920) ~[storm-core-1.1.1.jar:1.1.1] at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3904) ~[storm-core-1.1.1.jar:1.1.1] at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[storm-core-1.1.1.jar:1.1.1] at org.apache.storm.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[storm-core-1.1.1.jar:1.1.1] at org.apache.storm.security.auth.SimpleTransportPlugin$SimpleWrapProcessor.process(SimpleTransportPlugin.java:162) ~[storm-core-1.1.1.jar:1.1.1] at org.apache.storm.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:518) ~[storm-core-1.1.1.jar:1.1.1] at org.apache.storm.thrift.server.Invocation.run(Invocation.java:18) ~[storm-core-1.1.1.jar:1.1.1] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131] 11:29:13.229 [timer] INFO org.apache.storm.daemon.nimbus - not a leader, skipping assignments 11:29:13.229 [timer] INFO org.apache.storm.daemon.nimbus - not a leader, skipping cleanup
even though a create request has been issued for this ZNode:
2017-10-06 11:27:04,387 [myid:3] - INFO [ProcessThread(sid:3 cport:-1)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x35ef170475e0001 type:create cxid:0x1 zxid:0x100000013 txntype:-1 reqpath:n/a Error Path:/storm/leader-lock Error:KeeperErrorCode = NoNode for /storm/leader-lock
Verifying the Zookeeper data via zkCli.sh shows that the ZNode is in fact not present:
[zk: localhost:2181(CONNECTED) 0] ls /storm [assignments, backpressure, nimbuses, logconfigs, storms, errors, supervisors, workerbeats, blobstore]
We noticed, that in such a case, the following log entry in the Nimbus log is missing
[Curator-Framework-0] WARN org.apache.storm.shade.org.apache.curator.utils.ZKPaths - The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.
and if everything works, the log entry is present.
We assume that the correct CreateMode for the ZNode someone cannot be determined (as "Container" ZNodes are only available in Zookeeper 3.5.1 onwards).
Please notice the startup phase in which several attempts to connect to Zookeeper are not successful until it is fully reachable.
Full logs are attached. Zookeeper is run as a 3-node cluster in this setup. Happens with a single node instance also.
Attachments
Attachments
Issue Links
- is superceded by
-
CURATOR-436 NoNodeException: KeeperErrorCode KeeperException in storm-nimbus
- Resolved