Uploaded image for project: 'Apache Curator'
  1. Apache Curator
  2. CURATOR-435

ZNodes are created with incompatible CreateMode under racing conditions

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Duplicate
    • 2.12.0
    • None
    • Client, Recipes
    • None
    • Apache Storm 1.1.1 (uses Zookeeper 3.4.6 and Curator 2.12.0)
      Apache Zookeeper 3.4.10
      OpenJDK 1.8.0_131
      Debian 8
      Docker 17.09.0-ce
      Rancher 1.6.9

    Description

      We have a microservice-based software stack that starts its core services relatively simultaneously. For this we rely on Docker and Rancher.

      We noticed that occasionally the Apache Storm Nimbus, that relies on Apache Curator's LeaderLatch, cannot agree on a leader. This comes due a missing ZNode:

      org.apache.storm.shade.org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /storm/leader-lock
      	at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:111) ~[storm-core-1.1.1.jar:1.1.1]
      	at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[storm-core-1.1.1.jar:1.1.1]
      	at org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1590) ~[storm-core-1.1.1.jar:1.1.1]
      	at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:230) ~[storm-core-1.1.1.jar:1.1.1]
      	at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:219) ~[storm-core-1.1.1.jar:1.1.1]
      	at org.apache.storm.shade.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109) ~[storm-core-1.1.1.jar:1.1.1]
      	at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:216) ~[storm-core-1.1.1.jar:1.1.1]
      	at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:207) ~[storm-core-1.1.1.jar:1.1.1]
      	at org.apache.storm.shade.org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:40) ~[storm-core-1.1.1.jar:1.1.1]
      	at org.apache.storm.shade.org.apache.curator.framework.recipes.locks.LockInternals.getSortedChildren(LockInternals.java:151) ~[storm-core-1.1.1.jar:1.1.1]
      	at org.apache.storm.shade.org.apache.curator.framework.recipes.locks.LockInternals.getParticipantNodes(LockInternals.java:133) ~[storm-core-1.1.1.jar:1.1.1]
      	at org.apache.storm.shade.org.apache.curator.framework.recipes.leader.LeaderLatch.getLeader(LeaderLatch.java:453) ~[storm-core-1.1.1.jar:1.1.1]
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_131]
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_131]
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_131]
      	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_131]
      	at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93) ~[clojure-1.7.0.jar:?]
      	at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:313) ~[clojure-1.7.0.jar:?]
      	at org.apache.storm.zookeeper$zk_leader_elector$reify__1043.getLeader(zookeeper.clj:296) ~[storm-core-1.1.1.jar:1.1.1]
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_131]
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_131]
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_131]
      	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_131]
      	at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93) ~[clojure-1.7.0.jar:?]
      	at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:313) ~[clojure-1.7.0.jar:?]
      	at org.apache.storm.daemon.nimbus$get_cluster_info.invoke(nimbus.clj:1544) ~[storm-core-1.1.1.jar:1.1.1]
      	at org.apache.storm.daemon.nimbus$mk_reified_nimbus$reify__10780.getClusterInfo(nimbus.clj:2006) ~[storm-core-1.1.1.jar:1.1.1]
      	at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3920) ~[storm-core-1.1.1.jar:1.1.1]
      	at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3904) ~[storm-core-1.1.1.jar:1.1.1]
      	at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[storm-core-1.1.1.jar:1.1.1]
      	at org.apache.storm.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[storm-core-1.1.1.jar:1.1.1]
      	at org.apache.storm.security.auth.SimpleTransportPlugin$SimpleWrapProcessor.process(SimpleTransportPlugin.java:162) ~[storm-core-1.1.1.jar:1.1.1]
      	at org.apache.storm.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:518) ~[storm-core-1.1.1.jar:1.1.1]
      	at org.apache.storm.thrift.server.Invocation.run(Invocation.java:18) ~[storm-core-1.1.1.jar:1.1.1]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
      	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
      11:29:13.229 [timer] INFO  org.apache.storm.daemon.nimbus - not a leader, skipping assignments
      11:29:13.229 [timer] INFO  org.apache.storm.daemon.nimbus - not a leader, skipping cleanup
      

      even though a create request has been issued for this ZNode:

      2017-10-06 11:27:04,387 [myid:3] - INFO  [ProcessThread(sid:3 cport:-1)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x35ef170475e0001 type:create cxid:0x1 zxid:0x100000013 txntype:-1 reqpath:n/a Error Path:/storm/leader-lock Error:KeeperErrorCode = NoNode for /storm/leader-lock
      

      Verifying the Zookeeper data via zkCli.sh shows that the ZNode is in fact not present:

      [zk: localhost:2181(CONNECTED) 0] ls /storm
      [assignments, backpressure, nimbuses, logconfigs, storms, errors, supervisors, workerbeats, blobstore]
      

      We noticed, that in such a case, the following log entry in the Nimbus log is missing

      [Curator-Framework-0] WARN  org.apache.storm.shade.org.apache.curator.utils.ZKPaths - The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.
      

      and if everything works, the log entry is present.

      We assume that the correct CreateMode for the ZNode someone cannot be determined (as "Container" ZNodes are only available in Zookeeper 3.5.1 onwards).
      Please notice the startup phase in which several attempts to connect to Zookeeper are not successful until it is fully reachable.

      Full logs are attached. Zookeeper is run as a 3-node cluster in this setup. Happens with a single node instance also.

      Attachments

        1. zk_leader-lock_nimbus.log
          133 kB
          Daniel Klessing
        2. zk_leader-lock_SUCCESS_nimbus.log
          22 kB
          Daniel Klessing
        3. zk_leader-lock_SUCCESS_zookeeper.log
          7 kB
          Daniel Klessing
        4. zk_leader-lock_zookeeper-1.log
          20 kB
          Daniel Klessing
        5. zk_leader-lock_zookeeper-2.log
          14 kB
          Daniel Klessing
        6. zk_leader-lock_zookeeper-3.log
          57 kB
          Daniel Klessing

        Issue Links

          Activity

            People

              Unassigned Unassigned
              sakanaou Daniel Klessing
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: