Apache Storm / STORM-1114

Race condition in Trident ZooKeeper zk-node create/delete


Details

    Description

      In production, some Trident topologies hit a bug in which workers try to create a zk-node that already exists, or delete a zk-node that has already been deleted. This causes the worker process to die.

      We investigated the problem and found a race condition in the zk-node create and delete code of Trident's TransactionalState.
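To illustrate the kind of interleaving that can produce this failure, here is a minimal sketch. It assumes a check-then-create pattern, which this issue does not spell out. FakeZk, its NodeExistsException, and replayInterleaving are hypothetical in-memory stand-ins, not the ZooKeeper or Curator API. The two "workers" are replayed step by step in a single thread so the outcome is deterministic.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration only: an in-memory stand-in for ZooKeeper,
// not the real client and not the actual TransactionalState code.
class CheckThenCreateRace {

    // Stand-in for KeeperException.NodeExistsException.
    static class NodeExistsException extends RuntimeException {
        NodeExistsException(String path) {
            super("NodeExists for " + path);
        }
    }

    static class FakeZk {
        private final ConcurrentHashMap<String, byte[]> nodes = new ConcurrentHashMap<>();

        boolean exists(String path) {
            return nodes.containsKey(path);
        }

        // Like a ZooKeeper create, this fails if the node is already there.
        void create(String path, byte[] data) {
            if (nodes.putIfAbsent(path, data) != null) {
                throw new NodeExistsException(path);
            }
        }
    }

    /** Replays the interleaving; returns true if the second create failed. */
    static boolean replayInterleaving(String path) {
        FakeZk zk = new FakeZk();
        boolean seenByA = zk.exists(path); // worker A checks: node absent
        boolean seenByB = zk.exists(path); // worker B checks: node absent
        if (!seenByA) {
            zk.create(path, new byte[0]);  // worker A creates the node
        }
        try {
            if (!seenByB) {
                zk.create(path, new byte[0]); // worker B acts on a stale check
            }
            return false;
        } catch (NodeExistsException e) {
            return true; // the failure mode reported in this issue
        }
    }

    public static void main(String[] args) {
        System.out.println("second create failed: "
                + replayInterleaving("/ignoreStoredMetadata"));
    }
}
```

The same reasoning applies in reverse to delete: two workers that both see the node present can both issue a delete, and the second one fails because the node is already gone.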

      Failure stack traces from worker.log:

      Caused by: org.apache.storm.shade.org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /ignoreStoredMetadata
              at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:119) ~[storm-core-0.10.1.y.jar:0.10.1.y]
              at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[storm-core-0.10.1.y.jar:0.10.1.y]
              at org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783) ~[storm-core-0.10.1.y.jar:0.10.1.y]
              at org.apache.storm.shade.org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:676) ~[storm-core-0.10.1.y.jar:0.10.1.y]
              at org.apache.storm.shade.org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:660) ~[storm-core-0.10.1.y.jar:0.10.1.y]
              at org.apache.storm.shade.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) ~[storm-core-0.10.1.y.jar:0.10.1.y]
              at org.apache.storm.shade.org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:656) ~[storm-core-0.10.1.y.jar:0.10.1.y]
              at org.apache.storm.shade.org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:441) ~[storm-core-0.10.1.y.jar:0.10.1.y]
              at org.apache.storm.shade.org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:431) ~[storm-core-0.10.1.y.jar:0.10.1.y]
              at org.apache.storm.shade.org.apache.curator.framework.imps.CreateBuilderImpl$3.forPath(CreateBuilderImpl.java:239) ~[storm-core-0.10.1.y.jar:0.10.1.y]
              at org.apache.storm.shade.org.apache.curator.framework.imps.CreateBuilderImpl$3.forPath(CreateBuilderImpl.java:193) ~[storm-core-0.10.1.y.jar:0.10.1.y]
              at storm.trident.topology.state.TransactionalState.forPath(TransactionalState.java:83) ~[storm-core-0.10.1.y.jar:0.10.1.y]
              at storm.trident.topology.state.TransactionalState.createNode(TransactionalState.java:100) ~[storm-core-0.10.1.y.jar:0.10.1.y]
              at storm.trident.topology.state.TransactionalState.setData(TransactionalState.java:115) ~[storm-core-0.10.1.y.jar:0.10.1.y]
              ... 9 more
      2015-10-14 18:10:43.786 b.s.util [ERROR] Halting process: ("Worker died")
      
      Caused by: org.apache.storm.shade.org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /rainbowHdfsPath
              at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:111) ~[storm-core-0.10.1.y.jar:0.10.1.y]
              at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[storm-core-0.10.1.y.jar:0.10.1.y]
              at org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:873) ~[storm-core-0.10.1.y.jar:0.10.1.y]
              at org.apache.storm.shade.org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:239) ~[storm-core-0.10.1.y.jar:0.10.1.y]
              at org.apache.storm.shade.org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:234) ~[storm-core-0.10.1.y.jar:0.10.1.y]
              at org.apache.storm.shade.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) ~[storm-core-0.10.1.y.jar:0.10.1.y]
              at org.apache.storm.shade.org.apache.curator.framework.imps.DeleteBuilderImpl.pathInForeground(DeleteBuilderImpl.java:230) ~[storm-core-0.10.1.y.jar:0.10.1.y]
              at org.apache.storm.shade.org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:215) ~[storm-core-0.10.1.y.jar:0.10.1.y]
              at org.apache.storm.shade.org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:42) ~[storm-core-0.10.1.y.jar:0.10.1.y]
              at storm.trident.topology.state.TransactionalState.delete(TransactionalState.java:126) ~[storm-core-0.10.1.y.jar:0.10.1.y]
              ... 12 more
      2015-10-14 18:10:28.799 b.s.util [ERROR] Halting process: ("Worker died")
      java.lang.RuntimeException: ("Worker died")
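One common way to make such operations safe under concurrency is to treat "node already exists" on create and "no such node" on delete as benign outcomes instead of fatal errors. The sketch below shows that idea only; it is not necessarily the fix applied for this issue. FakeZk, createIfAbsent, and deleteIfPresent are hypothetical names, and the two exception classes merely stand in for KeeperException.NodeExistsException and KeeperException.NoNodeException.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration only: in-memory stand-ins, not the real
// ZooKeeper/Curator classes and not the actual TransactionalState code.
class TolerantZkOps {

    static class NodeExistsException extends Exception {
        NodeExistsException(String path) { super("NodeExists for " + path); }
    }

    static class NoNodeException extends Exception {
        NoNodeException(String path) { super("NoNode for " + path); }
    }

    static class FakeZk {
        private final ConcurrentHashMap<String, byte[]> nodes = new ConcurrentHashMap<>();

        void create(String path, byte[] data) throws NodeExistsException {
            if (nodes.putIfAbsent(path, data) != null) {
                throw new NodeExistsException(path);
            }
        }

        void delete(String path) throws NoNodeException {
            if (nodes.remove(path) == null) {
                throw new NoNodeException(path);
            }
        }

        boolean exists(String path) {
            return nodes.containsKey(path);
        }
    }

    /** Returns true if this call created the node, false if it already existed. */
    static boolean createIfAbsent(FakeZk zk, String path, byte[] data) {
        try {
            zk.create(path, data);
            return true;
        } catch (NodeExistsException e) {
            return false; // another worker won the race; the node is there
        }
    }

    /** Returns true if this call deleted the node, false if it was already gone. */
    static boolean deleteIfPresent(FakeZk zk, String path) {
        try {
            zk.delete(path);
            return true;
        } catch (NoNodeException e) {
            return false; // another worker already deleted it
        }
    }

    public static void main(String[] args) {
        FakeZk zk = new FakeZk();
        System.out.println(createIfAbsent(zk, "/ignoreStoredMetadata", new byte[0]));
        System.out.println(createIfAbsent(zk, "/ignoreStoredMetadata", new byte[0]));
        System.out.println(deleteIfPresent(zk, "/rainbowHdfsPath"));
    }
}
```

With this shape, losing the race no longer halts the worker. A real change would still have to decide what to do after a lost create, for example whether the loser should overwrite the existing node's data.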
      

          People

            ptgoetz P. Taylor Goetz
            zhuoliu Zhuo Liu
              Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0h
                Time Spent: 50m
