  1. Apache Storm
  2. STORM-2400

Intermittent failure in nimbus because of errors from LeaderLatch#getLeader()

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0, 1.1.0
    • Component/s: None
    • Labels:
      None

      Description

      This issue has been reported to Curator as CURATOR-358.

      org.apache.curator.framework.recipes.leader.LeaderLatch#getLeader() intermittently throws a KeeperException with Code#NONODE, as shown in the stack trace below. A participant's ephemeral ZK node may have been removed because its connection/session was closed.

      The relevant code is at https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L451

      public Participant getLeader() throws Exception
      { 
        Collection<String> participantNodes = LockInternals.getParticipantNodes(client, latchPath, LOCK_NAME, sorter); 
        return LeaderSelector.getLeader(client, participantNodes); 
      }
      

      This looks like a race condition: the list of participant nodes is retrieved, but by the time LeaderSelector#getLeader() reads a node, that node has already been removed because of a session timeout, so the call throws a KeeperException with the NoNode code. The call is not retried because RetryLoop retries only on connection/session timeouts, although NoNode should have been retried in this case. I could not find any API on the Curator client to configure which KeeperException codes are retried. It would be good to have a way to specify the retryable error codes through the org.apache.curator.framework.CuratorFrameworkFactory.Builder APIs.
      The intermittent exception was found with this stack trace:

      2016-11-15 06:09:33.954 o.a.s.d.nimbus [ERROR] Error when processing event
      org.apache.storm.shade.org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /storm/leader-lock/_c_97c09eed-5bba-4ac8-a05f-abdc4e8e95cf-latch-0000000002
      at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
      at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
      at org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
      at org.apache.storm.shade.org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:304)
      at org.apache.storm.shade.org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:293)
      at org.apache.storm.shade.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:108)
      at org.apache.storm.shade.org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:290)
      at org.apache.storm.shade.org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:281)
      at org.apache.storm.shade.org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:42)
      at org.apache.storm.shade.org.apache.curator.framework.recipes.leader.LeaderSelector.participantForPath(LeaderSelector.java:375)
      at org.apache.storm.shade.org.apache.curator.framework.recipes.leader.LeaderSelector.getLeader(LeaderSelector.java:346)
      at org.apache.storm.shade.org.apache.curator.framework.recipes.leader.LeaderLatch.getLeader(LeaderLatch.java:454)
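
      As a caller-side illustration of the retry behaviour described above, the following is a minimal sketch of a wrapper that retries a call only when it fails with a chosen exception type. It is not the fix applied in Storm or Curator. The names NoNodeRetry and callWithRetry are hypothetical, and the nested NoNodeException is a stand-in for org.apache.zookeeper.KeeperException.NoNodeException so that the sketch compiles without ZooKeeper on the classpath.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Curator or Storm.
final class NoNodeRetry {
    private NoNodeRetry() {}

    // Stand-in for KeeperException.NoNodeException (assumption, for illustration only).
    public static class NoNodeException extends Exception {
        public NoNodeException(String path) {
            super("KeeperErrorCode = NoNode for " + path);
        }
    }

    // Invokes the call up to maxAttempts times. Only exceptions that are
    // instances of retryable trigger another attempt; anything else is
    // rethrown immediately. If every attempt fails with a retryable
    // exception, the last one is rethrown.
    public static <T> T callWithRetry(Callable<T> call,
                                      Class<? extends Exception> retryable,
                                      int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                if (!retryable.isInstance(e)) {
                    throw e; // not the race we are guarding against
                }
                last = e; // participant node vanished; try again
            }
        }
        throw last;
    }
}
```

      With Curator on the classpath, a caller could presumably pass leaderLatch::getLeader as the call and KeeperException.NoNodeException.class as the retryable type. Because each attempt re-reads the participant list, a node that disappeared between the two steps would no longer be selected on the next attempt.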
      

              People

              • Assignee:
                Satish Duggana
                Reporter:
                Satish Duggana
              • Votes:
                0
                Watchers:
                2

                  Time Tracking

                  Estimated:
                  Not Specified
                  Remaining:
                  0h
                  Logged:
                  1h