Uploaded image for project: 'Apache Curator'
  1. Apache Curator
  2. CURATOR-209

Background retry falls into infinite loop of reconnection after connection loss

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.6.0
    • Fix Version/s: 2.10.0
    • Component/s: Framework
    • Environment:
      sun java jdk 1.7.0_55, curator 2.6.0, zookeeper 3.3.6 on AWS EC2 in a 3 box ensemble

      Description

      We've been unable to replicate this in our test environments, but approximately once a week in production (~50 machine cluster using curator/zk for service discovery) we will get a machine falling into a loop and spewing tens of thousands of errors that look like:

      Background operation retry gave uporg.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
      at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:695) [curator-framework-2.6.0.jar:na]
      at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:496) [curator-framework-2.6.0.jar:na]
      at org.apache.curator.framework.imps.CreateBuilderImpl.sendBackgroundResponse(CreateBuilderImpl.java:538) [curator-framework-2.6.0.jar:na]
      at org.apache.curator.framework.imps.CreateBuilderImpl.access$700(CreateBuilderImpl.java:44) [curator-framework-2.6.0.jar:na]
      at org.apache.curator.framework.imps.CreateBuilderImpl$6.processResult(CreateBuilderImpl.java:497) [curator-framework-2.6.0.jar:na]
      at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:605) [zookeeper-3.4.6.jar:3.4.6-1569965]
      at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) [zookeeper-3.4.6.jar:3.4.6-1569965]
      

      The rate at which we get these errors seems to increase linearly until we stop the process (starts at 10-20/sec, when we kill the box it's typically generating 1,000+/sec)

      When the error first occurs, there's a slightly different stack trace:

      Background operation retry gave uporg.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
      at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:695) [curator-framework-2.6.0.jar:na]
      at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:813) [curator-framework-2.6.0.jar:na]
      at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:779) [curator-framework-2.6.0.jar:na]
      at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$400(CuratorFrameworkImpl.java:58) [curator-framework-2.6.0.jar:na]
      at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:265) [curator-framework-2.6.0.jar:na]
      at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_55]
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_55]
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
      at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
      

      followed very closely by:

      Background retry gave uporg.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
      at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:796) [curator-framework-2.6.0.jar:na]
      at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:779) [curator-framework-2.6.0.jar:na]
      at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$400(CuratorFrameworkImpl.java:58) [curator-framework-2.6.0.jar:na]
      at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:265) [curator-framework-2.6.0.jar:na]
      at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_55]
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_55]
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
      at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
      

      After which it begins spewing the stack trace I first posted above. We're assuming that some sort of networking hiccup is occurring in EC2 that's causing the ConnectionLoss, which seems entirely momentary (none of our other boxes see it, and when we check the box it can connect to all the zk servers without any issues.)

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              Nakano Ryan Anderson
            • Votes:
              4 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: