Accumulo / ACCUMULO-1268 add client wide timeout setting / ACCUMULO-1449

Connector/ZooCache code enters infinite loop when the ZooKeeper connection is lost.

    Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.7.0
    • Component/s: client
    • Labels:
      None
    • Environment:

      accumulo-1.5.0-RC4, zookeeper-3.4.5, hadoop-1.0.4, CentOS 6.4

      Description

      While using 1.5.0-RC4, a long-lived Connector went into an infinite loop of ZooKeeper "ConnectionLoss" and "Session expired" failures. In a multithreaded application whose threads all shared the same Connector, errors occurred whenever conn.createScanner() or conn.createBatchScanner() was called. Here are a couple of stack traces:

      2013-05-22 09:12:28,250 [zookeeper.ZooCache] WARN : Zookeeper error, will retry
      org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/5e982cc9-6959-4064-9712-2ff3dc1003d8
      	at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
      	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
      	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
      	at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:208)
      	at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:130)
      	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:233)
      	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:188)
      	at org.apache.accumulo.core.client.ZooKeeperInstance.getInstanceID(ZooKeeperInstance.java:151)
      	at org.apache.accumulo.core.zookeeper.ZooUtil.getRoot(ZooUtil.java:24)
      	at org.apache.accumulo.core.client.impl.Tables.getMap(Tables.java:46)
      	at org.apache.accumulo.core.client.impl.Tables.getNameToIdMap(Tables.java:78)
      	at org.apache.accumulo.core.client.impl.Tables.getTableId(Tables.java:64)
      	at org.apache.accumulo.core.client.impl.ConnectorImpl.getTableId(ConnectorImpl.java:75)
      	at org.apache.accumulo.core.client.impl.ConnectorImpl.createScanner(ConnectorImpl.java:137)
      
          
      2013-05-22 09:12:23,849 [zookeeper.ZooCache] WARN : Zookeeper error, will retry
      org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/5e982cc9-6959-4064-9712-2ff3dc1003d8
      	at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
      	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
      	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
      	at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:208)
      	at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:130)
      	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:233)
      	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:188)
      	at org.apache.accumulo.core.client.ZooKeeperInstance.getInstanceID(ZooKeeperInstance.java:151)
      	at org.apache.accumulo.core.zookeeper.ZooUtil.getRoot(ZooUtil.java:24)
      	at org.apache.accumulo.core.client.impl.Tables.getMap(Tables.java:46)
      	at org.apache.accumulo.core.client.impl.Tables.getNameToIdMap(Tables.java:78)
      	at org.apache.accumulo.core.client.impl.Tables.getTableId(Tables.java:64)
      	at org.apache.accumulo.core.client.impl.ConnectorImpl.getTableId(ConnectorImpl.java:75)
      	at org.apache.accumulo.core.client.impl.ConnectorImpl.createBatchScanner(ConnectorImpl.java:89)
      

      The method ZooCache.retry(ZooRunnable op) (ZooCache.java:128) has a while(true) loop that should probably have a maximum retry count or a timeout, so that the method eventually throws an exception the client can handle appropriately. As it stands, the loop never exits while ZooKeeper keeps returning errors.
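The bounded retry suggested above could look roughly like the following. This is only a sketch and not the actual ZooCache code: the class `BoundedRetry`, the exception `RetriesExhaustedException`, and the parameters `maxRetries` and `sleepMillis` are hypothetical names, and a plain `Callable` stands in for ZooRunnable.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: illustrates a retry loop that gives up, it is not Accumulo code.
final class BoundedRetry {

    /** Thrown once every attempt has failed; the cause is the last failure seen. */
    static final class RetriesExhaustedException extends RuntimeException {
        RetriesExhaustedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Runs op up to maxRetries times, sleeping sleepMillis between failed attempts.
     * Returns the first successful result, or throws RetriesExhaustedException.
     */
    static <T> T retry(Callable<T> op, int maxRetries, long sleepMillis) {
        if (maxRetries < 1) {
            throw new IllegalArgumentException("maxRetries must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                return op.call();
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop retrying.
                Thread.currentThread().interrupt();
                throw new RetriesExhaustedException("interrupted during attempt " + attempt, e);
            } catch (Exception e) {
                last = e; // remember the failure and fall through to the next attempt
            }
            if (attempt < maxRetries) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new RetriesExhaustedException("interrupted while waiting to retry", e);
                }
            }
        }
        throw new RetriesExhaustedException("gave up after " + maxRetries + " attempts", last);
    }
}
```

With a loop of this shape, a caller sees an exception after a bounded amount of time instead of blocking forever, and can decide whether to reconnect or fail.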

      Note: There may have been a network hiccup that triggered the bug, but there was no way to recover without restarting the application.
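Until the loop itself is bounded, an application might guard each blocking call with its own deadline. The following is a hedged sketch of such a client-side workaround, not something taken from Accumulo: `TimedCall` and `callWithTimeout` are hypothetical names. It only lets the caller stop waiting; if the underlying retry loop ignores interruption, the worker thread may keep spinning in the background, which is why the sketch uses a daemon thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical client-side guard: bounds how long the caller waits for a blocking call.
final class TimedCall {

    /**
     * Runs task on a separate daemon thread and waits at most the given timeout.
     * Throws TimeoutException if the task has not finished in time.
     */
    static <T> T callWithTimeout(Callable<T> task, long timeout, TimeUnit unit)
            throws TimeoutException, ExecutionException, InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread t = new Thread(runnable, "timed-call");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true); // best effort: interrupts the worker if it is still running
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

An application could then wrap a call such as conn.createScanner() in `TimedCall.callWithTimeout(...)` and treat a `TimeoutException` as a signal to report the failure rather than hang.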

        Activity

        Luke Brassard created issue -
        Christopher Tubbs made changes -
        Field Original Value New Value
        Affects Version/s 1.5.0 [ 12318645 ]
        Affects Version/s 1.5.1 [ 12324399 ]
        Christopher Tubbs made changes -
        Fix Version/s 1.5.1 [ 12324399 ]
        Christopher Tubbs made changes -
        Fix Version/s 1.6.0 [ 12322468 ]
        John Vines made changes -
        Parent ACCUMULO-1268 [ 12642129 ]
        Issue Type Bug [ 1 ] Sub-task [ 7 ]
        John Vines made changes -
        Fix Version/s 1.7.0 [ 12324607 ]
        Fix Version/s 1.6.0 [ 12322468 ]
        Fix Version/s 1.5.1 [ 12324399 ]

          People

          • Assignee:
            Unassigned
            Reporter:
            Luke Brassard
          • Votes:
            0
            Watchers:
            4
