Accumulo / ACCUMULO-1268 add client wide timeout setting / ACCUMULO-1449

Connector/ZooCache code enters infinite loop when Zookeeper connection lost.

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 1.5.0
    • Fix Version/s: None
    • Component/s: client
    • Labels:
      None
    • Environment:

      accumulo-1.5.0-RC4, zookeeper-3.4.5, hadoop-1.0.4, CentOS 6.4

      Description

      While using 1.5.0-RC4, a long-lived Connector went into an infinite loop of ZooKeeper "ConnectionLoss" and "Session expired" failures. In a multithreaded application where all threads share the same Connector, every call to conn.createScanner() and conn.createBatchScanner() produced errors. Here are a couple of stack traces:

      2013-05-22 09:12:28,250 [zookeeper.ZooCache] WARN : Zookeeper error, will retry
      org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/5e982cc9-6959-4064-9712-2ff3dc1003d8
      	at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
      	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
      	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
      	at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:208)
      	at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:130)
      	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:233)
      	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:188)
      	at org.apache.accumulo.core.client.ZooKeeperInstance.getInstanceID(ZooKeeperInstance.java:151)
      	at org.apache.accumulo.core.zookeeper.ZooUtil.getRoot(ZooUtil.java:24)
      	at org.apache.accumulo.core.client.impl.Tables.getMap(Tables.java:46)
      	at org.apache.accumulo.core.client.impl.Tables.getNameToIdMap(Tables.java:78)
      	at org.apache.accumulo.core.client.impl.Tables.getTableId(Tables.java:64)
      	at org.apache.accumulo.core.client.impl.ConnectorImpl.getTableId(ConnectorImpl.java:75)
      	at org.apache.accumulo.core.client.impl.ConnectorImpl.createScanner(ConnectorImpl.java:137)
      
          
      2013-05-22 09:12:23,849 [zookeeper.ZooCache] WARN : Zookeeper error, will retry
      org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/5e982cc9-6959-4064-9712-2ff3dc1003d8
      	at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
      	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
      	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
      	at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:208)
      	at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:130)
      	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:233)
      	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:188)
      	at org.apache.accumulo.core.client.ZooKeeperInstance.getInstanceID(ZooKeeperInstance.java:151)
      	at org.apache.accumulo.core.zookeeper.ZooUtil.getRoot(ZooUtil.java:24)
      	at org.apache.accumulo.core.client.impl.Tables.getMap(Tables.java:46)
      	at org.apache.accumulo.core.client.impl.Tables.getNameToIdMap(Tables.java:78)
      	at org.apache.accumulo.core.client.impl.Tables.getTableId(Tables.java:64)
      	at org.apache.accumulo.core.client.impl.ConnectorImpl.getTableId(ConnectorImpl.java:75)
      	at org.apache.accumulo.core.client.impl.ConnectorImpl.createBatchScanner(ConnectorImpl.java:89)
      

      The method ZooCache.retry(ZooRunnable op) (ZooCache.java:128) has a while(true) loop that should probably have a maximum retry count or a timeout, so that the method eventually throws an exception the client can handle appropriately. As written, the loop never exits while ZooKeeper keeps returning errors.
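A minimal sketch of the bounded loop suggested above. It is not the Accumulo implementation: `BoundedRetry`, its constructor parameters, and the 5-second backoff cap are hypothetical names and values chosen for illustration, and the operation is modelled as a plain `Callable` so the example compiles without ZooKeeper on the classpath.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Accumulo: retries an operation a limited
// number of times, then surfaces the last failure to the caller.
final class BoundedRetry {
  private final int maxAttempts;
  private final long initialSleepMillis;

  BoundedRetry(int maxAttempts, long initialSleepMillis) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    this.maxAttempts = maxAttempts;
    this.initialSleepMillis = initialSleepMillis;
  }

  <T> T run(Callable<T> op) throws InterruptedException {
    long sleep = initialSleepMillis;
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return op.call();
      } catch (InterruptedException e) {
        throw e; // never swallow interruption
      } catch (Exception e) {
        last = e; // remember the failure and try again
      }
      if (attempt < maxAttempts) {
        Thread.sleep(sleep);
        sleep = Math.min(sleep * 2, 5_000L); // exponential backoff, capped (assumed value)
      }
    }
    // Unlike while(true), the caller eventually sees the failure.
    throw new IllegalStateException("gave up after " + maxAttempts + " attempts", last);
  }
}
```

A caller would wrap the ZooKeeper read in `run(...)` and catch `IllegalStateException` to decide whether to rebuild its connection or report the outage.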

      Note: There may have been a network hiccup that triggered the bug, but there was no way to recover without restarting the application.

        Activity

        Transition: Open to Resolved
        Time in source status: 730d 23h 41m
        Execution times: 1
        Last executer: Josh Elser
        Last execution date: 23/May/15 19:05
        Josh Elser made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Fix Version/s 1.8.0 [ 12329879 ]
        Resolution Cannot Reproduce [ 5 ]
        Josh Elser added a comment -

        We haven't seen any more reports of this issue. I've made a few improvements to our ZooKeeper code since 1.5.0 specifically in this area. I'm not sure if it's been definitively addressed. Either way, client-wide timeouts can/should still be done in the parent.

        Josh Elser made changes -
        Fix Version/s 1.8.0 [ 12329879 ]
        Fix Version/s 1.7.0 [ 12324607 ]
        Josh Elser added a comment -

        It may be good to just pull up the recent changes I made to ZooReader/ZooUtil/ZooReaderWriter to use a Retry class. That would at least be a short-term fix as opposed to a sweeping change to add timeouts everywhere.

        John Vines made changes -
        Fix Version/s 1.7.0 [ 12324607 ]
        Fix Version/s 1.6.0 [ 12322468 ]
        Fix Version/s 1.5.1 [ 12324399 ]
        John Vines made changes -
        Parent ACCUMULO-1268 [ 12642129 ]
        Issue Type Bug [ 1 ] Sub-task [ 7 ]
        Keith Turner added a comment -

        It seems like this is a smaller part of a larger problem, ACCUMULO-1268. I think instead of doing a one-off fix here, this should be addressed in a comprehensive manner.

        If ZooCache is retrying for an unrecoverable exception, then I think it would be ok for it to just throw a runtime exception in that case. Seems like this could be done for the SessionExpiredException.
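The fail-fast idea in this comment can be sketched as follows. This is an illustration under stated assumptions, not ZooCache code: `ConnectionLossStandIn` and `SessionExpiredStandIn` are hypothetical stand-ins for the corresponding `KeeperException` subclasses so the example compiles without ZooKeeper, and `FailFastRetry` is an invented name. It only addresses the unrecoverable case; recoverable errors are still retried without limit here.

```java
import java.util.concurrent.Callable;

// Stand-ins (assumptions) for KeeperException.ConnectionLossException and
// KeeperException.SessionExpiredException.
class ConnectionLossStandIn extends Exception {}

class SessionExpiredStandIn extends Exception {}

// Hypothetical helper, not part of Accumulo.
final class FailFastRetry {
  private FailFastRetry() {}

  static <T> T retry(Callable<T> op, long sleepMillis) {
    while (true) {
      try {
        return op.call();
      } catch (SessionExpiredStandIn e) {
        // An expired session cannot recover on the same handle, so stop retrying.
        throw new IllegalStateException("ZooKeeper session expired", e);
      } catch (ConnectionLossStandIn e) {
        // The client may reconnect on its own; wait and try again.
        pause(sleepMillis);
      } catch (Exception e) {
        throw new IllegalStateException("unexpected failure", e);
      }
    }
  }

  private static void pause(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting to retry", e);
    }
  }
}
```

Throwing an unchecked exception keeps existing method signatures unchanged, which matches the suggestion that a runtime exception would be acceptable in this case.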

        John Vines added a comment -

        So, poking through the code, I think the best plan of action is to allow the KeeperException to eventually percolate out to indicate an error. It seems get and getChildren are the only methods which use it, so we could just have them return null, but I'm concerned about overloading return values like that. I'm thinking the least intrusive way to handle this is to switch ZooKeeperInstance to a new cache method which takes a timeout and can throw a KeeperException. This way the 60-100 or so usages of get and getChildren don't need to be updated to handle KeeperExceptions themselves.
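One way to picture this proposal is a cache with two accessors: the existing one, which never throws a checked exception, and a new overload with a deadline. The sketch below is an assumption-laden illustration, not the ZooCache API: `TimedCache`, `Fetcher`, and `LookupFailedException` (a stand-in for `KeeperException`) are hypothetical names.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

// Stand-in (assumption) for org.apache.zookeeper.KeeperException.
class LookupFailedException extends Exception {
  LookupFailedException(String message) {
    super(message);
  }
}

// Hypothetical abstraction over the ZooKeeper read.
interface Fetcher {
  byte[] fetch(String path) throws LookupFailedException;
}

// Hypothetical cache, not part of Accumulo.
final class TimedCache {
  private final Map<String, byte[]> cache = new ConcurrentHashMap<>();
  private final Fetcher fetcher;

  TimedCache(Fetcher fetcher) {
    this.fetcher = fetcher;
  }

  // Existing-style accessor: callers handle no checked exception; retries without limit.
  byte[] get(String path) {
    while (true) {
      try {
        return load(path);
      } catch (LookupFailedException e) {
        pause();
      }
    }
  }

  // New accessor: retries until the deadline, then lets the exception percolate out.
  byte[] get(String path, long timeout, TimeUnit unit) throws LookupFailedException {
    long deadline = System.nanoTime() + unit.toNanos(timeout);
    while (true) {
      try {
        return load(path);
      } catch (LookupFailedException e) {
        if (System.nanoTime() - deadline >= 0) {
          throw e;
        }
        pause();
      }
    }
  }

  private byte[] load(String path) throws LookupFailedException {
    byte[] cached = cache.get(path);
    if (cached != null) {
      return cached;
    }
    byte[] value = fetcher.fetch(path);
    if (value != null) {
      cache.put(path, value); // ConcurrentHashMap rejects null values
    }
    return value;
  }

  private static void pause() {
    try {
      Thread.sleep(5);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting to retry", e);
    }
  }
}
```

Only the caller that opts into the timed overload (ZooKeeperInstance in the comment above) has to handle the checked exception; everything that uses the original `get` is untouched.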

        Christopher Tubbs made changes -
        Fix Version/s 1.6.0 [ 12322468 ]
        Christopher Tubbs made changes -
        Fix Version/s 1.5.1 [ 12324399 ]
        Christopher Tubbs made changes -
        Field Original Value New Value
        Affects Version/s 1.5.0 [ 12318645 ]
        Affects Version/s 1.5.1 [ 12324399 ]
        Luke Brassard created issue -

          People

          • Assignee: Unassigned
          • Reporter: Luke Brassard
          • Votes: 1
          • Watchers: 5