[ACCUMULO-1449] Connector/ZooCache code enters infinite loop when Zookeeper connection lost. - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Cannot Reproduce
Affects Version/s: 1.5.0
Fix Version/s: None
Component/s: client
Labels: None
Environment: accumulo-1.5.0-RC4, zookeeper-3.4.5, hadoop-1.0.4, CentOS 6.4

Description

While using 1.5.0-RC4, a long-lived Connector went into an infinite loop of ZooKeeper "ConnectionLoss" and "Session expired" failures. The application is multithreaded and all threads share the same Connector; every call to conn.createScanner() or conn.createBatchScanner() produced errors. Here are a couple of stack traces:

2013-05-22 09:12:28,250 [zookeeper.ZooCache] WARN : Zookeeper error, will retry
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/5e982cc9-6959-4064-9712-2ff3dc1003d8
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
	at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:208)
	at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:130)
	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:233)
	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:188)
	at org.apache.accumulo.core.client.ZooKeeperInstance.getInstanceID(ZooKeeperInstance.java:151)
	at org.apache.accumulo.core.zookeeper.ZooUtil.getRoot(ZooUtil.java:24)
	at org.apache.accumulo.core.client.impl.Tables.getMap(Tables.java:46)
	at org.apache.accumulo.core.client.impl.Tables.getNameToIdMap(Tables.java:78)
	at org.apache.accumulo.core.client.impl.Tables.getTableId(Tables.java:64)
	at org.apache.accumulo.core.client.impl.ConnectorImpl.getTableId(ConnectorImpl.java:75)
	at org.apache.accumulo.core.client.impl.ConnectorImpl.createScanner(ConnectorImpl.java:137)

    
2013-05-22 09:12:23,849 [zookeeper.ZooCache] WARN : Zookeeper error, will retry
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/5e982cc9-6959-4064-9712-2ff3dc1003d8
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
	at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:208)
	at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:130)
	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:233)
	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:188)
	at org.apache.accumulo.core.client.ZooKeeperInstance.getInstanceID(ZooKeeperInstance.java:151)
	at org.apache.accumulo.core.zookeeper.ZooUtil.getRoot(ZooUtil.java:24)
	at org.apache.accumulo.core.client.impl.Tables.getMap(Tables.java:46)
	at org.apache.accumulo.core.client.impl.Tables.getNameToIdMap(Tables.java:78)
	at org.apache.accumulo.core.client.impl.Tables.getTableId(Tables.java:64)
	at org.apache.accumulo.core.client.impl.ConnectorImpl.getTableId(ConnectorImpl.java:75)
	at org.apache.accumulo.core.client.impl.ConnectorImpl.createBatchScanner(ConnectorImpl.java:89)

The method ZooCache.retry(ZooRunnable op) (ZooCache.java:128) has a while(true) loop with no exit condition for repeated failures. It should probably enforce a maximum number of retries or a timeout, after which it throws an exception that the client can handle appropriately. As written, the loop never exits as long as ZooKeeper keeps returning errors.
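
To illustrate the suggestion, here is a minimal sketch of a retry loop that gives up after a fixed number of attempts. It is not the ZooCache code and not a proposed patch: BoundedRetry, RetriesExhaustedException, maxAttempts, and sleepMillis are hypothetical names invented for this example, and a plain Callable stands in for ZooRunnable.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration; not part of Accumulo or ZooKeeper. */
final class BoundedRetry {

    /** Thrown once every attempt has failed; the cause is the last underlying error. */
    public static final class RetriesExhaustedException extends Exception {
        public RetriesExhaustedException(int attempts, Throwable lastError) {
            super("gave up after " + attempts + " attempts", lastError);
        }
    }

    private BoundedRetry() {}

    /**
     * Runs op until it succeeds or maxAttempts is reached, sleeping
     * sleepMillis between failed attempts.
     */
    public static <T> T call(Callable<T> op, int maxAttempts, long sleepMillis)
            throws RetriesExhaustedException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (InterruptedException e) {
                // Do not swallow interrupts; let the caller decide what to do.
                throw e;
            } catch (Exception e) {
                // Remember the failure and fall through to the next attempt.
                last = e;
            }
            if (attempt < maxAttempts) {
                Thread.sleep(sleepMillis);
            }
        }
        // Unlike while(true), the loop ends here and the caller sees an exception.
        throw new RetriesExhaustedException(maxAttempts, last);
    }
}
```

With a bound like this, a client calling conn.createScanner() would eventually receive an exception it could catch, instead of blocking forever. Whether a retry count, a wall-clock timeout, or both is the better limit is left open here.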

Note: There may have been a network hiccup that triggered the bug, but there was no way to recover without restarting the application.
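
Until the loop is bounded inside the library, a caller could in principle guard its own calls with a timeout. The sketch below is an untested idea rather than a verified workaround: TimedCall and callWithTimeout are hypothetical names, and it only unblocks the calling thread. If the retry loop ignores interrupts, the worker thread keeps spinning in the background (it is marked as a daemon so it does not block JVM exit).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical caller-side guard; not part of the Accumulo client API. */
final class TimedCall {

    private TimedCall() {}

    /**
     * Runs op on a separate daemon thread and waits at most the given time.
     * Throws TimeoutException if op has not finished by then.
     */
    static <T> T callWithTimeout(Callable<T> op, long timeout, TimeUnit unit)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timed-call");
            t.setDaemon(true);
            return t;
        });
        Future<T> future = executor.submit(op);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Best effort: interrupt the worker, then report the timeout.
            future.cancel(true);
            throw e;
        } catch (ExecutionException e) {
            // Surface the original failure where possible.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A call such as conn.createScanner() could be wrapped in the Callable passed to callWithTimeout; on TimeoutException the application could then log the failure and decide whether to rebuild its connection state.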

People

Assignee: Unassigned

Reporter: Luke Brassard

Votes: 1

Watchers: 6

Dates

Created: 22/May/13 19:23

Updated: 23/May/15 19:05

Resolved: 23/May/15 19:05