Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Cannot Reproduce
1.5.2, 1.6.1
None
None
Description
Saw the following in the tserver log:
2014-11-14 08:38:30,612 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/bulk_failed_copyq
2014-11-14 08:38:30,621 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception communicating with ZooKeeper, will retry
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/config/tserver.compaction.warn.time
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
    at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:260)
    at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:157)
    at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:285)
    at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:232)
    at org.apache.accumulo.server.conf.ZooConfiguration.get(ZooConfiguration.java:96)
    at org.apache.accumulo.server.conf.ZooConfiguration._get(ZooConfiguration.java:65)
    at org.apache.accumulo.server.conf.ZooConfiguration.get(ZooConfiguration.java:90)
    at org.apache.accumulo.core.conf.AccumuloConfiguration.getTimeInMillis(AccumuloConfiguration.java:136)
    at org.apache.accumulo.tserver.CompactionWatcher.run(CompactionWatcher.java:84)
    at org.apache.accumulo.server.util.time.SimpleTimer$LoggingTimerTask.run(SimpleTimer.java:42)
    at java.util.TimerThread.mainLoop(Timer.java:555)
    at java.util.TimerThread.run(Timer.java:505)
2014-11-14 08:38:30,672 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/recovery
2014-11-14 08:38:30,672 [zookeeper.ZooLock] DEBUG: event null None Disconnected
2014-11-14 08:38:31,484 [zookeeper.ZooReader] WARN : Saw (possibly) transient exception communicating with ZooKeeper
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/tservers/ip-172-31-13-177:37709
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
    at org.apache.accumulo.fate.zookeeper.ZooReader.getStatus(ZooReader.java:109)
    at org.apache.accumulo.fate.zookeeper.ZooLock.process(ZooLock.java:381)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2014-11-14 08:38:31,484 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception communicating with ZooKeeper, will retry
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/tables/!0/namespace
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
    at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:260)
    at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:157)
    at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:285)
    at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:232)
    at org.apache.accumulo.core.client.impl.Tables.getNamespaceId(Tables.java:304)
    at org.apache.accumulo.server.conf.TableParentConfiguration.getNamespaceId(TableParentConfiguration.java:47)
    at org.apache.accumulo.server.conf.NamespaceConfiguration.getPath(NamespaceConfiguration.java:85)
    at org.apache.accumulo.server.conf.NamespaceConfiguration.get(NamespaceConfiguration.java:98)
    at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:107)
    at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:103)
    at org.apache.accumulo.core.conf.AccumuloConfiguration.getCount(AccumuloConfiguration.java:193)
    at org.apache.accumulo.tserver.TabletServer$MajorCompactor.run(TabletServer.java:2636)
    at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
    at java.lang.Thread.run(Thread.java:745)
2014-11-14 08:38:31,484 [zookeeper.Retry] DEBUG: Sleeping for 250ms before retrying operation
2014-11-14 08:38:31,485 [zookeeper.ZooSession] DEBUG: Removing closed ZooKeeper session to localhost:12644
2014-11-14 08:38:31,485 [zookeeper.ZooSession] DEBUG: Connecting to localhost:12644 with timeout 30000 with auth
2014-11-14 08:38:31,588 [zookeeper.ZooSession] DEBUG: Removing closed ZooKeeper session to localhost:12644
2014-11-14 08:38:31,588 [zookeeper.ZooSession] DEBUG: Connecting to localhost:12644 with timeout 30000 with auth
2014-11-14 08:38:31,692 [tserver.TabletServer] DEBUG: gc ParNew=0.10(+0.04) secs ConcurrentMarkSweep=0.05(+0.00) secs freemem=118,013,904(+6,412,200) totalmem=129,761,280
2014-11-14 08:38:31,692 [tserver.TabletServer] WARN : GC pause checker not called in a timely fashion. Expected every 5.0 seconds but was 43.1 seconds since last check
2014-11-14 08:38:31,700 [zookeeper.DistributedWorkQueue] DEBUG: Looking for work in /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/bulk_failed_copyq
2014-11-14 08:38:31,701 [zookeeper.DistributedWorkQueue] DEBUG: Looking for work in /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/recovery
2014-11-14 08:38:31,715 [tserver.TabletServer] DEBUG: ScanSess tid 172.31.13.177:35935 !0 1 entries in 0.03 secs, nbTimes = [24 24 24.00 1]
2014-11-14 08:38:31,737 [trace.ZooTraceClient] DEBUG: Scanning trace hosts in zookeeper: /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/tracers
2014-11-14 08:38:31,737 [trace.ZooTraceClient] DEBUG: Trace hosts: []
2014-11-14 08:38:31,739 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/replication/workqueue
2014-11-14 08:38:31,739 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/bulk_failed_copyq
2014-11-14 08:38:31,739 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/recovery
2014-11-14 08:38:31,739 [zookeeper.ZooSession] DEBUG: Session expired, state of current session : Expired
2014-11-14 08:38:31,739 [zookeeper.ZooLock] DEBUG: event null None Expired
2014-11-14 08:38:31,741 [tserver.TabletServer] FATAL: Lost tablet server lock (reason = SESSION_EXPIRED), exiting.
The ZooKeeper code appears to have disconnected, closed the disconnected connection, and then opened a new session. However, the ZooLock, IIRC, didn't reconnect and hung the tserver.
If we want to support this, it might require reworking some of the ZooLock code (to prevent the tserver from processing while it doesn't hold its lock).
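As a rough illustration of the "don't process while the lock isn't held" idea, here is a minimal sketch in plain Java. Everything in it is hypothetical: `LockGate`, `ConnState`, `Mode`, and the method names are made up for this example and are not part of ZooLock, Accumulo, or the ZooKeeper client. It assumes the tserver could check a gate before doing work, and that an expired session always requires reacquiring the lock (the ephemeral lock node goes away with the session), so a plain reconnect is not enough to resume.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical gate for illustration only; not an Accumulo or ZooKeeper class. */
final class LockGate {
  /** Simplified stand-ins for the connection states seen in the log above. */
  enum ConnState { CONNECTED, DISCONNECTED, EXPIRED }

  /** What the server is currently allowed to do. */
  enum Mode { SERVING, PAUSED, LOCK_LOST }

  private final AtomicReference<Mode> mode = new AtomicReference<>(Mode.SERVING);

  /** Assumed to be called from whatever watcher receives connection events. */
  void onStateChange(ConnState state) {
    mode.updateAndGet(current -> {
      // Once the session has expired, only reacquiring the lock may resume work.
      if (current == Mode.LOCK_LOST) {
        return Mode.LOCK_LOST;
      }
      switch (state) {
        case DISCONNECTED:
          return Mode.PAUSED;    // same session may still come back
        case EXPIRED:
          return Mode.LOCK_LOST; // ephemeral lock node is gone
        default:
          return Mode.SERVING;   // reconnected within the same session
      }
    });
  }

  /** Assumed to be called only after a new lock is obtained under a new session. */
  void onLockReacquired() {
    mode.set(Mode.SERVING);
  }

  /** Work loops would check this before processing anything. */
  boolean mayProcess() {
    return mode.get() == Mode.SERVING;
  }
}
```

With a gate like this, a Disconnected event would pause work rather than kill the process, while an Expired event would keep the server idle until it either reacquires its lock or exits as it does today.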