Accumulo / ACCUMULO-4060

Transient ZooKeeper connection issues kill FATE Runner threads

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.4, 1.7.0
    • Fix Version/s: 1.6.5, 1.7.1, 1.8.0
    • Component/s: fate, master
    • Labels:
      None

      Description

      Noticed the following on a 6-node Accumulo cluster with Kerberos and quality of protection set to auth-conf (wire encryption). The cluster appeared to be up, running, and healthy. An attempt to create a table via the shell hung in the CreateTableCommand, polling on the FATE operation. After a few minutes, it had still made no progress.

      Inspecting the FATE transactions showed that there were (multiple) FATE ops running, but none were locked or locking any tables, and none were making any progress.

      This led me to inspect the Master's log to figure out why nothing was making progress, and, to my joy, I found the following:

      2015-11-18 23:18:30,784 [fate.Fate] ERROR: Thread "Repo runner 0" died org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
      java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
              at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
              at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
              at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
              at java.lang.Thread.run(Thread.java:745)
      Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
              at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
              at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
              at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
              at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
              ... 6 more
      2015-11-18 23:18:30,783 [fate.Fate] ERROR: Thread "Repo runner 2" died org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
      java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
              at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
              at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
              at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
              at java.lang.Thread.run(Thread.java:745)
      Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
              at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
              at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
              at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
              at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
              ... 6 more
      2015-11-18 23:18:30,787 [fate.Fate] ERROR: Thread "Repo runner 1" died org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
      java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
              at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
              at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
              at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
              at java.lang.Thread.run(Thread.java:745)
      Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
              at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
              at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
              at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
              at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
              ... 6 more
      2015-11-18 23:18:30,787 [fate.Fate] ERROR: Thread "Repo runner 3" died org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
      java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
              at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
              at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
              at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
              at java.lang.Thread.run(Thread.java:745)
      Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
              at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
              at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
              at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
              at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
              ... 6 more
      

      This happened at the end of a ~30s period during which the Master had difficulty communicating with ZooKeeper. I've yet to investigate why this pause happened, but the real problem is that all four FATE runner threads died on a transient ConnectionLoss while the Master itself kept running, leaving it with no threads to execute FATE operations.
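
To illustrate the kind of behavior I'd expect instead, here is a minimal sketch of retrying a store operation on a transient error rather than letting the exception escape and kill the runner thread. This is not the Accumulo code or the eventual fix: `ResilientRunner`, `TransientStoreException`, and `retryTransient` are hypothetical names, and `TransientStoreException` merely stands in for something like ZooKeeper's `ConnectionLossException`.

```java
import java.util.concurrent.Callable;

public class ResilientRunner {

  /** Hypothetical stand-in for a transient error such as a ZooKeeper connection loss. */
  static class TransientStoreException extends Exception {
    TransientStoreException(String msg) {
      super(msg);
    }
  }

  /**
   * Calls op until it succeeds, sleeping with exponential backoff between
   * attempts. Only TransientStoreException is retried; any other exception
   * propagates immediately. After maxAttempts failures the last transient
   * exception is rethrown so the caller can decide what to do.
   */
  static <T> T retryTransient(Callable<T> op, int maxAttempts, long initialBackoffMs)
      throws Exception {
    long backoff = initialBackoffMs;
    for (int attempt = 1;; attempt++) {
      try {
        return op.call();
      } catch (TransientStoreException e) {
        if (attempt >= maxAttempts) {
          throw e;
        }
        Thread.sleep(backoff);
        backoff = Math.min(backoff * 2, 5000L); // cap the wait at 5s (arbitrary choice)
      }
    }
  }

  public static void main(String[] args) throws Exception {
    int[] calls = {0};
    // Simulated "reserve" call that fails twice with a transient error, then succeeds.
    String txid = retryTransient(() -> {
      if (++calls[0] < 3) {
        throw new TransientStoreException("ConnectionLoss (simulated)");
      }
      return "tx-1";
    }, 5, 10L);
    System.out.println("reserved " + txid + " after " + calls[0] + " attempts");
  }
}
```

Whether a runner should retry indefinitely, give up after a bounded number of attempts, or bring down the whole process when it cannot recover is a separate design decision; the point of the sketch is only that a short ZooKeeper hiccup should not silently leave the Master without FATE runners.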

              People

              • Assignee:
                elserj Josh Elser
                Reporter:
                elserj Josh Elser
              • Votes:
                0
                Watchers:
                3

                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h