Accumulo / ACCUMULO-1572

Single-node ZooKeeper failure kills connected Accumulo servers

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.4.5, 1.5.1, 1.6.0
    • Component/s: master, tserver
    • Labels:
      None

      Description

      Drew Thornton writes on the user mailing list:

      If one ZooKeeper node is shut down/fails/whatever and the rest of the ensemble stays up, the tablet servers attached as clients to the shut-down node immediately fail. If one of the clients happens to be the master, the cluster goes down.

      Accumulo does not seem to be failing over to the remaining zookeeper nodes, and this causes me to restart the individual tablet servers again.

      The zookeeper ensemble is very stable and has plenty of bandwidth/memory/processing, so taking one node down out of five doesn't crash the zookeepers, just the tablet servers...

          Activity

          Drew Thornton added a comment -

          There are two configurations where I have experienced this:

          1) The ZooKeeper leader is removed from the ensemble and a new leader is elected.
          Result: The master immediately goes down along with the tablet servers.

          2) According to http://zookeeper.apache.org/doc/r3.4.5/zookeeperAdmin.html, the leader should not serve clients in a 5+ node ensemble (zoo.cfg: leaderServes=no). Accumulo does not seem to anticipate this situation, so perhaps when a node is removed, the clients attempt to reestablish their connection at the leader, which is not serving, and they fail. This may appear as two failures in the ensemble.
          Result: The tablet servers that were clients of the removed node die. If the master was one of those clients, the cluster goes down.

          Versions:
          ZooKeeper 3.4.5 - 5 nodes with 4 processors and 6GB system memory (-Xmx4096m -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC)
          CDH 4.3 - MRv1, no YARN - no special configuration
          Accumulo 1.5.0 - adjusted for system memory

          ASF subversion and git services added a comment -

          Commit 333062d27e25ee227365357bdca237b0c6912f68 in branch refs/heads/1.4.4-SNAPSHOT from Eric Newton
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=333062d ]

          ACCUMULO-1572 ignore connection lost; eventually we'll get an session lost event

          ASF subversion and git services added a comment -

          Commit 7b617230979811d0e0ec8fffa6b633b70278c466 in branch refs/heads/1.5.1-SNAPSHOT from Eric Newton
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=7b61723 ]

          ACCUMULO-1572 ignore connection lost; eventually we'll get an session lost event

          ASF subversion and git services added a comment -

          Commit 7b617230979811d0e0ec8fffa6b633b70278c466 in branch refs/heads/master from Eric Newton
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=7b61723 ]

          ACCUMULO-1572 ignore connection lost; eventually we'll get an session lost event

          ASF subversion and git services added a comment -

          Commit 388d58c6d02224e76fab77db852258eccc2dab7a in branch refs/heads/master from Eric Newton
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=388d58c ]

          ACCUMULO-1572 integration test

          Eric Newton added a comment -

          Wrote an integration test that reproduced the problem, then eliminated the fail-fast on connection lost.
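
The distinction behind this fix can be sketched in isolation: a lost connection is transient (the ZooKeeper client library keeps trying the other ensemble members and the session may still be valid), while an expired session means the ephemeral lock node is gone. The sketch below is a minimal illustration and not Accumulo's code. `SessionState` is a hypothetical stand-in for ZooKeeper's `Watcher.Event.KeeperState` so the example runs without the client jar, and `ConnectionLossPolicy` is an invented name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for org.apache.zookeeper.Watcher.Event.KeeperState,
// reduced to the three states that matter for this discussion.
enum SessionState { SYNC_CONNECTED, DISCONNECTED, EXPIRED }

// Hypothetical policy object (not an Accumulo class): only session expiry
// is treated as fatal; a plain disconnect is logged and waited out.
final class ConnectionLossPolicy {
    private boolean connected = true;
    private boolean lockLost = false;
    private final List<String> log = new ArrayList<>();

    void process(SessionState state) {
        switch (state) {
            case DISCONNECTED:
                // Transient: the session may still be alive, so do not fail fast.
                connected = false;
                log.add("lost connection to zookeeper");
                break;
            case SYNC_CONNECTED:
                // Reconnected within the session timeout; nothing was lost.
                connected = true;
                break;
            case EXPIRED:
                // The ensemble expired the session; ephemeral nodes are removed.
                connected = false;
                lockLost = true;
                log.add("session expired, lock lost");
                break;
        }
    }

    boolean isConnected() { return connected; }
    boolean isLockLost() { return lockLost; }
    List<String> log() { return log; }
}
```

With this policy a server that reconnects before its session times out keeps running, and only an expired session leads to shutdown.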

          Eric Newton added a comment -

          Keith, what do you think needs to be resolved?

          Keith Turner added a comment -

          We were discussing how connection loss events were handled in a change to ZooLock and decided some changes needed to be made. I think watchingParent can get set to false when it should not.

          Keith Turner added a comment -

          It's confusing that the ticket is not marked for 1.4.4, since a change was made to 1.4.4 under this ticket. The following change was made for 1.5 and 1.6 but not 1.4 under this ticket. It's in this change that watchingParent may not be handled properly. Eric Newton, why wasn't the following change made for 1.4?

          @@ -349,6 +349,9 @@ public class ZooLock implements Watcher {
                 try { // set the watch on the parent node again
                   zooKeeper.getStatus(path, this);
                   watchingParent = true;
          +      } catch (KeeperException.ConnectionLossException ex) {
          +        // we can't look at the lock because we aren't connected, but our session is still good
          +        log.warn("lost connection to zookeeper");
                 } catch (Exception ex) {
                   if (lock != null || asyncLock != null) {
                     lockWatcher.unableToMonitorLockNode(ex);
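
The catch ordering in the diff above can be exercised on its own. The following is a rough sketch under stated assumptions, not the ZooLock implementation: `SketchKeeperException` and `SketchConnectionLossException` are hypothetical stand-ins for `KeeperException` and `KeeperException.ConnectionLossException`, `StatusCheck` stands in for the `getStatus` call, and the `lock != null || asyncLock != null` guard is left out. It shows that on connection loss the watcher only warns and leaves `watchingParent` at whatever value it already had, which is the state the concern about `watchingParent` refers to.

```java
// Hypothetical stand-ins for KeeperException and its ConnectionLossException subclass.
class SketchKeeperException extends Exception {}
class SketchConnectionLossException extends SketchKeeperException {}

// Stand-in for the status call that re-registers the watch on the parent node.
interface StatusCheck { void run() throws Exception; }

final class ParentWatchSketch {
    boolean watchingParent = false;
    boolean reportedUnableToMonitor = false;
    int warnings = 0;

    void rewatch(StatusCheck check) {
        try {
            check.run();               // set the watch on the parent node again
            watchingParent = true;
        } catch (SketchConnectionLossException ex) {
            // Not connected, but the session may still be good: warn only.
            // watchingParent is deliberately left untouched here.
            warnings++;
        } catch (Exception ex) {
            // Any other failure is reported to the lock watcher
            // (the original guards this with a lock/asyncLock null check).
            reportedUnableToMonitor = true;
        }
    }
}
```

The more specific catch has to come first; placing it after the general `Exception` handler would not compile in Java.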
          
          Eric Newton added a comment -

          Keith Turner, I don't know why the change wasn't applied to the 1.4 branch.

          From observation, it looks like the loggers still go down when a ZooKeeper node goes down, even with this fix. The other servers stay up if they are able to reconnect in a timely fashion.

          So this is still not fixed in 1.4.

          ASF subversion and git services added a comment -

          Commit 4ed51ecbca7d4120c5c31531ecbebb5d56a7b79f in branch refs/heads/1.4.4-SNAPSHOT from Eric Newton
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=4ed51ec ]

          ACCUMULO-1572 apply missing patch; prevent logger from killing itself on a Disconnect event

          ASF subversion and git services added a comment -

          Commit 4ed51ecbca7d4120c5c31531ecbebb5d56a7b79f in branch refs/heads/1.5.1-SNAPSHOT from Eric Newton
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=4ed51ec ]

          ACCUMULO-1572 apply missing patch; prevent logger from killing itself on a Disconnect event


            People

            • Assignee: Eric Newton
            • Reporter: Eric Newton