Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.0
    • Component/s: master
    • Labels: None
    • Environment: 1.6.0 sha 417902e218c566333b6ea5ac492186ae305e5e16

Description

I have a test that brings Accumulo down hard after a minute and then brings it back up again. I was running it overnight and saw this stack trace once. Not sure if it's a problem or not, though.

      org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /accumulo/617ee3a7-98b9-4f5f-af13-8894afe7c33c/dead/tservers/10.10.1.148:9997
      	org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /accumulo/617ee3a7-98b9-4f5f-af13-8894afe7c33c/dead/tservers/10.10.1.148:9997
      		at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
      		at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
      		at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
      		at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1180)
      		at org.apache.accumulo.fate.zookeeper.ZooReader.getData(ZooReader.java:45)
      		at org.apache.accumulo.server.master.state.DeadServerList.getList(DeadServerList.java:52)
      		at org.apache.accumulo.master.MasterClientServiceHandler.getMasterStats(MasterClientServiceHandler.java:268)
      		at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      		at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      		at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      		at java.lang.reflect.Method.invoke(Method.java:597)
      		at org.apache.accumulo.trace.instrument.thrift.TraceWrap$1.invoke(TraceWrap.java:63)
      		at com.sun.proxy.$Proxy11.getMasterStats(Unknown Source)
      		at org.apache.accumulo.core.master.thrift.MasterClientService$Processor$getMasterStats.getResult(MasterClientService.java:1414)
      		at org.apache.accumulo.core.master.thrift.MasterClientService$Processor$getMasterStats.getResult(MasterClientService.java:1398)
      		at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
      		at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
      		at org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:171)
      		at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
      		at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
      		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
      		at java.lang.Thread.run(Thread.java:662)

Attachments

1. ACCUMULO-2154.v1.patch.txt (6 kB) - Vikram Srivastava
2. ACCUMULO-2154.v2.patch.txt (2 kB) - Vikram Srivastava

Activity

        ASF subversion and git services added a comment -

        Commit ecdd8528ddf0108919ec48dc545e765c2c3aa364 in branch refs/heads/master from Vikram Srivastava
        [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=ecdd852 ]

        ACCUMULO-2154 Ignore NoNodeException while getting DeadServerList

        Signed-off-by: Keith Turner <kturner@apache.org>

        ASF subversion and git services added a comment -

        Commit ecdd8528ddf0108919ec48dc545e765c2c3aa364 in branch refs/heads/1.6.0-SNAPSHOT from Vikram Srivastava
        [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=ecdd852 ]

        ACCUMULO-2154 Ignore NoNodeException while getting DeadServerList

        Signed-off-by: Keith Turner <kturner@apache.org>

        Vikram Srivastava added a comment -

        Attached patch where this exception is ignored.

        Keith Turner added a comment -

        "I was under the impression that only master accesses the dead TServers znode."

        I think that assumption may be correct (not 100% on this) for now. I was just thinking that using ZooKeeper methods for dealing with concurrency will make the code more resilient to change in the future. It's difficult to test the assumption that only the master modifies the znode and prevent future regressions.

        Vikram Srivastava added a comment -

        Thanks for the explanation, Keith Turner. I was under the impression that only master accesses the dead TServers znode. But if that's not the case, what you said makes sense. I'll submit another patch that ignores the NoNodeException.

        Keith Turner added a comment -

        I realize that synchronization will fix this if we assume only one process and one object in that process access this resource. But simply catching and ignoring the no-node exception will also fix the problem without those assumptions. Synchronization is great when the resource being protected is private process memory; however, that's not true in this case. ZooKeeper is a cluster-wide resource, and it's possible that any other process in the cluster could mutate ZooKeeper at any time. The way I see it, there are at least three options to solve this problem.

        1. use Java synchronization, with the assumptions stated above
        2. use ZooKeeper primitives for dealing with concurrency
        3. use both Java synchronization and ZooKeeper primitives for dealing with concurrency

        I am in favor of #2. It's also very simple in this case: just ignore NoNodeException, because it indicates the node was deleted after the call to getChildren() was made.

        "I've made post() also synchronized so that getList() doesn't miss any dead servers that get added after zoo.getChildren() is called."

        It will still miss those servers. When synchronized, if one thread is in getList(), then another thread calling post() will block. Detecting changes after getChildren() is called is not needed; we just need a consistent snapshot at a point in time. That could be achieved by calling getChildren() in a loop and waiting for it to stabilize, but the data could still be outdated by other operations immediately after getList() returns, so the code still has to treat it as a snapshot. Anything more would require some sort of transaction semantics across method calls, which is not needed.
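
        Below is a minimal sketch of option #2 against the raw ZooKeeper client API rather than the actual DeadServerList/ZooReader code; the class name, path constant, and return type are illustrative only.

        import java.util.ArrayList;
        import java.util.List;

        import org.apache.zookeeper.KeeperException;
        import org.apache.zookeeper.KeeperException.NoNodeException;
        import org.apache.zookeeper.ZooKeeper;

        public class IgnoreNoNodeSketch {

          // Illustrative path; the real code builds this from the instance id.
          static final String DEAD_TSERVERS = "/accumulo/<instance-id>/dead/tservers";

          // Option #2: a NoNodeException on an individual child means it was
          // deleted between getChildren() and getData(), so just skip it.
          static List<String> getList(ZooKeeper zk)
              throws KeeperException, InterruptedException {
            List<String> result = new ArrayList<String>();
            for (String child : zk.getChildren(DEAD_TSERVERS, false)) {
              try {
                byte[] cause = zk.getData(DEAD_TSERVERS + "/" + child, false, null);
                result.add(child + " " + new String(cause));
              } catch (NoNodeException nne) {
                // The node vanished after getChildren() returned; the server is
                // no longer listed as dead, so leave it out of the snapshot.
              }
            }
            return result;
          }
        }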

        Vikram Srivastava added a comment -

        getList() getting called by multiple processes isn't the issue here. This exception happens when MasterClientServiceHandler is calling getList() and, at the same time, one of the other two call sites calls delete(). So we do need to ensure getList() exits before delete() is called. I've made post() also synchronized so that getList() doesn't miss any dead servers that get added after zoo.getChildren() is called.
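
        For reference, a rough sketch of the per-process synchronization approach described here; the class name and path are illustrative, and the real DeadServerList goes through the ZooReader wrapper visible in the stack trace rather than a raw ZooKeeper handle.

        import java.util.ArrayList;
        import java.util.List;

        import org.apache.zookeeper.CreateMode;
        import org.apache.zookeeper.KeeperException;
        import org.apache.zookeeper.ZooDefs.Ids;
        import org.apache.zookeeper.ZooKeeper;

        public class SynchronizedDeadServersSketch {

          private final ZooKeeper zk;
          private final String basePath; // e.g. /accumulo/<instance-id>/dead/tservers

          SynchronizedDeadServersSketch(ZooKeeper zk, String basePath) {
            this.zk = zk;
            this.basePath = basePath;
          }

          // Option #1: serialize all access within this process so delete() and
          // post() cannot run while getList() is iterating over the children.
          synchronized List<String> getList() throws KeeperException, InterruptedException {
            List<String> result = new ArrayList<String>();
            for (String child : zk.getChildren(basePath, false)) {
              byte[] cause = zk.getData(basePath + "/" + child, false, null);
              result.add(child + " " + new String(cause));
            }
            return result;
          }

          synchronized void post(String server, String cause)
              throws KeeperException, InterruptedException {
            zk.create(basePath + "/" + server, cause.getBytes(),
                Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
          }

          synchronized void delete(String server) throws KeeperException, InterruptedException {
            zk.delete(basePath + "/" + server, -1);
          }
        }

        The synchronization only helps if this object, in this one process, is the sole reader and writer of the znode, which is the assumption questioned in the replies above.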

        Keith Turner added a comment -

        Why not catch and ignore KeeperException$NoNodeException in DeadServerList.getList()? Seems like we would only want to do this for the children. Currently getList() is only called by a single process, but if it were called by multiple processes in the future, this would be a more robust solution.

        Vikram Srivastava added a comment -

        Attached patch.

        Vikram Srivastava added a comment -

        I think it's happening because DeadServerList.getList() first gets the list of paths using zoo.getChildren() and then iterates over them without any lock to ensure that none of the paths get deleted while the loop is running.
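
        A minimal sketch of the racy pattern described here, using the raw ZooKeeper client API (the path argument and class name are illustrative):

        import java.util.List;

        import org.apache.zookeeper.KeeperException;
        import org.apache.zookeeper.ZooKeeper;

        public class DeadServerRaceSketch {

          // getChildren() returns a point-in-time snapshot, so a child listed
          // there may already be deleted by the time getData() is called on it,
          // which surfaces as the KeeperException$NoNodeException in the report.
          static void walkDeadServers(ZooKeeper zk, String deadTserversPath)
              throws KeeperException, InterruptedException {
            List<String> children = zk.getChildren(deadTserversPath, false);
            for (String child : children) {
              // Nothing stops another caller from deleting this node right here.
              byte[] cause = zk.getData(deadTserversPath + "/" + child, false, null);
              System.out.println(child + " " + new String(cause));
            }
          }
        }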


People

    • Assignee: Vikram Srivastava
    • Reporter: John Vines
    • Votes: 0
    • Watchers: 4
