Whirr / WHIRR-314

HBase integration test can fail due to Thrift server race

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.5.0
    • Component/s: None
    • Labels:
      None

      Description

      There is a race condition where the Thrift server comes up faster than the master, fails to connect (after trying 10 times), then shuts down for good. Both Andrei and I have seen this fail on Rackspace Cloud Servers.
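The failure mode described above can be sketched as a bounded retry loop that gives up for good once its attempts run out. This is a minimal simulation in Python, not HBase code: `connect_with_retries`, `make_slow_master`, and `MasterNotRunning` are hypothetical names, and the pause between attempts that a real client would take is left out to keep the sketch short.

```python
class MasterNotRunning(Exception):
    """Raised when every attempt to locate the master has failed."""


def connect_with_retries(probe, max_attempts):
    """Call probe() up to max_attempts times.

    Returns the 1-based attempt number that succeeded, or raises
    MasterNotRunning once the attempts are exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        if probe():
            return attempt
    raise MasterNotRunning(f"gave up after {max_attempts} attempts")


def make_slow_master(ready_on_probe):
    """Return a probe that starts succeeding on the given probe number.

    This stands in for a master that registers itself some time after
    the client has already started looking for it.
    """
    calls = {"n": 0}

    def probe():
        calls["n"] += 1
        return calls["n"] >= ready_on_probe

    return probe


# A master that is ready early is found within the retry budget.
found_on = connect_with_retries(make_slow_master(ready_on_probe=3), max_attempts=10)

# A master that is ready only on the 25th probe is never found with 10 attempts.
try:
    connect_with_retries(make_slow_master(ready_on_probe=25), max_attempts=10)
    gave_up = False
except MasterNotRunning:
    gave_up = True
```

Once the exception propagates, nothing restarts the client, which matches the "shuts down for good" behaviour reported here.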

      1. WHIRR-314.patch
        1 kB
        Tom White
      2. WHIRR-314.patch
        0.5 kB
        Tom White

        Activity

        Tom White added a comment -

        Here's a stack trace from the thrift server node:

        2011-05-25 16:40:19,672 INFO org.apache.hadoop.hbase.client.HConnectionManager$TableServers: getMaster attempt 9 of 10 failed; no more retrying.
        java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/master
             at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readAddressOrThrow(ZooKeeperWrapper.java:481)
             at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readMasterAddressOrThrow(ZooKeeperWrapper.java:377)
             at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getMaster(HConnectionManager.java:381)
             at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:78)
             at org.apache.hadoop.hbase.thrift.ThriftServer$HBaseHandler.<init>(ThriftServer.java:191)
             at org.apache.hadoop.hbase.thrift.ThriftServer.doMain(ThriftServer.java:817)
             at org.apache.hadoop.hbase.thrift.ThriftServer.main(ThriftServer.java:874)
        Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/master
             at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
             at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
             at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921)
             at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readAddressOrThrow(ZooKeeperWrapper.java:477)
             ... 6 more
        2011-05-25 16:40:19,677 INFO org.apache.zookeeper.ZooKeeper: Session: 0x1302806aebc0001 closed
        2011-05-25 16:40:19,678 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <173-203-217-78.static.cloud-ips.com:2181:/hbase,org.apache.hadoop.hbase.client.HConnectionManager>Closed connection with ZooKeeper; /hbase/root-region-server
        
        Tom White added a comment -

        This patch fixes the problem by increasing the number of retries to 100. I ran the integration test and it passed.
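Raising the retry count helps because it widens the time window in which the master may still appear. The sketch below works through that arithmetic under a loudly stated assumption: a fixed one-second pause between attempts and a master that needs 30 seconds to register, both hypothetical values chosen for illustration. HBase's client retry count is governed by the `hbase.client.retries.number` property, but the patch contents are not shown here, so treating that as the setting being changed is also an assumption, and HBase's actual pause schedule is not modelled.

```python
def last_attempt_time(max_attempts, pause_seconds):
    """Time of the final attempt, with the first attempt made at t = 0."""
    return (max_attempts - 1) * pause_seconds


def survives_startup(master_ready_after_seconds, max_attempts, pause_seconds):
    """True if at least one attempt happens once the master is ready."""
    return master_ready_after_seconds <= last_attempt_time(max_attempts, pause_seconds)


# Hypothetical numbers: 1 s pause, master ready 30 s after the client starts.
with_10_retries = survives_startup(30, max_attempts=10, pause_seconds=1)
with_100_retries = survives_startup(30, max_attempts=100, pause_seconds=1)
```

Under these assumed numbers the 10-attempt client stops trying at 9 seconds and misses the master, while the 100-attempt client is still trying at 99 seconds and connects.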

        Andrei Savu added a comment -

        +1 and we need the same change for CDH HBase in services/cdh/src/main/resources/functions/configure_cdh_hbase.sh.

        Side note: later we should make sure that tests do not block forever and that they fail after a reasonable amount of time (all the cleanup work is annoying).
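One way to keep a test from blocking forever is to run its body under a deadline and fail when the deadline passes. The following is a minimal language-neutral sketch in Python rather than anything from the Whirr test suite; `run_with_deadline` and `DeadlineExceeded` are hypothetical names. In a JUnit 4 suite the equivalent is usually a per-test timeout such as `@Test(timeout = ...)`.

```python
import threading


class DeadlineExceeded(Exception):
    """Raised when the wrapped call does not finish in time."""


def run_with_deadline(fn, timeout_seconds):
    """Run fn() in a worker thread and fail if it exceeds timeout_seconds.

    The worker is a daemon thread, so a body that is stuck forever does
    not keep the interpreter alive after the failure is reported.
    """
    outcome = {}

    def target():
        try:
            outcome["value"] = fn()
        except BaseException as exc:  # re-raised in the caller below
            outcome["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_seconds)

    if worker.is_alive():
        raise DeadlineExceeded(f"no result after {timeout_seconds} s")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


# A body that waits on an event nobody sets stands in for a hung test.
never_set = threading.Event()
try:
    run_with_deadline(lambda: never_set.wait(5), timeout_seconds=0.05)
    timed_out = False
except DeadlineExceeded:
    timed_out = True
```

The wrapper only stops waiting; it does not stop the stuck work itself, so any cluster cleanup still has to be handled separately.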

        Tom White added a comment -

        Updated patch which addresses Andrei's comment. I'm going to commit this now.

        Tom White added a comment -

        I've just committed this.


          People

          • Assignee:
            Tom White
            Reporter:
            Tom White
          • Votes:
            0
            Watchers:
            0
