HBASE-5780

Fix race in HBase regionserver startup vs ZK SASL authentication

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.92.1, 0.94.0
    • Fix Version/s: 0.94.0, 0.95.0
    • Component/s: security
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      Secure RegionServers sometimes fail to start with the following backtrace:

      2012-03-22 17:20:16,737 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server centos60-20.ent.cloudera.com,60020,1332462015929: Unexpected exception during initialization, aborting
      org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /hbase/shutdown
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:113)
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
      at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1131)
      at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:295)
      at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:518)
      at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:494)
      at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:77)
      at org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:569)
      at org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:532)
      at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:634)
      at java.lang.Thread.run(Thread.java:662)

      1. TestReplicationPeer-Security-output.log
        6 kB
        Shaneal Manek
      2. TestReplicationPeer-output.log
        5 kB
        Shaneal Manek
      3. testoutput.tar.gz
        9 kB
        Shaneal Manek
      4. HBASE-5780-v2.patch
        0.9 kB
        Shaneal Manek
      5. HBASE-5780.patch
        0.9 kB
        Shaneal Manek

          Activity

          Shaneal Manek added a comment -

          Ensures that when we're using a secure ZK, we wait for the ZK watchers to receive the SaslAuthenticated message.
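The wait described above can be sketched with a latch that the watcher releases when it sees the authentication event. This is a minimal, self-contained illustration, not the code from the attached patch: `ConnState`, `AuthAwareWatcher`, and `awaitSaslAuthenticated` are hypothetical names, and the event thread is simulated rather than driven by a real ZooKeeper client.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the connection states a ZooKeeper watcher can observe. */
enum ConnState { SYNC_CONNECTED, SASL_AUTHENTICATED }

/** Illustrative watcher that lets callers block until SASL authentication is reported. */
class AuthAwareWatcher {
    private final CountDownLatch saslLatch = new CountDownLatch(1);

    /** Called by the (simulated) client event thread for each connection event. */
    void process(ConnState state) {
        if (state == ConnState.SASL_AUTHENTICATED) {
            saslLatch.countDown();
        }
    }

    /** Blocks until authentication is seen or the timeout elapses; true means authenticated. */
    boolean awaitSaslAuthenticated(long timeout, TimeUnit unit) throws InterruptedException {
        return saslLatch.await(timeout, unit);
    }
}

class SaslWaitDemo {
    public static void main(String[] args) throws Exception {
        AuthAwareWatcher watcher = new AuthAwareWatcher();
        // Simulate the event thread delivering events after startup has begun.
        Thread events = new Thread(() -> {
            watcher.process(ConnState.SYNC_CONNECTED);
            watcher.process(ConnState.SASL_AUTHENTICATED);
        });
        events.start();
        // Startup code would wait here before reading ACL-protected znodes.
        boolean ok = watcher.awaitSaslAuthenticated(5, TimeUnit.SECONDS);
        events.join();
        System.out.println("authenticated=" + ok);
    }
}
```

Without such a wait, a read of a protected znode (such as /hbase/shutdown in the backtrace above) can race ahead of the authentication handshake and fail with NoAuth.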

          Shaneal Manek added a comment -

          Just noticed that this is eventually being fixed in the ZK client (ZOOKEEPER-1437). We still need to work around the ZK 'bug' in the interim, though.

          Ted Yu added a comment -
          +    } catch (InterruptedException e) {
          +      LOG.error("Interrupted while waiting for the ZookeeperWatcher to authenticate", e);
          

          Is it safe to proceed with start() in the above case?

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12522522/HBASE-5780.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 3 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.regionserver.wal.TestHLogSplit
          org.apache.hadoop.hbase.replication.TestMultiSlaveReplication
          org.apache.hadoop.hbase.regionserver.wal.TestHLog
          org.apache.hadoop.hbase.replication.TestMasterReplication

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1504//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1504//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1504//console

          This message is automatically generated.

          Shaneal Manek added a comment -

          Reply to Ted: Yes, it's safe (i.e., the worst possible outcome is a NoAuthException being thrown later in the start() method, not some sort of inconsistent internal state).

          The two obvious alternatives to logging the interruption and continuing are to retry waiting for auth to finish (which I very much dislike), or to throw an exception immediately.

          However, even if the thread is interrupted before authentication completes, it's entirely possible everything else will go fine (for example, the Master and ROOT znodes are world-readable, so in those cases it will work with or without auth). And if auth was actually needed, a ZK NoAuthException will be thrown anyway, so there is no real harm in not bailing out early.

          If you'd like me to change it to throw an exception immediately, I'd be happy to do so though.

          Ted Yu added a comment -

          I think throwing an exception immediately is better.

          Please also run the test suite with '-Psecurity', since Hadoop QA doesn't test the security profile. Let us know the test result.

          Thanks

          Shaneal Manek added a comment -

          Throws an IllegalStateException on interruption. Currently running tests with the security profile (and will upload the results as soon as they finish).
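A sketch of the fail-fast behavior described above, assuming the wait is implemented with a `CountDownLatch`. `waitForAuth` is a hypothetical helper, not the method from the patch, and re-setting the thread's interrupt flag before throwing is an added convention rather than something the patch is known to do.

```java
import java.util.concurrent.CountDownLatch;

class InterruptHandlingDemo {
    /**
     * Waits on the latch. An interrupt is turned into an IllegalStateException
     * instead of being logged and ignored.
     */
    static void waitForAuth(CountDownLatch authenticated) {
        try {
            authenticated.await();
        } catch (InterruptedException e) {
            // Assumption: preserve the interrupt status so callers can still see it.
            Thread.currentThread().interrupt();
            throw new IllegalStateException(
                "Interrupted while waiting for ZooKeeper authentication", e);
        }
    }

    public static void main(String[] args) {
        // Simulate an interrupt arriving before authentication completes.
        Thread.currentThread().interrupt();
        try {
            waitForAuth(new CountDownLatch(1));
        } catch (IllegalStateException e) {
            System.out.println("startup aborted: " + e.getMessage());
        } finally {
            Thread.interrupted(); // clear the flag set for this demo
        }
    }
}
```

Compared with logging and continuing, this surfaces the problem at the point of interruption rather than as a later NoAuthException.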

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12522612/HBASE-5780-v2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 3 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.mapreduce.TestWALPlayer

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1517//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1517//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1517//console

          This message is automatically generated.

          Shaneal Manek added a comment -

          There are some unrelated flaky tests (but they are flaky with or without this patch). Output attached.

          Ted Yu added a comment -

          Test results look good.
          Will integrate tomorrow if there is no objection.

          Shaneal Manek added a comment -

          Thanks Zhihong! Would you mind applying the patch to the 0.92 branch too? (I only mention it since you only marked 0.96.0 and 0.94.1 as fix versions.) I'm running into this problem on a 0.92.1 cluster.

          Ted Yu added a comment -

          0.92 builds have failed 7 times straight.
          Trunk builds have failed 4 times consecutively.

          Will integrate to 0.94 first.

          Ted Yu added a comment -

          Integrated to 0.94 branch.

          Waiting for 0.92 and trunk builds to pass before further integration.

          Thanks for the patch, Shaneal.

          Hudson added a comment -

          Integrated in HBase-0.94 #115 (See https://builds.apache.org/job/HBase-0.94/115/)
          HBASE-5780 Fix race in HBase regionserver startup vs ZK SASL authentication (Shaneal) (Revision 1326101)

          Result = FAILURE
          tedyu :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java

          Ted Yu added a comment -

          TestReplicationPeer#testResetZooKeeperSession failed on Jenkins and locally (on MacBook).
          I reverted the patch from 0.94 for further investigation.

          Hudson added a comment -

          Integrated in HBase-0.94 #116 (See https://builds.apache.org/job/HBase-0.94/116/)
          HBASE-5780 revert due to TestReplicationPeer#testResetZooKeeperSession failure (Revision 1326122)

          Result = FAILURE
          tedyu :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java

          Shaneal Manek added a comment -

          Odd, that test seems to always pass for me (tested on 0.94 and 0.92). How are you running your tests locally?

          Both of the following pass for me (output attached too):

          mvn -PlocalTests -Dtest=TestReplicationPeer -Psecurity clean test
          mvn -PlocalTests -Dtest=TestReplicationPeer clean test
          

          I'm looking into it in more detail now too.

          Ted Yu added a comment -

          I couldn't reproduce the test failure on MacBook.

          Integrated to 0.94 again.

          Ted Yu added a comment -

          Integrated to 0.92 and trunk as well.

          Thanks for the patch, Shaneal.

          Shaneal Manek added a comment -

          Thanks for the help Zhihong!

          Hudson added a comment -

          Integrated in HBase-TRUNK #2770 (See https://builds.apache.org/job/HBase-TRUNK/2770/)
          HBASE-5780 Fix race in HBase regionserver startup vs ZK SASL authentication (Shaneal) (Revision 1326814)

          Result = SUCCESS
          tedyu :
          Files :

          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java

          Hudson added a comment -

          Integrated in HBase-0.94 #121 (See https://builds.apache.org/job/HBase-0.94/121/)
          HBASE-5780 Fix race in HBase regionserver startup vs ZK SASL authentication (Shaneal) (Revision 1326810)

          Result = FAILURE
          tedyu :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java

          Lars Hofhansl added a comment -

          RC1 was sunk, so this is against 0.94.0 again.

          Lars Hofhansl added a comment -

          testResetZooKeeperSession failed in 0.94 again.

          Ted Yu added a comment -

          In build #122 (https://builds.apache.org/job/HBase-0.94/122/console), the test passed:

          Running org.apache.hadoop.hbase.replication.TestReplicationPeer
          Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 7.67 sec
          
          Hudson added a comment -

          Integrated in HBase-0.92 #374 (See https://builds.apache.org/job/HBase-0.92/374/)
          HBASE-5780 Fix race in HBase regionserver startup vs ZK SASL authentication (Shaneal) (Revision 1326815)

          Result = SUCCESS
          tedyu :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java

          Hudson added a comment -

          Integrated in HBase-TRUNK-security #173 (See https://builds.apache.org/job/HBase-TRUNK-security/173/)
          HBASE-5780 Fix race in HBase regionserver startup vs ZK SASL authentication (Shaneal) (Revision 1326814)

          Result = FAILURE
          tedyu :
          Files :

          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java

          Eugene Koontz added a comment -

          I tried:

          mvn clean ; while [ "$?" -eq "0" ] ; do mvn -PrunDevTests,localTests,security test -Dtest=TestReplicationPeer ; done
          

          and the test consistently passes for me on both trunk and 0.94 branches.

          Hudson added a comment -

          Integrated in HBase-0.94-security #13 (See https://builds.apache.org/job/HBase-0.94-security/13/)
          HBASE-5780 Fix race in HBase regionserver startup vs ZK SASL authentication (Shaneal) (Revision 1326810)

          Result = FAILURE
          tedyu :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java

          Hudson added a comment -

          Integrated in HBase-0.92-security #105 (See https://builds.apache.org/job/HBase-0.92-security/105/)
          HBASE-5780 Fix race in HBase regionserver startup vs ZK SASL authentication (Shaneal) (Revision 1326815)

          Result = FAILURE
          tedyu :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java

            People

            • Assignee: Shaneal Manek
            • Reporter: Shaneal Manek
            • Votes: 0
            • Watchers: 3