IntegrationTestMTTR can loop forever on improperly configured clusters

    Details

    • Type: Test
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.95.2
    • Fix Version/s: 0.98.0, 0.96.0
    • Component/s: test
    • Labels:
      None

      Description

      IntegrationTestMTTR has a retry loop that can run forever. For instance, running the test on a secure cluster as a user who lacks permission to perform table actions can cause this scenario. Add another loop counter and bail when a TimingCallable instance throws too many unexpected exceptions.
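
The idea in the description can be sketched roughly as follows. This is a minimal illustration, not the code from the attached patches: BoundedRetry, callWithBudget, TooManyFailuresException, and the maxFailures parameter are hypothetical names chosen for the example.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch of a retry loop with a failure budget. */
final class BoundedRetry {

    /** Thrown once the failure budget is used up (illustrative name). */
    static final class TooManyFailuresException extends Exception {
        private static final long serialVersionUID = 1L;

        TooManyFailuresException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Calls {@code action} until it succeeds, but gives up after
     * {@code maxFailures} exceptions instead of looping forever.
     */
    static <T> T callWithBudget(Callable<T> action, int maxFailures)
            throws TooManyFailuresException {
        if (maxFailures < 1) {
            throw new IllegalArgumentException("maxFailures must be >= 1");
        }
        int failures = 0; // the extra loop counter
        while (true) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Do not swallow interrupts; restore the flag and stop.
                Thread.currentThread().interrupt();
                throw new TooManyFailuresException("interrupted", e);
            } catch (Exception e) {
                failures++;
                if (failures >= maxFailures) {
                    throw new TooManyFailuresException(
                        "giving up after " + failures + " failures", e);
                }
            }
        }
    }
}
```

With a budget of 3, an action that always fails (for example because the user lacks table permissions) is attempted three times and then the loop exits with the last exception attached as the cause.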

      1. HBASE-9655.00.patch
        5 kB
        Nick Dimiduk
      2. HBASE-9655.01.patch
        8 kB
        Nick Dimiduk

        Activity

        Nicolas Liochon added a comment -

        I've created HBASE-9685 for RetriesExhaustedException.

        Hudson added a comment -

        SUCCESS: Integrated in hbase-0.96 #100 (See https://builds.apache.org/job/hbase-0.96/100/)
        HBASE-9655 IntegrationTestMTTR can loop forever on improperly configured clusters (ndimiduk: rev 1526733)

        • /hbase/branches/0.96/hbase-it/src/test/java/org/apache/hadoop/hbase/mttr/IntegrationTestMTTR.java
        Hudson added a comment -

        SUCCESS: Integrated in HBase-TRUNK #4566 (See https://builds.apache.org/job/HBase-TRUNK/4566/)
        HBASE-9655 IntegrationTestMTTR can loop forever on improperly configured clusters (ndimiduk: rev 1526729)

        • /hbase/trunk/hbase-it/src/test/java/org/apache/hadoop/hbase/mttr/IntegrationTestMTTR.java
        Hudson added a comment -

        SUCCESS: Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #764 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/764/)
        HBASE-9655 IntegrationTestMTTR can loop forever on improperly configured clusters (ndimiduk: rev 1526729)

        • /hbase/trunk/hbase-it/src/test/java/org/apache/hadoop/hbase/mttr/IntegrationTestMTTR.java
        Hudson added a comment -

        SUCCESS: Integrated in hbase-0.96-hadoop2 #60 (See https://builds.apache.org/job/hbase-0.96-hadoop2/60/)
        HBASE-9655 IntegrationTestMTTR can loop forever on improperly configured clusters (ndimiduk: rev 1526733)

        • /hbase/branches/0.96/hbase-it/src/test/java/org/apache/hadoop/hbase/mttr/IntegrationTestMTTR.java
        Nick Dimiduk added a comment -

        For what it's worth, the 6 hour runtime business appears related to HBASE-9665.

        Nick Dimiduk added a comment -

        Thanks for the reviews Elliott, Stack.

        Nick Dimiduk added a comment -

        That's the plan, yes. Sound good, boss?

        stack added a comment -

        This going into 0.96? +1 there.

        Nick Dimiduk added a comment -

        I'm going to commit this momentarily. Speak now or forever hold your peace

        Nick Dimiduk added a comment -

        The 6 hour issue notwithstanding, I'll commit 01.patch tomorrow afternoon unless there are new objections.

        Nick Dimiduk added a comment -

        I went back through these exceptions, looking at their call hierarchy; I'm happy with the selection as it is. Patch 2 ran successfully on my test rig, but this raises the question: should this really take 6 hours?

        Elliott Clark added a comment -

        Aborting on all DoNotRetryIOEs is too generic I think. For instance, my last run aborted due to an UnknownScannerException.

        I'm on the fence with this one. Seems like any exception (other than the explicit ones you mentioned) that bubbles up to the user could be unexpected and cause for a failure. But then again, the chaos monkey is pretty extreme in this test, so maybe that's not cause for a failure.

        +1 on whatever you think.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12605062/HBASE-9655.01.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 hadoop1.0. The patch compiles against the hadoop 1.0 profile.

        +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 lineLengths. The patch does not introduce lines longer than 100

        +1 site. The mvn site goal succeeds with this patch.

        +1 core tests. The patch passed unit tests in .

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/7377//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/7377//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/7377//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/7377//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/7377//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/7377//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/7377//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/7377//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/7377//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-thrift.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/7377//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/7377//console

        This message is automatically generated.

        Nick Dimiduk added a comment -

        A new patch I'm testing out.

        Nick Dimiduk added a comment -

        Aborting on all DoNotRetryIOEs is too generic I think. For instance, my last run aborted due to an UnknownScannerException. For an application rolling over cluster woes, we'd expect the scanner to not be retried but a new scanner to be acquired, making this a non-terminal exception for the sake of this test.

        Of the list of exceptions derived from DoNotRetryIOE, I think only the following should be treated as fatal for the purposes of this test: AccessDeniedException, CoprocessorException, FatalConnectionException, InvalidFamilyOperationException, NamespaceExistException, NamespaceNotFoundException, NoSuchColumnFamilyException, TableExistsException, TableNotFoundException.

        That leaves the following exceptions which will be retried by the test harness up to the specified number of times: HBaseSnapshotException, LeaseException, NotAllMetaRegionsOnlineException, ScannerTimeoutException, TableNotDisabledException, TableNotEnabledException, UnknownScannerException.

        What say you?
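
The split proposed in this comment could be expressed along these lines. This is only a sketch and not the committed patch: FailurePolicy and isFatal are hypothetical names, and matching on simple class names is an assumption made so the example compiles without HBase on the classpath. Real code would more likely compare against the exception classes themselves.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical classifier for exceptions seen by the test harness. */
final class FailurePolicy {

    /** Simple names of the exceptions proposed as fatal in the comment. */
    private static final Set<String> FATAL =
        Collections.unmodifiableSet(new HashSet<>(Arrays.asList(
            "AccessDeniedException",
            "CoprocessorException",
            "FatalConnectionException",
            "InvalidFamilyOperationException",
            "NamespaceExistException",
            "NamespaceNotFoundException",
            "NoSuchColumnFamilyException",
            "TableExistsException",
            "TableNotFoundException")));

    private FailurePolicy() {
    }

    /**
     * Returns true if the exception, or one of its superclasses, is in the
     * fatal list. Anything else is left for the bounded retry logic.
     */
    static boolean isFatal(Throwable t) {
        for (Class<?> c = t.getClass(); c != null; c = c.getSuperclass()) {
            if (FATAL.contains(c.getSimpleName())) {
                return true;
            }
        }
        return false;
    }
}
```

Under this sketch an UnknownScannerException is not fatal, so the harness would retry it up to the configured limit, while an AccessDeniedException would abort the run immediately.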

        Nick Dimiduk added a comment -

        Ha! Do we have style guidelines against this?

        Elliott Clark added a comment -
        (null != admin)

        Yoda conditions there are.

        Other than that nit. +1

        Nick Dimiduk added a comment -

        I'm testing the patch now, attaching it here for review and the BuildBot. I took the liberty of cleaning up some compiler warnings while I was there; the meat of the fix is in TimingCallable#call().


          People

          • Assignee:
            Nick Dimiduk
            Reporter:
            Nick Dimiduk