HBase
  1. HBase
  2. HBASE-4690

Intermittent TestRegionServerCoprocessorExceptionWithAbort#testExceptionFromCoprocessorDuringPut failure

    Details

    • Type: Test Test
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.92.0
    • Fix Version/s: 0.92.0
    • Component/s: None
    • Labels:
      None

      Description

      See https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.92/83/testReport/junit/org.apache.hadoop.hbase.coprocessor/TestRegionServerCoprocessorExceptionWithAbort/testExceptionFromCoprocessorDuringPut/

      Somehow getRSForFirstRegionInTable() wasn't able to retrieve the region server.

      One fix for this issue is to spin up MiniCluster with 1 region server so that we don't need to search for the region server where first region is hosted.

        Issue Links

          Activity

          Hide
          Eugene Koontz added a comment -

          transitively related by HBASE-4798.

          Show
          Eugene Koontz added a comment - transitively related by HBASE-4798 .
          Hide
          Hudson added a comment -

          Integrated in HBase-0.92 #90 (See https://builds.apache.org/job/HBase-0.92/90/)
          HBASE-4690 Intermittent TestRegionServerCoprocessorExceptionWithAbort failure

          tedyu :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
          Show
          Hudson added a comment - Integrated in HBase-0.92 #90 (See https://builds.apache.org/job/HBase-0.92/90/ ) HBASE-4690 Intermittent TestRegionServerCoprocessorExceptionWithAbort failure tedyu : Files : /hbase/branches/0.92/CHANGES.txt /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
          Hide
          Ted Yu added a comment -

          TRUNK build 2390 has passed.

          Will open new issue if test(s) fail in the future.

          Show
          Ted Yu added a comment - TRUNK build 2390 has passed. Will open new issue if test(s) fail in the future.
          Hide
          Eugene Koontz added a comment -

          Thanks for the fix Ted; it looks good to me too. Will be watching trunk builds; hope this fixes it.

          Show
          Eugene Koontz added a comment - Thanks for the fix Ted; it looks good to me too. Will be watching trunk builds; hope this fixes it.
          Hide
          Hudson added a comment -

          Integrated in HBase-TRUNK #2388 (See https://builds.apache.org/job/HBase-TRUNK/2388/)
          HBASE-4690 Intermittent TestRegionServerCoprocessorExceptionWithAbort failure

          tedyu :
          Files :

          • /hbase/trunk/CHANGES.txt
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
          Show
          Hudson added a comment - Integrated in HBase-TRUNK #2388 (See https://builds.apache.org/job/HBase-TRUNK/2388/ ) HBASE-4690 Intermittent TestRegionServerCoprocessorExceptionWithAbort failure tedyu : Files : /hbase/trunk/CHANGES.txt /hbase/trunk/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
          Hide
          Ted Yu added a comment -

          Integrate to 0.92 and TRUNK.

          Thanks for the review Stack.

          Show
          Ted Yu added a comment - Integrate to 0.92 and TRUNK. Thanks for the review Stack.
          Hide
          stack added a comment -

          +1 on trying the patch. Commit it I'd say. We can resolve later if we go a period w/o these failures.

          Show
          stack added a comment - +1 on trying the patch. Commit it I'd say. We can resolve later if we go a period w/o these failures.
          Hide
          Ted Yu added a comment -

          Except for TestAdmin#testEnableDisableAddColumnDeleteColumn, the other test failures were due to 'Too many open files'

          I ran TestAdmin#testEnableDisableAddColumnDeleteColumn a few times and didn't encounter test failure.

          Show
          Ted Yu added a comment - Except for TestAdmin#testEnableDisableAddColumnDeleteColumn, the other test failures were due to 'Too many open files' I ran TestAdmin#testEnableDisableAddColumnDeleteColumn a few times and didn't encounter test failure.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12501496/4690-trunk.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated -166 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.master.TestMasterFailover
          org.apache.hadoop.hbase.client.TestAdmin
          org.apache.hadoop.hbase.master.TestDistributedLogSplitting

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/104//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/104//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/104//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12501496/4690-trunk.txt against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 javadoc. The javadoc tool appears to have generated -166 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.master.TestMasterFailover org.apache.hadoop.hbase.client.TestAdmin org.apache.hadoop.hbase.master.TestDistributedLogSplitting Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/104//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/104//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/104//console This message is automatically generated.
          Hide
          Ted Yu added a comment -

          Proposed patch for TRUNK.

          Show
          Ted Yu added a comment - Proposed patch for TRUNK.
          Hide
          Ted Yu added a comment - - edited

          It is pretty clear what happened in build 2384. The failure was because regions were brought online not in the same order as start keys are defined:

          2011-10-29 21:45:43,789 INFO  [RS_OPEN_REGION-hemera.apache.org,37045,1319924731557-0] regionserver.HRegion(502): Onlined observed_table,kkk,1319924743536.d2bb03652b0e69a4a192be3b60f6cd78.; next sequenceid=1
          ...
          2011-10-29 21:45:43,883 DEBUG [Thread-183] hbase.HBaseTestingUtility(1129): Found 25 rows for table observed_table
          2011-10-29 21:45:43,883 DEBUG [Thread-183] hbase.HBaseTestingUtility(1132): FirstRow=observed_table,,1319924743504.ed6d9b9f5122809fad16e61835367b48.
          2011-10-29 21:45:43,887 INFO  [RS_OPEN_REGION-hemera.apache.org,37045,1319924731557-1] regionserver.HRegion(502): Onlined observed_table,lll,1319924743540.ac163536355dbe1ab71ab1a9ee7a22d4.; next sequenceid=1
          ...
          2011-10-29 21:45:43,950 DEBUG [main-EventThread] zookeeper.ZKUtil(228): master:34047-0x13351a50a270000 Set watcher on existing znode /hbase/unassigned/ed6d9b9f5122809fad16e61835367b48
          ...
          2011-10-29 21:45:44,050 INFO  [RS_OPEN_REGION-hemera.apache.org,45759,1319924731527-0] regionserver.HRegion(502): Onlined observed_table,,1319924743504.ed6d9b9f5122809fad16e61835367b48.; next sequenceid=1
          

          We can see the ~170ms delay between the discovery of region 1319924743504.ed6d9b9f5122809fad16e61835367b48. and its actual online.

          A simple patch would be to give getRSForFirstRegionInTable() some time if index returned by hbaseCluster.getServerWith() was -1.

          Show
          Ted Yu added a comment - - edited It is pretty clear what happened in build 2384. The failure was because regions were brought online not in the same order as start keys are defined: 2011-10-29 21:45:43,789 INFO [RS_OPEN_REGION-hemera.apache.org,37045,1319924731557-0] regionserver.HRegion(502): Onlined observed_table,kkk,1319924743536.d2bb03652b0e69a4a192be3b60f6cd78.; next sequenceid=1 ... 2011-10-29 21:45:43,883 DEBUG [ Thread -183] hbase.HBaseTestingUtility(1129): Found 25 rows for table observed_table 2011-10-29 21:45:43,883 DEBUG [ Thread -183] hbase.HBaseTestingUtility(1132): FirstRow=observed_table,,1319924743504.ed6d9b9f5122809fad16e61835367b48. 2011-10-29 21:45:43,887 INFO [RS_OPEN_REGION-hemera.apache.org,37045,1319924731557-1] regionserver.HRegion(502): Onlined observed_table,lll,1319924743540.ac163536355dbe1ab71ab1a9ee7a22d4.; next sequenceid=1 ... 2011-10-29 21:45:43,950 DEBUG [main-EventThread] zookeeper.ZKUtil(228): master:34047-0x13351a50a270000 Set watcher on existing znode /hbase/unassigned/ed6d9b9f5122809fad16e61835367b48 ... 2011-10-29 21:45:44,050 INFO [RS_OPEN_REGION-hemera.apache.org,45759,1319924731527-0] regionserver.HRegion(502): Onlined observed_table,,1319924743504.ed6d9b9f5122809fad16e61835367b48.; next sequenceid=1 We can see the ~170ms delay between the discovery of region 1319924743504.ed6d9b9f5122809fad16e61835367b48. and its actual online. A simple patch would be to give getRSForFirstRegionInTable() some time if index returned by hbaseCluster.getServerWith() was -1.
          Hide
          Eugene Koontz added a comment -

          Perhaps the patch to HBASE-4701 will be useful for HBASE-4690 also.

          Show
          Eugene Koontz added a comment - Perhaps the patch to HBASE-4701 will be useful for HBASE-4690 also.
          Hide
          stack added a comment -

          It just failed on jenkins:

          -------------------------------------------------------------------------------
          Test set: org.apache.hadoop.hbase.coprocessor.TestRegionServerCoprocessorExceptionWithAbort
          -------------------------------------------------------------------------------
          Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 23.006 sec <<< FAILURE!
          testExceptionFromCoprocessorDuringPut(org.apache.hadoop.hbase.coprocessor.TestRegionServerCoprocessorExceptionWithAbort)  Time elapsed: 1.476 sec  <<< ERROR!
          java.lang.ArrayIndexOutOfBoundsException: -1
          	at java.util.concurrent.CopyOnWriteArrayList.get(CopyOnWriteArrayList.java:343)
          	at java.util.Collections$UnmodifiableList.get(Collections.java:1154)
          	at org.apache.hadoop.hbase.HBaseTestingUtility.getRSForFirstRegionInTable(HBaseTestingUtility.java:1134)
          	at org.apache.hadoop.hbase.coprocessor.TestRegionServerCoprocessorExceptionWithAbort.testExceptionFromCoprocessorDuringPut(TestRegionServerCoprocessorExceptionWithAbort.java:81)
          	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          	at java.lang.reflect.Method.invoke(Method.java:597)
          	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
          	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
          	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
          	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
          	at org.junit.internal.runners.statements.FailOnTimeout$1.run(FailOnTimeout.java:28)
          

          The logging I added shows:

          2011-10-29 21:45:43,883 DEBUG [Thread-183] hbase.HBaseTestingUtility(1129): Found 25 rows for table observed_table
          2011-10-29 21:45:43,883 DEBUG [Thread-183] hbase.HBaseTestingUtility(1132): FirstRow=observed_table,,1319924743504.ed6d9b9f5122809fad16e61835367b48.
          

          ... which would seem to say that there are plenty of regions in the table and that the first region doesn't have an associated server.. I'm guessing. Need to look back in this log to verify:

          https://builds.apache.org/job/HBase-TRUNK/2384/artifact/trunk/target/surefire-reports/org.apache.hadoop.hbase.coprocessor.TestRegionServerCoprocessorExceptionWithAbort-output.txt

          https://builds.apache.org/job/HBase-TRUNK/2384/artifact/trunk/target/surefire-reports/org.apache.hadoop.hbase.coprocessor.TestRegionServerCoprocessorExceptionWithAbort.txt

          TODO

          Show
          stack added a comment - It just failed on jenkins: ------------------------------------------------------------------------------- Test set: org.apache.hadoop.hbase.coprocessor.TestRegionServerCoprocessorExceptionWithAbort ------------------------------------------------------------------------------- Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 23.006 sec <<< FAILURE! testExceptionFromCoprocessorDuringPut(org.apache.hadoop.hbase.coprocessor.TestRegionServerCoprocessorExceptionWithAbort) Time elapsed: 1.476 sec <<< ERROR! java.lang.ArrayIndexOutOfBoundsException: -1 at java.util.concurrent.CopyOnWriteArrayList.get(CopyOnWriteArrayList.java:343) at java.util.Collections$UnmodifiableList.get(Collections.java:1154) at org.apache.hadoop.hbase.HBaseTestingUtility.getRSForFirstRegionInTable(HBaseTestingUtility.java:1134) at org.apache.hadoop.hbase.coprocessor.TestRegionServerCoprocessorExceptionWithAbort.testExceptionFromCoprocessorDuringPut(TestRegionServerCoprocessorExceptionWithAbort.java:81) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) at org.junit.internal.runners.statements.FailOnTimeout$1.run(FailOnTimeout.java:28) The logging I added shows: 2011-10-29 21:45:43,883 DEBUG [ Thread -183] hbase.HBaseTestingUtility(1129): Found 25 rows for table observed_table 2011-10-29 21:45:43,883 DEBUG [ Thread -183] hbase.HBaseTestingUtility(1132): FirstRow=observed_table,,1319924743504.ed6d9b9f5122809fad16e61835367b48. ... which would seem to say that there are plenty of regions in the table and that the first region doesn't have an associated server.. I'm guessing. Need to look back in this log to verify: https://builds.apache.org/job/HBase-TRUNK/2384/artifact/trunk/target/surefire-reports/org.apache.hadoop.hbase.coprocessor.TestRegionServerCoprocessorExceptionWithAbort-output.txt https://builds.apache.org/job/HBase-TRUNK/2384/artifact/trunk/target/surefire-reports/org.apache.hadoop.hbase.coprocessor.TestRegionServerCoprocessorExceptionWithAbort.txt TODO
          Hide
          stack added a comment -

          I've been trying it on an Ubuntu 10.04 and couldn't make it fail after 80 runs. Will need to dig in on next jenkins fail.

          Show
          stack added a comment - I've been trying it on an Ubuntu 10.04 and couldn't make it fail after 80 runs. Will need to dig in on next jenkins fail.
          Hide
          Eugene Koontz added a comment -

          Has some useful scripts (as recommended by Ted Yu) for testing intermittent failures.

          Show
          Eugene Koontz added a comment - Has some useful scripts (as recommended by Ted Yu) for testing intermittent failures.
          Hide
          Ted Yu added a comment -

          Apache Jenkins runs on Ubuntu.
          Can you find a Ubuntu box ?

          Otherwise we have to read through the log for TRUNK build 2379.

          If we cannot figure out the root cause, I suggest using 1 region server in minicluster.

          Show
          Ted Yu added a comment - Apache Jenkins runs on Ubuntu. Can you find a Ubuntu box ? Otherwise we have to read through the log for TRUNK build 2379. If we cannot figure out the root cause, I suggest using 1 region server in minicluster.
          Hide
          Eugene Koontz added a comment -

          Hi Ted,
          I ran "sh runtest-no-npe-check.sh 20 TestRegionServerCoprocessorExceptionWithAbort TestRegionServerCoprocessorExceptionWithRemove" but no failures.

          Show
          Eugene Koontz added a comment - Hi Ted, I ran "sh runtest-no-npe-check.sh 20 TestRegionServerCoprocessorExceptionWithAbort TestRegionServerCoprocessorExceptionWithRemove" but no failures.
          Hide
          Ted Yu added a comment -

          You can use runtest-no-npe-check.sh from HBASE-4480.

          Alternatively if you run test suite, you should see some test failure soon:

          Running org.apache.hadoop.hbase.coprocessor.TestRegionServerCoprocessorExceptionWithAbort
          Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 20.406 sec
          Running org.apache.hadoop.hbase.coprocessor.TestRegionServerCoprocessorExceptionWithRemove
          Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 22.116 sec <<< FAILURE!
          
          Show
          Ted Yu added a comment - You can use runtest-no-npe-check.sh from HBASE-4480 . Alternatively if you run test suite, you should see some test failure soon: Running org.apache.hadoop.hbase.coprocessor.TestRegionServerCoprocessorExceptionWithAbort Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 20.406 sec Running org.apache.hadoop.hbase.coprocessor.TestRegionServerCoprocessorExceptionWithRemove Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 22.116 sec <<< FAILURE!
          Hide
          Eugene Koontz added a comment -

          @Ted, thanks for runtest.sh, but the output is a bit misleading. TestRegionServerCoprocessorExceptionWithRemove and TestRegionServerCoprocessorExceptionWithAbort intentionally cause NPEs to test how the NPEs are handled. runtest.sh greps and finds the NPE and echos it, even if the tests pass.

          Show
          Eugene Koontz added a comment - @Ted, thanks for runtest.sh, but the output is a bit misleading. TestRegionServerCoprocessorExceptionWithRemove and TestRegionServerCoprocessorExceptionWithAbort intentionally cause NPEs to test how the NPEs are handled. runtest.sh greps and finds the NPE and echos it, even if the tests pass.
          Hide
          Eugene Koontz added a comment -

          Original source of the tests mentioned in this JIRA.

          Show
          Eugene Koontz added a comment - Original source of the tests mentioned in this JIRA.
          Hide
          Ted Yu added a comment -

          @Eugene:
          Please use runtest.sh from HBASE-4480 to run the tests in succession.
          The following command runs two tests in succession for 3 times:

          ./runtest.sh 3 TestRegionServerCoprocessorExceptionWithAbort TestRegionServerCoprocessorExceptionWithRemove
          
          Show
          Ted Yu added a comment - @Eugene: Please use runtest.sh from HBASE-4480 to run the tests in succession. The following command runs two tests in succession for 3 times: ./runtest.sh 3 TestRegionServerCoprocessorExceptionWithAbort TestRegionServerCoprocessorExceptionWithRemove
          Hide
          Eugene Koontz added a comment -

          No failures @b04234f152a on either test after 100+ iterations, but investigating why TestRegionServerCoprocessorExceptionWithAbort is failing in https://builds.apache.org/view/G-L/view/HBase/job/HBase-TRUNK/2379/console.

          Show
          Eugene Koontz added a comment - No failures @b04234f152a on either test after 100+ iterations, but investigating why TestRegionServerCoprocessorExceptionWithAbort is failing in https://builds.apache.org/view/G-L/view/HBase/job/HBase-TRUNK/2379/console .
          Hide
          Eugene Koontz added a comment -

          Hi Stack, I am up to date now (b04234f152a) and no failures with either TestRegionServerCoprocessorExceptionWithAbort or TestRegionServerCoprocessorExceptionWithRemove. I'm looking for any intermittent failures though by running a script to repeating these tests over and over today.

          I think you are right about consolidating these two tests - no reason to spin up a cluster twice. I'll open a new JIRA for that.

          Show
          Eugene Koontz added a comment - Hi Stack, I am up to date now (b04234f152a) and no failures with either TestRegionServerCoprocessorExceptionWithAbort or TestRegionServerCoprocessorExceptionWithRemove. I'm looking for any intermittent failures though by running a script to repeating these tests over and over today. I think you are right about consolidating these two tests - no reason to spin up a cluster twice. I'll open a new JIRA for that.
          Hide
          stack added a comment -

          @Eugene Are you up to date? The HBase-4300 signal might be false (I broke everything yesterday w/ that commit). Do they fail for you? I can't make them fail for me on my HW.

          Show
          stack added a comment - @Eugene Are you up to date? The HBase-4300 signal might be false (I broke everything yesterday w/ that commit). Do they fail for you? I can't make them fail for me on my HW.
          Hide
          Eugene Koontz added a comment -

          Doing git bisect, it seems that both TestRegionServerCoprocessorExceptionWithRemove and TestRegionServerCoprocessorExceptionWithAbort stop working at :

          306ab94... HBASE-4300 Start of new-version master fails if old master's znode is hanging around

          Show
          Eugene Koontz added a comment - Doing git bisect, it seems that both TestRegionServerCoprocessorExceptionWithRemove and TestRegionServerCoprocessorExceptionWithAbort stop working at : 306ab94... HBASE-4300 Start of new-version master fails if old master's znode is hanging around
          Hide
          stack added a comment -

          I added debug so next time it fails on jenkins, hopefully a bit more info.

          There are two tests named TestRegionServerCoprocessor* that have the same form that each spin up a cluster to do one thing (We should consider putting them together). Both seem to fail here checking for first region. I tried to take a look at it and it doesn't make sense. Just before the outofbounds we've checked the list is not empty.

          Lets see if the debugging added helps.

          Show
          stack added a comment - I added debug so next time it fails on jenkins, hopefully a bit more info. There are two tests named TestRegionServerCoprocessor* that have the same form that each spin up a cluster to do one thing (We should consider putting them together). Both seem to fail here checking for first region. I tried to take a look at it and it doesn't make sense. Just before the outofbounds we've checked the list is not empty. Lets see if the debugging added helps.

            People

            • Assignee:
              Ted Yu
              Reporter:
              Ted Yu
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development