HBase
  1. HBase
  2. HBASE-4842

[hbck] Fix intermittent failures on TestHBaseFsck.testHBaseFsck

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.90.4, 0.92.0, 0.94.0
    • Fix Version/s: 0.92.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Tags:
      hbck

      Description

      Its seems that on the 0.92 branch in particular, TestHBaseFsck.testHBaseFsck is intermittently failing.

      In the test, a region's assignment is purposely changed in META but not in ZK. After the equivalent of 'hbck -fix', a subsequent check that should be clean comes up with a new ZK assignment but with META still being inconsistent with ZK. The RS in ZK sometimes this points to the same RS, but sometimes it "moves" to another ZK.

      1. hbase-4842-breaker.patch
        1 kB
        Jonathan Hsieh
      2. hbase-4842.patch
        0.7 kB
        Jonathan Hsieh
      3. 4842-v3.txt
        1 kB
        stack

        Issue Links

          Activity

          Hide
          Jonathan Hsieh added a comment -

          Output Examples:

          Note that the ZK assignment and the META assignment did not change.

          // hbck -fix call
          ERROR: Region tableBadMetaAssign,,1321733234211.35120fc878802e3b6829e6d7b597b44c. listed in META on region server ubuntu64-build01.sf.cloudera.com,51134,1321733229687 but found on region server ubuntu64-build01.sf.cloudera.com,38112,1321733229583
          Trying to fix assignment error...
          ...
          // hbck after fix
          ERROR: Region tableBadMetaAssign,,1321733234211.35120fc878802e3b6829e6d7b597b44c. listed in META on region server ubuntu64-build01.sf.cloudera.com,51134,1321733229687 but found on region server ubuntu64-build01.sf.cloudera.com,38112,1321733229583
          

          Note that the ZK assignment changed but meta had not yet changed.

          // hbck -fix
          ERROR: Region tableBadMetaAssign,,1321719700727.af24fbbe3e1df676b8e31e3ff5765fb6. listed in META on region server p0123.sf.cloudera.com,36067,1321719696277 but found on region server p0123.sf.cloudera.com,54221,1321719696237
          Trying to fix assignment error...
          ...
          // hbck after fix
          ERROR: Region tableBadMetaAssign,,1321719700727.af24fbbe3e1df676b8e31e3ff5765fb6. listed in META on region server p0123.sf.cloudera.com,36067,1321719696277 but found on region server p0123.sf.cloudera.com,59522,1321719696305
          
          Show
          Jonathan Hsieh added a comment - Output Examples: Note that the ZK assignment and the META assignment did not change. // hbck -fix call ERROR: Region tableBadMetaAssign,,1321733234211.35120fc878802e3b6829e6d7b597b44c. listed in META on region server ubuntu64-build01.sf.cloudera.com,51134,1321733229687 but found on region server ubuntu64-build01.sf.cloudera.com,38112,1321733229583 Trying to fix assignment error... ... // hbck after fix ERROR: Region tableBadMetaAssign,,1321733234211.35120fc878802e3b6829e6d7b597b44c. listed in META on region server ubuntu64-build01.sf.cloudera.com,51134,1321733229687 but found on region server ubuntu64-build01.sf.cloudera.com,38112,1321733229583 Note that the ZK assignment changed but meta had not yet changed. // hbck -fix ERROR: Region tableBadMetaAssign,,1321719700727.af24fbbe3e1df676b8e31e3ff5765fb6. listed in META on region server p0123.sf.cloudera.com,36067,1321719696277 but found on region server p0123.sf.cloudera.com,54221,1321719696237 Trying to fix assignment error... ... // hbck after fix ERROR: Region tableBadMetaAssign,,1321719700727.af24fbbe3e1df676b8e31e3ff5765fb6. listed in META on region server p0123.sf.cloudera.com,36067,1321719696277 but found on region server p0123.sf.cloudera.com,59522,1321719696305
          Hide
          stack added a comment -

          Is this an hbck issue Jon or are is it in our recovery code?

          Show
          stack added a comment - Is this an hbck issue Jon or are is it in our recovery code?
          Hide
          Jonathan Hsieh added a comment -

          Hm.. this looks like a race or due to the lack of a rendezvous of some sort. Up to HBASE-4378, there was a 15000ms (yikes 15 sec!) sleep between the 'hbck -fix' call and the subsequent 'hbck' call that is supposed to be clean. HBASE-4703 removed this.

          My hunch is that maybe the update to META the 'hbck -fix' does isn't seen on the second 'hbck' run.

          https://github.com/apache/hbase/commit/6ca0e79a6ac92190238d5cda56f787ab9702d7fc#L61L138
          TestHBaseFsck.java:138

          Show
          Jonathan Hsieh added a comment - Hm.. this looks like a race or due to the lack of a rendezvous of some sort. Up to HBASE-4378 , there was a 15000ms (yikes 15 sec!) sleep between the 'hbck -fix' call and the subsequent 'hbck' call that is supposed to be clean. HBASE-4703 removed this. My hunch is that maybe the update to META the 'hbck -fix' does isn't seen on the second 'hbck' run. https://github.com/apache/hbase/commit/6ca0e79a6ac92190238d5cda56f787ab9702d7fc#L61L138 TestHBaseFsck.java:138
          Hide
          Jonathan Hsieh added a comment -

          The story behind this problem.

          HBCK repairs a bad assignment using the admin interface to reassign a particular region. offlining the region in ZK. This calls master.assign – eventually the master uses its serverManager and issues an HRegionServer.openRegion().

          Looks like the HRegionServer.openRegion being essentially asynchronous and causes the failure. The call submits an OpenRegionHandler (ORH) callback to the RS's ExecutorService and then immediately returns the RegionState to OPENED.

          The ORH thread calls ORH.process -> updateMeta, which creates a PostOpenDeployTaskThread and starts another thread that calls HRegionServer.postOpenDeployTasks -> MetaEditor.updateRegionLocation which updates the meta table.

          The problem is that the RegionState OPENED is reported to the master even though it may not have written all its new assignment to META yet.

          Show
          Jonathan Hsieh added a comment - The story behind this problem. HBCK repairs a bad assignment using the admin interface to reassign a particular region. offlining the region in ZK. This calls master.assign – eventually the master uses its serverManager and issues an HRegionServer.openRegion(). Looks like the HRegionServer.openRegion being essentially asynchronous and causes the failure. The call submits an OpenRegionHandler (ORH) callback to the RS's ExecutorService and then immediately returns the RegionState to OPENED. The ORH thread calls ORH.process -> updateMeta, which creates a PostOpenDeployTaskThread and starts another thread that calls HRegionServer.postOpenDeployTasks -> MetaEditor.updateRegionLocation which updates the meta table. The problem is that the RegionState OPENED is reported to the master even though it may not have written all its new assignment to META yet.
          Hide
          Jonathan Hsieh added a comment -

          I've attached a patch that inserts a sleep into the RegionServer code right before writing to meta which causes the test to fail consistently. There are some hanging threads if you run this using mvn. I ran the change in eclipse as a unit test where it fails the test (but the unit test remains hung).

          Show
          Jonathan Hsieh added a comment - I've attached a patch that inserts a sleep into the RegionServer code right before writing to meta which causes the test to fail consistently. There are some hanging threads if you run this using mvn. I ran the change in eclipse as a unit test where it fails the test (but the unit test remains hung).
          Hide
          stack added a comment -

          So, how you think we fix Jon?

          Show
          stack added a comment - So, how you think we fix Jon?
          Hide
          Jonathan Hsieh added a comment -

          With this patch, I add a 1s delay between 'hbck -fix' and 'hbck'. The test new seems to pass consistently (ran 20x).

          Show
          Jonathan Hsieh added a comment - With this patch, I add a 1s delay between 'hbck -fix' and 'hbck'. The test new seems to pass consistently (ran 20x).
          Hide
          Jonathan Hsieh added a comment -

          Stack.

          For now the adding a sleep. Longer term adding some synchronization options for the open region call or add update the regions state to returning something like OPENING state and then OPEN state after meta and zk have been updated.

          Show
          Jonathan Hsieh added a comment - Stack. For now the adding a sleep. Longer term adding some synchronization options for the open region call or add update the regions state to returning something like OPENING state and then OPEN state after meta and zk have been updated.
          Hide
          Jonathan Hsieh added a comment -

          Looks like it failed on iteration 39..

          Show
          Jonathan Hsieh added a comment - Looks like it failed on iteration 39..
          Hide
          Jonathan Hsieh added a comment -

          Hm.. minicluster failed to start properly in that one. Seems likely due to problem in there somewhere.

          Show
          Jonathan Hsieh added a comment - Hm.. minicluster failed to start properly in that one. Seems likely due to problem in there somewhere.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12504690/hbase-4842.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated -162 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 65 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.client.TestAdmin
          org.apache.hadoop.hbase.replication.TestReplication

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/324//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/324//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/324//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12504690/hbase-4842.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 javadoc. The javadoc tool appears to have generated -162 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 65 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.client.TestAdmin org.apache.hadoop.hbase.replication.TestReplication Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/324//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/324//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/324//console This message is automatically generated.
          Hide
          stack added a comment -

          The problem is that the RegionState OPENED is reported to the master even though it may not have written all its new assignment to META yet.

          This return is ignored by the master, right? Unless its returning RegionOpeningState.ALREADY_OPENED?

          Should I commit this patch meantime? It seems to make things better for the moment (Sorry for my being anxious on this one and thanks for working on it Jon... but I can't build 0.92 locally because of this and I want to cut an RC – thanks for the help).

          Show
          stack added a comment - The problem is that the RegionState OPENED is reported to the master even though it may not have written all its new assignment to META yet. This return is ignored by the master, right? Unless its returning RegionOpeningState.ALREADY_OPENED? Should I commit this patch meantime? It seems to make things better for the moment (Sorry for my being anxious on this one and thanks for working on it Jon... but I can't build 0.92 locally because of this and I want to cut an RC – thanks for the help).
          Hide
          stack added a comment -

          J-D suggests disabling distributed splitting helps w/ this issue:

          Index: src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java
          ===================================================================
          --- src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java       (revision 1204807)
          +++ src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java       (working copy)
          @@ -70,6 +70,7 @@
             
             @BeforeClass
             public static void setUpBeforeClass() throws Exception {
          +    TEST_UTIL.getConfiguration().setBoolean("hbase.master.distributed.log.splitting", false);
               TEST_UTIL.startMiniCluster(3);
             }
          

          I've run it 10 times and it seems to work. Running it 30 times more to see if I can beat Jon's fix.

          Show
          stack added a comment - J-D suggests disabling distributed splitting helps w/ this issue: Index: src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java =================================================================== --- src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java (revision 1204807) +++ src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java (working copy) @@ -70,6 +70,7 @@ @BeforeClass public static void setUpBeforeClass() throws Exception { + TEST_UTIL.getConfiguration().setBoolean( "hbase.master.distributed.log.splitting" , false ); TEST_UTIL.startMiniCluster(3); } I've run it 10 times and it seems to work. Running it 30 times more to see if I can beat Jon's fix.
          Hide
          stack added a comment -

          nvm. It failed on the 11th try.

          Show
          stack added a comment - nvm. It failed on the 11th try.
          Hide
          stack added a comment -

          Hmm... might have had something else running at same time that caused it fail.

          Show
          stack added a comment - Hmm... might have had something else running at same time that caused it fail.
          Hide
          stack added a comment -

          Here is a patch w/ Jon's 1 second pause and disabling distributed splitting. I applied it to 0.92 for now.

          Show
          stack added a comment - Here is a patch w/ Jon's 1 second pause and disabling distributed splitting. I applied it to 0.92 for now.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12504704/4842-v3.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/329//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12504704/4842-v3.txt against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/329//console This message is automatically generated.
          Hide
          Hudson added a comment -

          Integrated in HBase-0.92-security #6 (See https://builds.apache.org/job/HBase-0.92-security/6/)
          HBASE-4842 [hbck] Fix intermittent failures on TestHBaseFsck.testHBaseFsck

          stack :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java
          Show
          Hudson added a comment - Integrated in HBase-0.92-security #6 (See https://builds.apache.org/job/HBase-0.92-security/6/ ) HBASE-4842 [hbck] Fix intermittent failures on TestHBaseFsck.testHBaseFsck stack : Files : /hbase/branches/0.92/CHANGES.txt /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java
          Hide
          Hudson added a comment -

          Integrated in HBase-0.92 #159 (See https://builds.apache.org/job/HBase-0.92/159/)
          HBASE-4842 [hbck] Fix intermittent failures on TestHBaseFsck.testHBaseFsck

          stack :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java
          Show
          Hudson added a comment - Integrated in HBase-0.92 #159 (See https://builds.apache.org/job/HBase-0.92/159/ ) HBASE-4842 [hbck] Fix intermittent failures on TestHBaseFsck.testHBaseFsck stack : Files : /hbase/branches/0.92/CHANGES.txt /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java
          Hide
          stack added a comment -

          Committed to TRUNK for completeness sake.

          I've messed up your issue Jon. Please forgive me. Will we resolve this one and open a new one for the root issue?

          Show
          stack added a comment - Committed to TRUNK for completeness sake. I've messed up your issue Jon. Please forgive me. Will we resolve this one and open a new one for the root issue?
          Hide
          Jonathan Hsieh added a comment -

          I'll file a new issue.

          The main issue isn't what is returned, but when. With the first 'hbck -fix', the master makes a call to the regionserver to issue a request open the region (which adds data to meta). This returns right away. The next hbck call will cause the master query meta again which is used to check consistency. Sometimes the new meta entries are fixed before the second hbck call is done (failing the test), sometimes it is not (not failing).

          The slight delay allows the open request to finish and the meta entry to be updated before the subsequent 'hbck' call.

          Show
          Jonathan Hsieh added a comment - I'll file a new issue. The main issue isn't what is returned, but when. With the first 'hbck -fix', the master makes a call to the regionserver to issue a request open the region (which adds data to meta). This returns right away. The next hbck call will cause the master query meta again which is used to check consistency. Sometimes the new meta entries are fixed before the second hbck call is done (failing the test), sometimes it is not (not failing). The slight delay allows the open request to finish and the meta entry to be updated before the subsequent 'hbck' call.
          Hide
          Jonathan Hsieh added a comment -

          Also, I don't think dist log splitting has anything do to with this failure.

          Show
          Jonathan Hsieh added a comment - Also, I don't think dist log splitting has anything do to with this failure.
          Hide
          stack added a comment -

          Also, I don't think dist log splitting has anything do to with this failure.

          True. This was a misunderstanding on my part of a J-D comment up on IRC.

          Show
          stack added a comment - Also, I don't think dist log splitting has anything do to with this failure. True. This was a misunderstanding on my part of a J-D comment up on IRC.
          Hide
          Hudson added a comment -

          Integrated in HBase-TRUNK #2473 (See https://builds.apache.org/job/HBase-TRUNK/2473/)
          HBASE-4842 [hbck] Fix intermittent failures on TestHBaseFsck.testHBaseFsck

          stack :
          Files :

          • /hbase/trunk/CHANGES.txt
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java
          Show
          Hudson added a comment - Integrated in HBase-TRUNK #2473 (See https://builds.apache.org/job/HBase-TRUNK/2473/ ) HBASE-4842 [hbck] Fix intermittent failures on TestHBaseFsck.testHBaseFsck stack : Files : /hbase/trunk/CHANGES.txt /hbase/trunk/src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java
          Hide
          Hudson added a comment -

          Integrated in HBase-TRUNK-security #6 (See https://builds.apache.org/job/HBase-TRUNK-security/6/)
          HBASE-4842 [hbck] Fix intermittent failures on TestHBaseFsck.testHBaseFsck

          stack :
          Files :

          • /hbase/trunk/CHANGES.txt
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java
          Show
          Hudson added a comment - Integrated in HBase-TRUNK-security #6 (See https://builds.apache.org/job/HBase-TRUNK-security/6/ ) HBASE-4842 [hbck] Fix intermittent failures on TestHBaseFsck.testHBaseFsck stack : Files : /hbase/trunk/CHANGES.txt /hbase/trunk/src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java
          Hide
          Jonathan Hsieh added a comment -

          This was reviewed and committed by stack back in 0.92.

          Show
          Jonathan Hsieh added a comment - This was reviewed and committed by stack back in 0.92.

            People

            • Assignee:
              Jonathan Hsieh
              Reporter:
              Jonathan Hsieh
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development