HBase
  1. HBase
  2. HBASE-4562

When split doing offlineParentInMeta encounters error, it'll cause data loss

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.90.4
    • Fix Version/s: 0.90.5
    • Component/s: regionserver
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Follow below steps to replay the problem:
      1. change the SplitTransaction.java as below,just like mock the timeout error.

      SplitTransaction.java
            if (!testing) {
              MetaEditor.offlineParentInMeta(server.getCatalogTracker(),
                 this.parent.getRegionInfo(), a.getRegionInfo(), b.getRegionInfo());
              throw new IOException("some unexpected error in split");
            }
         

      2. update the regionserver code,restart;
      3. create a table & put some data to the table;
      4. split the table;
      5. kill the regionserver hosted the table;
      6. wait some time after master ServerShutdownHandler.process execute,then scan the table,u'll find the data wrote before lost.

      We can fix the bug just use the patch.

      1. test-4562-trunk.txt
        31 kB
        bluedavy
      2. test-4562-0.92.txt
        30 kB
        bluedavy
      3. test-4562-0.90.txt
        21 kB
        bluedavy
      4. test-4562-0.90.4.txt
        21 kB
        bluedavy
      5. HBASE-4562-trunk.patch
        3 kB
        bluedavy
      6. HBASE-4562-0.92.patch
        3 kB
        bluedavy
      7. HBASE-4562-0.90.patch
        2 kB
        bluedavy
      8. HBASE-4562-0.90.4.patch
        2 kB
        bluedavy

        Activity

        Hide
        Ted Yu added a comment -

        Normally one patch should fix one problem.

        +    } catch(IOException e){
        +      throw e;
        

        I saw the above code twice in the patch.
        Can you explain why it is needed ?

        Please also run test suite and tell us the results.

        Show
        Ted Yu added a comment - Normally one patch should fix one problem. + } catch (IOException e){ + throw e; I saw the above code twice in the patch. Can you explain why it is needed ? Please also run test suite and tell us the results.
        Hide
        bluedavy added a comment -

        @Ted Yu
        OK, I attached the patch again,and also provide the test suite results.

        Show
        bluedavy added a comment - @Ted Yu OK, I attached the patch again,and also provide the test suite results.
        Hide
        ramkrishna.s.vasudevan added a comment -

        Is it possible to write a testcase with the scenario you have told in the description?
        Minor comment:

        +        } 
        +        else {
        +          this.server.abort("Abort; we got an error after point-of-no-return");
        +        }  
        

        + } else

        { + this.server.abort("Abort; we got an error after point-of-no-return"); + }


        else block should begin in the same line where if block ends.

        Show
        ramkrishna.s.vasudevan added a comment - Is it possible to write a testcase with the scenario you have told in the description? Minor comment: + } + else { + this .server.abort( "Abort; we got an error after point-of-no- return " ); + } + } else { + this.server.abort("Abort; we got an error after point-of-no-return"); + } else block should begin in the same line where if block ends.
        Hide
        Ted Yu added a comment -

        Apache Jenkins build for 0.90 has 678 tests. The attached test report had fewer test cases.

        Also, patch for 0.92/TRUNK is needed since the fix would go to those places as well.

        Show
        Ted Yu added a comment - Apache Jenkins build for 0.90 has 678 tests. The attached test report had fewer test cases. Also, patch for 0.92/TRUNK is needed since the fix would go to those places as well.
        Hide
        bluedavy added a comment -

        @Ted Yu
        I'll check the tests...
        Also I'll attach the patch for 0.92 & TRUNK.

        Show
        bluedavy added a comment - @Ted Yu I'll check the tests... Also I'll attach the patch for 0.92 & TRUNK.
        Hide
        bluedavy added a comment -

        @Ted Yu
        I count all tests in src/test,I'm sure all test are ran...

        Show
        bluedavy added a comment - @Ted Yu I count all tests in src/test,I'm sure all test are ran...
        Hide
        Ted Yu added a comment -

        Please run patch for 0.92 or TRUNK (hopefully both) through unit tests.
        Thanks

        Show
        Ted Yu added a comment - Please run patch for 0.92 or TRUNK (hopefully both) through unit tests. Thanks
        Hide
        stack added a comment -

        I'm +1 on this patch if tests pass. Its more conservative than what we currently have. Comments in code speculated that we should probably tend to the more conservative. Thanks for making the change bluedavy.

        Show
        stack added a comment - I'm +1 on this patch if tests pass. Its more conservative than what we currently have. Comments in code speculated that we should probably tend to the more conservative. Thanks for making the change bluedavy.
        Hide
        bluedavy added a comment -

        I attached patches & test reports for 0.90.4,0.92 and trunk.

        Show
        bluedavy added a comment - I attached patches & test reports for 0.90.4,0.92 and trunk.
        Hide
        Ted Yu added a comment -

        +1 on patches.

        Show
        Ted Yu added a comment - +1 on patches.
        Hide
        Lars Hofhansl added a comment -

        Are the attached patches the full story?
        I see only PONR moved, but no code to abort the server if timing out against .META. Where is that code?

        In the trunk patch the comment seems to be mangled. The big PONR comment be moved above the comment speculating the that PONR should be moved here, just like in the 0.90 and 0.92 patches...

        In fact should that comment now be removed in all patches?

        Show
        Lars Hofhansl added a comment - Are the attached patches the full story? I see only PONR moved, but no code to abort the server if timing out against .META. Where is that code? In the trunk patch the comment seems to be mangled. The big PONR comment be moved above the comment speculating the that PONR should be moved here, just like in the 0.90 and 0.92 patches... In fact should that comment now be removed in all patches?
        Hide
        Ted Yu added a comment -

        If OfflineParentInMeta() times out, SplitRequest.run() would execute the following code:

                  if (st.rollback(this.server, this.server)) {
                    LOG.info("Successful rollback of failed split of " +
                      parent.getRegionNameAsString());
                  } else {
                    this.server.abort("Abort; we got an error after point-of-no-return");
        

        I agree that the comments should be consistent in all patches.

        Show
        Ted Yu added a comment - If OfflineParentInMeta() times out, SplitRequest.run() would execute the following code: if (st.rollback( this .server, this .server)) { LOG.info( "Successful rollback of failed split of " + parent.getRegionNameAsString()); } else { this .server.abort( "Abort; we got an error after point-of-no- return " ); I agree that the comments should be consistent in all patches.
        Hide
        Lars Hofhansl added a comment -

        I see... Thanks Ted.

        Show
        Lars Hofhansl added a comment - I see... Thanks Ted.
        Hide
        bluedavy added a comment -

        I fix the comments to keep consistent in all patches.

        Show
        bluedavy added a comment - I fix the comments to keep consistent in all patches.
        Hide
        Lars Hofhansl added a comment -

        +1 for latest patches (assuming all tests pass)

        Show
        Lars Hofhansl added a comment - +1 for latest patches (assuming all tests pass)
        Hide
        bluedavy added a comment -

        wait for committer commit to the svn.

        Show
        bluedavy added a comment - wait for committer commit to the svn.
        Hide
        Lars Hofhansl added a comment -

        Committing this too.

        Show
        Lars Hofhansl added a comment - Committing this too.
        Hide
        Lars Hofhansl added a comment -

        0.90 patch fails to apply. @bluedavy, could you double check all patches against the latest of the 0.90, 0.92, trunk, respectively?

        Show
        Lars Hofhansl added a comment - 0.90 patch fails to apply. @bluedavy, could you double check all patches against the latest of the 0.90, 0.92, trunk, respectively?
        Hide
        bluedavy added a comment -

        the patch-0.90 is for 0.90.4...

        Show
        bluedavy added a comment - the patch-0.90 is for 0.90.4...
        Hide
        Ted Yu added a comment -

        @bluedavy:
        Please generate patch for latest 0.90 and rerun the test suite.

        Thanks

        Show
        Ted Yu added a comment - @bluedavy: Please generate patch for latest 0.90 and rerun the test suite. Thanks
        Hide
        bluedavy added a comment -

        em,OK,I renamed the current patch for 0.90.4.

        Show
        bluedavy added a comment - em,OK,I renamed the current patch for 0.90.4.
        Hide
        bluedavy added a comment -

        @Lars
        I attached the patch for latest 0.90,pls apply it again & commit,thks.

        Show
        bluedavy added a comment - @Lars I attached the patch for latest 0.90,pls apply it again & commit,thks.
        Hide
        Lars Hofhansl added a comment -

        Committed to 0.90, 0.92, and trunk.

        The 0.90 patch still did not apply to the current 0.90 branch. I applied the changes manually this time, but in the future it would be great to base patches of the latests state of the branch in SVN.

        Show
        Lars Hofhansl added a comment - Committed to 0.90, 0.92, and trunk. The 0.90 patch still did not apply to the current 0.90 branch. I applied the changes manually this time, but in the future it would be great to base patches of the latests state of the branch in SVN.
        Hide
        bluedavy added a comment -

        em,thks.

        Show
        bluedavy added a comment - em,thks.
        Hide
        Lars Hofhansl added a comment - - edited

        No problem. Thanks for the patch

        Show
        Lars Hofhansl added a comment - - edited No problem. Thanks for the patch
        Hide
        Hudson added a comment -

        Integrated in HBase-TRUNK #2333 (See https://builds.apache.org/job/HBase-TRUNK/2333/)
        HBASE-4562 When split doing offlineParentInMeta encounters error, it'll cause data loss

        larsh :
        Files :

        • /hbase/trunk/CHANGES.txt
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java
        Show
        Hudson added a comment - Integrated in HBase-TRUNK #2333 (See https://builds.apache.org/job/HBase-TRUNK/2333/ ) HBASE-4562 When split doing offlineParentInMeta encounters error, it'll cause data loss larsh : Files : /hbase/trunk/CHANGES.txt /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java
        Hide
        Hudson added a comment -

        Integrated in HBase-0.92 #72 (See https://builds.apache.org/job/HBase-0.92/72/)
        HBASE-4562 When split doing offlineParentInMeta encounters error, it'll cause data loss

        larsh :
        Files :

        • /hbase/branches/0.92/CHANGES.txt
        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java
        Show
        Hudson added a comment - Integrated in HBase-0.92 #72 (See https://builds.apache.org/job/HBase-0.92/72/ ) HBASE-4562 When split doing offlineParentInMeta encounters error, it'll cause data loss larsh : Files : /hbase/branches/0.92/CHANGES.txt /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java

          People

          • Assignee:
            bluedavy
            Reporter:
            bluedavy
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development