HBASE-6070

AM.nodeDeleted and SSH races creating problems for regions under SPLIT

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.92.1, 0.94.0
    • Fix Version/s: 0.94.1, 0.95.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      We tried to address the problems in Master restart and RS restart while a region SPLIT is in progress as part of HBASE-5806.
      While doing some more testing we found that there is still one race condition:
      1. Split has just started and the znode is in RS_SPLIT state.
      2. RS goes down.
      3. First callback for SSH comes.
      4. As part of the fix for HBASE-5806, SSH knows that some region is in RIT.
      5. But now the nodeDeleted event comes for the SPLIT node, and there we try to delete the region from RIT.
      6. After this, SSH checks whether the region is in RIT. As we don't find the region in RIT, the region is never assigned.

      When we fixed HBASE-5806, step 6 happened first and then step 5, so we missed this ordering. Now we have found it. Will come up with a patch shortly.
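The ordering problem described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not HBase code: the class `SplitRaceSketch`, its `rit` map, and the methods `nodeDeleted`, `sshCheck`, and `run` are all hypothetical stand-ins for the AssignmentManager callback, the ServerShutdownHandler check, and the regions-in-transition bookkeeping.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model of the race; every name here is hypothetical, not an HBase API. */
final class SplitRaceSketch {
    enum State { SPLITTING, SPLIT }

    // Stand-in for the master's regions-in-transition (RIT) bookkeeping.
    private final Map<String, State> rit = new ConcurrentHashMap<>();
    private boolean assigned = false;

    SplitRaceSketch(String region) {
        rit.put(region, State.SPLITTING);
    }

    // Step 5: the old nodeDeleted behaviour removes the region from RIT.
    void nodeDeleted(String region) {
        rit.remove(region);
    }

    // Step 6: the shutdown handler reassigns only regions it still finds in RIT.
    void sshCheck(String region) {
        if (rit.containsKey(region)) {
            assigned = true;
        }
    }

    // Runs the two callbacks in the requested order and reports the outcome.
    static boolean run(boolean nodeDeletedFirst) {
        SplitRaceSketch s = new SplitRaceSketch("region-a");
        if (nodeDeletedFirst) {
            s.nodeDeleted("region-a");
            s.sshCheck("region-a");
        } else {
            s.sshCheck("region-a");
            s.nodeDeleted("region-a");
        }
        return s.assigned;
    }

    public static void main(String[] args) {
        System.out.println("SSH check first:   assigned=" + run(false));
        System.out.println("nodeDeleted first: assigned=" + run(true));
    }
}
```

In this model the region is assigned only when the shutdown-handler check runs before the node-deleted callback, which matches the ordering seen while fixing HBASE-5806; the reverse ordering leaves the region unassigned.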

      1. HBASE-6070_0.92.patch
        3 kB
        ramkrishna.s.vasudevan
      2. HBASE-6070_0.94.patch
        9 kB
        ramkrishna.s.vasudevan
      3. HBASE-6070_trunk.patch
        10 kB
        ramkrishna.s.vasudevan
      4. HBASE-6070_0.92_1.patch
        3 kB
        ramkrishna.s.vasudevan
      5. HBASE-6070_0.94_1.patch
        9 kB
        ramkrishna.s.vasudevan
      6. HBASE-6070_trunk_1.patch
        10 kB
        ramkrishna.s.vasudevan

        Activity

        ramkrishna.s.vasudevan created issue -
        ramkrishna.s.vasudevan added a comment -

        I plan to make the following change in AM.nodeDeleted. Currently, since SSH already tries to handle regions in RIT that are in the splitting state, doing the same in AM.nodeDeleted leads to a race.

        -        if (rs.isSplitting() || rs.isSplit()) {
        +        if (rs.isSplit()) {
                   LOG.debug("Ephemeral node deleted, regionserver crashed?, " +
                     "clearing from RIT; rs=" + rs);
                   regionOffline(rs.getRegion());
        

        Pls provide your suggestions.
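A minimal sketch of what the proposed condition change means, assuming a two-value state enum. The class `NodeDeletedGuardSketch` and the methods `clearsBefore` and `clearsAfter` are hypothetical names used only for illustration; they mirror the two versions of the `if` condition in the diff above and are not part of the patch.

```java
/** Illustration only: compares the old and new guard from the diff. Names are hypothetical. */
final class NodeDeletedGuardSketch {
    enum State { SPLITTING, SPLIT }

    // Old guard: both SPLITTING and SPLIT regions were cleared from RIT on node deletion.
    static boolean clearsBefore(State s) {
        return s == State.SPLITTING || s == State.SPLIT;
    }

    // New guard: only a SPLIT region is cleared; a SPLITTING region stays in RIT.
    static boolean clearsAfter(State s) {
        return s == State.SPLIT;
    }

    public static void main(String[] args) {
        for (State s : State.values()) {
            System.out.println(s + ": before=" + clearsBefore(s) + ", after=" + clearsAfter(s));
        }
    }
}
```

With the new guard a region in the splitting state is no longer removed by the node-deleted path, so the shutdown handler can still find it in RIT regardless of which callback runs first.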

        ramkrishna.s.vasudevan made changes -
        Field Original Value New Value
        Assignee ramkrishna.s.vasudevan [ ram_krish ]
        ramkrishna.s.vasudevan made changes -
        Attachment HBASE-6070_0.92.patch [ 12528960 ]
        ramkrishna.s.vasudevan made changes -
        Attachment HBASE-6070_0.94.patch [ 12528961 ]
        ramkrishna.s.vasudevan added a comment -

        Uploaded patches for all branches. Tested in cluster including scenarios for HBASE-5806. Pls review and provide your comments.

        ramkrishna.s.vasudevan made changes -
        Attachment HBASE-6070_trunk.patch [ 12528962 ]
        ramkrishna.s.vasudevan made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12528962/HBASE-6070_trunk.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified tests.

        +1 hadoop23. The patch compiles against the hadoop 0.23.x profile.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 33 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.client.TestFromClientSide

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1981//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1981//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1981//console

        This message is automatically generated.

        Ted Yu added a comment -
        +            // but the RS had went down before completing the split process then will not try to
        

        'had went down' -> 'had gone down'

        +      if(response == null) return null;
        

        Space after 'if'

        +  static Result getMetaTableRowResultAsSplittedRegion(final HRegionInfo hri, final ServerName sn)
        

        The method should be called getMetaTableRowResultAsSplitRegion().

        Should investigate the test failure in TestFromClientSide

        ramkrishna.s.vasudevan made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        ramkrishna.s.vasudevan made changes -
        Attachment HBASE-6070_0.92_1.patch [ 12529075 ]
        ramkrishna.s.vasudevan made changes -
        Attachment HBASE-6070_0.94_1.patch [ 12529076 ]
        ramkrishna.s.vasudevan added a comment -

        Updated patches addressing the review comments. I tried running the failed test case; it passed every time.

        ramkrishna.s.vasudevan made changes -
        Attachment HBASE-6070_trunk_1.patch [ 12529077 ]
        ramkrishna.s.vasudevan made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        ramkrishna.s.vasudevan made changes -
        Attachment HBASE-6070_trunk_1.patch [ 12529077 ]
        ramkrishna.s.vasudevan added a comment -

        Just reattaching the patch.

        ramkrishna.s.vasudevan made changes -
        Attachment HBASE-6070_trunk_1.patch [ 12529079 ]
        Ted Yu added a comment -

        +1 on patch v2.

        You may want to verify that the failed test below wasn't related to this change:
        https://builds.apache.org/job/PreCommit-HBASE-Build/1987/console

        ramkrishna.s.vasudevan added a comment -

        @Ted
        TestServerCustomProtocol.testSingleMethod() passes with the patch. I saw that the same test has also failed in some other precommit builds.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.regionserver.TestServerCustomProtocol

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1993//testReport/

        Ted Yu added a comment -

        All right.

        ramkrishna.s.vasudevan added a comment -

        Committed to trunk, 0.94 and 0.92.
        Thanks for the review, Ted.

        ramkrishna.s.vasudevan made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Hudson added a comment -

        Integrated in HBase-0.94 #217 (See https://builds.apache.org/job/HBase-0.94/217/)
        HBASE-6070 AM.nodeDeleted and SSH races creating problems for regions under SPLIT (Ramkrishna) (Revision 1342725)

        Result = FAILURE
        ramkrishna :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
        Hudson added a comment -

        Integrated in HBase-TRUNK #2922 (See https://builds.apache.org/job/HBase-TRUNK/2922/)
        HBASE-6070 AM.nodeDeleted and SSH races creating problems for regions under SPLIT (Ramkrishna) (Revision 1342724)

        Result = FAILURE
        ramkrishna :
        Files :

        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java
        • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/Mocking.java
        • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
        Hudson added a comment -

        Integrated in HBase-0.92 #421 (See https://builds.apache.org/job/HBase-0.92/421/)
        HBASE-6070 AM.nodeDeleted and SSH races creating problems for regions under SPLIT (Ramkrishna) (Revision 1342727)

        Result = FAILURE
        ramkrishna :
        Files :

        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
        Hudson added a comment -

        Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #16 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/16/)
        HBASE-6070 AM.nodeDeleted and SSH races creating problems for regions under SPLIT (Ramkrishna) (Revision 1342724)

        Result = FAILURE
        ramkrishna :
        Files :

        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java
        • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/Mocking.java
        • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
        Hudson added a comment -

        Integrated in HBase-0.94-security #32 (See https://builds.apache.org/job/HBase-0.94-security/32/)
        HBASE-6070 AM.nodeDeleted and SSH races creating problems for regions under SPLIT (Ramkrishna) (Revision 1342725)

        Result = SUCCESS
        ramkrishna :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
        ramkrishna.s.vasudevan made changes -
        Hadoop Flags Reviewed [ 10343 ]
        Hudson added a comment -

        Integrated in HBase-0.92-security #109 (See https://builds.apache.org/job/HBase-0.92-security/109/)
        HBASE-6070 AM.nodeDeleted and SSH races creating problems for regions under SPLIT (Ramkrishna) (Revision 1342727)

        Result = SUCCESS
        ramkrishna :
        Files :

        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
        Lars Hofhansl made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Tianying Chang added a comment -

        @ram,

        I am reading the code related to region split. I feel that the code below in AssignmentManager is dead code, because 1) I don't see any place that updates the regionState to State.SPLIT, and 2) for the scenario where the region has already been split and the RS crashed, ServerShutdownHandler should already have taken care of it. Am I missing something here? Thanks.

        if (rs.isSplit()) {
          LOG.debug("Ephemeral node deleted, regionserver crashed?, " +
            "clearing from RIT; rs=" + rs);
          regionOffline(rs.getRegion());
        }

        stack added a comment -

        Tianying Chang Would you mind making a new issue to remove the dead code? Thank you.

        Tianying Chang added a comment -

        @stack

        Thanks. I want to get a second opinion from others, and I think it is better to do that in a separate JIRA, so I have created HBASE-7058 for this purpose. If other people find no potential problem, I can provide a patch.

        stack made changes -
        Fix Version/s 0.95.0 [ 12324094 ]
        Fix Version/s 0.92.2 [ 12319888 ]
        Fix Version/s 0.96.0 [ 12320040 ]
        Fix Version/s 0.94.1 [ 12320257 ]
        Lars Hofhansl made changes -
        Fix Version/s 0.94.1 [ 12320257 ]

          People

          • Assignee:
            ramkrishna.s.vasudevan
            Reporter:
            ramkrishna.s.vasudevan
          • Votes:
            0
            Watchers:
            8