HBase
  1. HBase
  2. HBASE-5806

Handle split region related failures on master restart and RS restart

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.92.1
    • Fix Version/s: 0.94.1, 0.95.0
    • Component/s: None
    • Labels:
      None

      Description

      This issue is raised to solve issues that comes out of partial region split happened and the region node in the ZK which is in RS_ZK_REGION_SPLITTING and RS_ZK_REGION_SPLIT is not yet processed.
      This also tries to address HBASE-5615.

      1. HBASE-5806.patch
        2 kB
        Chinna Rao Lalam
      2. HBASE-5806_trunk.patch
        12 kB
        Chinna Rao Lalam
      3. HBASE-5806_trunk_3.patch
        14 kB
        Chinna Rao Lalam
      4. HBASE-5806_trunk_2.patch
        14 kB
        Chinna Rao Lalam
      5. HBASE-5806_trunk_1.patch
        13 kB
        Chinna Rao Lalam
      6. HBASE-5806_0.94.patch
        13 kB
        Chinna Rao Lalam
      7. HBASE-5806_0.94_2.patch
        14 kB
        Chinna Rao Lalam
      8. HBASE-5806_0.94_1.patch
        13 kB
        Chinna Rao Lalam
      9. HBASE-5806_0.92.patch
        14 kB
        Chinna Rao Lalam

        Activity

        Hide
        ramkrishna.s.vasudevan added a comment -

        We will come up with a patch sooner.

        Show
        ramkrishna.s.vasudevan added a comment - We will come up with a patch sooner.
        Hide
        Chinna Rao Lalam added a comment -

        Here considered following scenarios:

        1.If the regionserver is restarted after removing the parent region from the online regions.
        2.If Master is restarted while doing the region split in RS and it is in the flow of tickling from SPLIT-SPLIT or SPLITTING-SPLIT.
        3.If Master is restarted after splitting is completely done and before deleting the region from META using catalogjanitor.

        In the first scenario the problem is, in ServerShutdownHandler while constructing hris it will check whether it is in RIT and is !isClosing and !isPendingClose and it will remove from hris. Remaining hris it will try to assign and no one will attempt to assign that region.

              // Skip regions that were in transition unless CLOSING or PENDING_CLOSE
              for (RegionState rit : regionsInTransition) {
                if (!rit.isClosing() && !rit.isPendingClose() && !rit.isSplitting()) {
                  LOG.debug("Removed " + rit.getRegion().getRegionNameAsString() +
                  " from list of regions to assign because in RIT; region state: " +
                  rit.getState());
                  if (hris != null) hris.remove(rit.getRegion());
                }
              }
        

        In the second scenario the problem is, in AssignmentManager while rebuildUserRegions() it should not consider the region which will have the znode with Split or Splitting state because region split might be completed or partially done.

        In the third scenario the problem is, region split is completely done and it is not yet deleted from META using catalogjanitor so in
        AssignmentManager while rebuildUserRegions() it should not consider this region.

        Show
        Chinna Rao Lalam added a comment - Here considered following scenarios: 1.If the regionserver is restarted after removing the parent region from the online regions. 2.If Master is restarted while doing the region split in RS and it is in the flow of tickling from SPLIT-SPLIT or SPLITTING-SPLIT. 3.If Master is restarted after splitting is completely done and before deleting the region from META using catalogjanitor. In the first scenario the problem is, in ServerShutdownHandler while constructing hris it will check whether it is in RIT and is !isClosing and !isPendingClose and it will remove from hris. Remaining hris it will try to assign and no one will attempt to assign that region. // Skip regions that were in transition unless CLOSING or PENDING_CLOSE for (RegionState rit : regionsInTransition) { if (!rit.isClosing() && !rit.isPendingClose() && !rit.isSplitting()) { LOG.debug( "Removed " + rit.getRegion().getRegionNameAsString() + " from list of regions to assign because in RIT; region state: " + rit.getState()); if (hris != null ) hris.remove(rit.getRegion()); } } In the second scenario the problem is, in AssignmentManager while rebuildUserRegions() it should not consider the region which will have the znode with Split or Splitting state because region split might be completed or partially done. In the third scenario the problem is, region split is completely done and it is not yet deleted from META using catalogjanitor so in AssignmentManager while rebuildUserRegions() it should not consider this region.
        Hide
        Chinna Rao Lalam added a comment -

        I uploaded initial patch not yet tested. I am analyzing more on this.

        Show
        Chinna Rao Lalam added a comment - I uploaded initial patch not yet tested. I am analyzing more on this.
        Hide
        stack added a comment -

        Chinna This is good work. Can you provide more detail on the scenarios you are fixing? I'm trying to make sure I understand. For example, for #1 above, we are talking about the regionserver crashing somewhere here (from SplitTransaction):

            if (!testing) {
              services.removeFromOnlineRegions(this.parent.getRegionInfo().getEncodedName());
            }
            this.journal.add(JournalEntry.OFFLINED_PARENT);
        

        Is that right?

        The patch seems reasonable. Could we make a test at all? There are some tests around splitting already w/ Mocks. Could we add to these?

        Good stuff.

        Show
        stack added a comment - Chinna This is good work. Can you provide more detail on the scenarios you are fixing? I'm trying to make sure I understand. For example, for #1 above, we are talking about the regionserver crashing somewhere here (from SplitTransaction): if (!testing) { services.removeFromOnlineRegions( this .parent.getRegionInfo().getEncodedName()); } this .journal.add(JournalEntry.OFFLINED_PARENT); Is that right? The patch seems reasonable. Could we make a test at all? There are some tests around splitting already w/ Mocks. Could we add to these? Good stuff.
        Hide
        Chinna Rao Lalam added a comment -

        for #1 above,
        RegionServer is crashed at SplitTransaction.createDaughters(Server, RegionServerServices) in while removing from online regions()

            if (!testing) {
              services.removeFromOnlineRegions(this.parent.getRegionInfo().getEncodedName());
            }
        

        Here where ever the regionserver is crashed the ephemeral node will be deleted and master will get the notification of nodeDeleted() where it will be cleared from RIT

        But the ServerShutdownHandler executed first than the nodeDeleted() event for the region node.
        You can see that from the below logs

        2012-04-06 14:35:08,841 DEBUG org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Removed test,,1333702991530.cdfa837563e75ac5f4dc128680cc8da8. from list of regions to assign because in RIT; region state: SPLITTING
        
        2012-04-06 14:35:12,981 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Ephemeral node deleted, regionserver crashed?, clearing from RIT; rs=test,,1333702991530.cdfa837563e75ac5f4dc128680cc8da8. state=SPLITTING, ts=1333703059260, server=HOST-10-18-40-25,60020,1333695183392
        

        In this situation the below code populated that region

          List<RegionState> regionsInTransition =
                this.services.getAssignmentManager().
                  processServerShutdown(this.serverName);
        

        and it is in !rit.isClosing() && !rit.isPendingClose() so the region is deleted from the hris

              for (RegionState rit : regionsInTransition) {
                if (!rit.isClosing() && !rit.isPendingClose()) {
                  LOG.debug("Removed " + rit.getRegion().getRegionNameAsString() +
                  " from list of regions to assign because in RIT; region state: " +
                  rit.getState());
                  if (hris != null) hris.remove(rit.getRegion());
                }
              }
        

        The fix in SSH addresses #1.
        #2 came because of HBASE-5615. However HBASE-5615 was reverted.
        #3 comes when master restarts after sp1itting is done and before CJ has cleared the region from META. So while rebuilding the user region we ensure that the offlined parent region is not again taken into account.

        #2 and #3 are together taken care in this patch such that the fix does solve both the problems.

        Show
        Chinna Rao Lalam added a comment - for #1 above, RegionServer is crashed at SplitTransaction.createDaughters(Server, RegionServerServices) in while removing from online regions() if (!testing) { services.removeFromOnlineRegions( this .parent.getRegionInfo().getEncodedName()); } Here where ever the regionserver is crashed the ephemeral node will be deleted and master will get the notification of nodeDeleted() where it will be cleared from RIT But the ServerShutdownHandler executed first than the nodeDeleted() event for the region node. You can see that from the below logs 2012-04-06 14:35:08,841 DEBUG org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Removed test,,1333702991530.cdfa837563e75ac5f4dc128680cc8da8. from list of regions to assign because in RIT; region state: SPLITTING 2012-04-06 14:35:12,981 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Ephemeral node deleted, regionserver crashed?, clearing from RIT; rs=test,,1333702991530.cdfa837563e75ac5f4dc128680cc8da8. state=SPLITTING, ts=1333703059260, server=HOST-10-18-40-25,60020,1333695183392 In this situation the below code populated that region List<RegionState> regionsInTransition = this .services.getAssignmentManager(). processServerShutdown( this .serverName); and it is in !rit.isClosing() && !rit.isPendingClose() so the region is deleted from the hris for (RegionState rit : regionsInTransition) { if (!rit.isClosing() && !rit.isPendingClose()) { LOG.debug( "Removed " + rit.getRegion().getRegionNameAsString() + " from list of regions to assign because in RIT; region state: " + rit.getState()); if (hris != null ) hris.remove(rit.getRegion()); } } The fix in SSH addresses #1. #2 came because of HBASE-5615 . However HBASE-5615 was reverted. #3 comes when master restarts after sp1itting is done and before CJ has cleared the region from META. So while rebuilding the user region we ensure that the offlined parent region is not again taken into account. #2 and #3 are together taken care in this patch such that the fix does solve both the problems.
        Hide
        Chinna Rao Lalam added a comment -

        Updated the patch with testcases. In trunk patch added this change in AssignmentManager.java to address the HBASE-5654 issue

               default:
        -        throw new IllegalStateException("Received event is not valid.");
        +        break;
               }
        
        Show
        Chinna Rao Lalam added a comment - Updated the patch with testcases. In trunk patch added this change in AssignmentManager.java to address the HBASE-5654 issue default : - throw new IllegalStateException( "Received event is not valid." ); + break ; }
        Hide
        Chinna Rao Lalam added a comment -

        Updated patch for 0.94 & trunk.

        Show
        Chinna Rao Lalam added a comment - Updated patch for 0.94 & trunk.
        Hide
        Ted Yu added a comment -

        Overall, patch looks good.

               default:
        -        throw new IllegalStateException("Received event is not valid.");
        +        break;
        

        Should the event be logged in the default case ?

        +          // If Znode not exists dont consider this region
        +          if (data == null) {
        

        'not exists' -> 'does not exist'

        +  public static class MockedMaster extends HMaster {
        

        I think the class should be private. A better name would be MockMasterWithoutCatalogJanitor.

        Show
        Ted Yu added a comment - Overall, patch looks good. default : - throw new IllegalStateException( "Received event is not valid." ); + break ; Should the event be logged in the default case ? + // If Znode not exists dont consider this region + if (data == null ) { 'not exists' -> 'does not exist' + public static class MockedMaster extends HMaster { I think the class should be private. A better name would be MockMasterWithoutCatalogJanitor.
        Hide
        ramkrishna.s.vasudevan added a comment -
        default:
        -        throw new IllegalStateException("Received event is not valid.");
        +        break;
        

        Good that this testcase brought it out.

        +  public static class MockedMaster extends HMaster {
        

        It has to be public otherwise the LocalHBaseCluster does not allow us to update the HMaster class.

        Show
        ramkrishna.s.vasudevan added a comment - default : - throw new IllegalStateException( "Received event is not valid." ); + break ; Good that this testcase brought it out. + public static class MockedMaster extends HMaster { It has to be public otherwise the LocalHBaseCluster does not allow us to update the HMaster class.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12525862/HBASE-5806_trunk.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 hadoop23. The patch compiles against the hadoop 0.23.x profile.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in .

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1787//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1787//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1787//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12525862/HBASE-5806_trunk.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 hadoop23. The patch compiles against the hadoop 0.23.x profile. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1787//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1787//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1787//console This message is automatically generated.
        Hide
        Chinna Rao Lalam added a comment -

        Thanks for the review Ted. Updated the patch with the review comments.

        Show
        Chinna Rao Lalam added a comment - Thanks for the review Ted. Updated the patch with the review comments.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12525884/HBASE-5806_trunk_1.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 hadoop23. The patch compiles against the hadoop 0.23.x profile.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in .

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1790//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1790//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1790//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12525884/HBASE-5806_trunk_1.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 hadoop23. The patch compiles against the hadoop 0.23.x profile. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1790//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1790//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1790//console This message is automatically generated.
        Hide
        Ted Yu added a comment -

        +1 on patch v2.

        +  private MockMasterWithoutCatalogJanitor abortAndWaitForMaster() throws IOException, InterruptedException {
        +    cluster.abortMaster(0);
        +    cluster.waitOnMaster(0);
        +    cluster.getConfiguration().setClass(HConstants.MASTER_IMPL, MockMasterWithoutCatalogJanitor.class, HMaster.class);
        

        Please wrap the two lines whose length exceeds 100 chars at time of integration.

        Show
        Ted Yu added a comment - +1 on patch v2. + private MockMasterWithoutCatalogJanitor abortAndWaitForMaster() throws IOException, InterruptedException { + cluster.abortMaster(0); + cluster.waitOnMaster(0); + cluster.getConfiguration().setClass(HConstants.MASTER_IMPL, MockMasterWithoutCatalogJanitor.class, HMaster.class); Please wrap the two lines whose length exceeds 100 chars at time of integration.
        Hide
        stack added a comment -

        @Chinna Very nice work. Thanks. I can wrap the long lines on commit. Here's some questions on the patch just out of interest:

        -    JVMClusterUtil.MasterThread mt =
        -      JVMClusterUtil.createMasterThread(c,
        -        this.masterClass, index);
        +    JVMClusterUtil.MasterThread mt = JVMClusterUtil.createMasterThread(c,
        +        (Class<? extends HMaster>) c.getClass(HConstants.MASTER_IMPL, HMaster.class), index);
             this.masterThreads.add(mt);
        

        What brought on the above change? Was this needed so you could add your mocking tests or is it that you fellas are doing a master subclass?

        I don't understand why we need this change (because of hbase-5654 ?):

        default:
        -        throw new IllegalStateException("Received event is not valid.");
        +        break;
               }
        

        Log that we are going to pass on a region here?

        +          // If znode does not exist dont consider this region
        +          if (data == null) {
        +            continue;
        +          }
        

        At debug level?

        Show
        stack added a comment - @Chinna Very nice work. Thanks. I can wrap the long lines on commit. Here's some questions on the patch just out of interest: - JVMClusterUtil.MasterThread mt = - JVMClusterUtil.createMasterThread(c, - this .masterClass, index); + JVMClusterUtil.MasterThread mt = JVMClusterUtil.createMasterThread(c, + ( Class <? extends HMaster>) c.getClass(HConstants.MASTER_IMPL, HMaster.class), index); this .masterThreads.add(mt); What brought on the above change? Was this needed so you could add your mocking tests or is it that you fellas are doing a master subclass? I don't understand why we need this change (because of hbase-5654 ?): default : - throw new IllegalStateException( "Received event is not valid." ); + break ; } Log that we are going to pass on a region here? + // If znode does not exist dont consider this region + if (data == null ) { + continue ; + } At debug level?
        Hide
        Chinna Rao Lalam added a comment -

        Thanks for the review Stack. Patch updated with the debug message.

        -    JVMClusterUtil.MasterThread mt =
        -      JVMClusterUtil.createMasterThread(c,
        -        this.masterClass, index);
        +    JVMClusterUtil.MasterThread mt = JVMClusterUtil.createMasterThread(c,
        +        (Class<? extends HMaster>) c.getClass(HConstants.MASTER_IMPL, HMaster.class), index);
             this.masterThreads.add(mt);
        

        Added the above change for setting the MockMaster class while restarting the master without this change master class is set through the property HConstants.MASTER_IMPL is not reflecting because masterClass is already loaded.

        default:
        -        throw new IllegalStateException("Received event is not valid.");
        +        break;
               }
        

        While restarting the master regions in RS_ZK_REGION_SPLIT or RS_ZK_REGION_SPLITTING state should not be handled in processRIT but here it is throwing the exception because of this it failed. Here it should not throw any exception.

        Show
        Chinna Rao Lalam added a comment - Thanks for the review Stack. Patch updated with the debug message. - JVMClusterUtil.MasterThread mt = - JVMClusterUtil.createMasterThread(c, - this .masterClass, index); + JVMClusterUtil.MasterThread mt = JVMClusterUtil.createMasterThread(c, + ( Class <? extends HMaster>) c.getClass(HConstants.MASTER_IMPL, HMaster.class), index); this .masterThreads.add(mt); Added the above change for setting the MockMaster class while restarting the master without this change master class is set through the property HConstants.MASTER_IMPL is not reflecting because masterClass is already loaded. default : - throw new IllegalStateException( "Received event is not valid." ); + break ; } While restarting the master regions in RS_ZK_REGION_SPLIT or RS_ZK_REGION_SPLITTING state should not be handled in processRIT but here it is throwing the exception because of this it failed. Here it should not throw any exception.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12525962/HBASE-5806_trunk_2.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 hadoop23. The patch compiles against the hadoop 0.23.x profile.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1794//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1794//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1794//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12525962/HBASE-5806_trunk_2.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 hadoop23. The patch compiles against the hadoop 0.23.x profile. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1794//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1794//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1794//console This message is automatically generated.
        Hide
        stack added a comment -

        On the IllegalStateException, should we make a case that catches the RS_ZK_SPLIT* conditions instead of changing the default so that ANY unexpected condition results in a warn log rather than an IllegalStateException?

        I like your change to make it so master can be mocked.

        Show
        stack added a comment - On the IllegalStateException, should we make a case that catches the RS_ZK_SPLIT* conditions instead of changing the default so that ANY unexpected condition results in a warn log rather than an IllegalStateException? I like your change to make it so master can be mocked.
        Hide
        ramkrishna.s.vasudevan added a comment -

        @Stack
        So can we add cases for RS_ZK_SPLIT* conditions with a log message with debug level. Because this is expected. If we have WARN, it may attract unnecessary attention.
        And have one default case again with a log message with WARN level?

        Show
        ramkrishna.s.vasudevan added a comment - @Stack So can we add cases for RS_ZK_SPLIT* conditions with a log message with debug level. Because this is expected. If we have WARN, it may attract unnecessary attention. And have one default case again with a log message with WARN level?
        Hide
        stack added a comment -

        So can we add cases for RS_ZK_SPLIT* conditions with a log message with debug level. Because this is expected. If we have WARN, it may attract unnecessary attention. And have one default case again with a log message with WARN level?

        Yes. Add cases for RS_ZK_SPLIT* and log at DEBUG level that these callbacks are just passing through (or do not log if this is 'normal' operation). Yes, I agree, logging at WARN level will make users think this an abnormal state which is not what we want them to think.

        I would then restore the default case throwing IllegalStateException rather than log a WARN. Now your change is more focused on addressing the issue found by Chinna. Previous, the patch could catch illegal states that were other than RS_ZK_SPLIT*

        Show
        stack added a comment - So can we add cases for RS_ZK_SPLIT* conditions with a log message with debug level. Because this is expected. If we have WARN, it may attract unnecessary attention. And have one default case again with a log message with WARN level? Yes. Add cases for RS_ZK_SPLIT* and log at DEBUG level that these callbacks are just passing through (or do not log if this is 'normal' operation). Yes, I agree, logging at WARN level will make users think this an abnormal state which is not what we want them to think. I would then restore the default case throwing IllegalStateException rather than log a WARN. Now your change is more focused on addressing the issue found by Chinna. Previous, the patch could catch illegal states that were other than RS_ZK_SPLIT*
        Hide
        Chinna Rao Lalam added a comment -

        @Stack
        I feel adding cases for RS_ZK_SPLIT* and log with DEBUG level is ok.

        But for other states throwing the IllegalStateException will change the behaviour from the previous. Need to check whether it is ok with the other states other than RS_ZK_SPLIT*. Previous because of throwing exception regions in state RS_ZK_SPLIT* was broken so i need to check thoroughly other states.

        Show
        Chinna Rao Lalam added a comment - @Stack I feel adding cases for RS_ZK_SPLIT* and log with DEBUG level is ok. But for other states throwing the IllegalStateException will change the behaviour from the previous. Need to check whether it is ok with the other states other than RS_ZK_SPLIT*. Previous because of throwing exception regions in state RS_ZK_SPLIT* was broken so i need to check thoroughly other states.
        Hide
        stack added a comment -

        But for other states throwing the IllegalStateException will change the behaviour from the previous.

        How is that? Isn't it this patch that changes the IllegalStateException to instead do LOG.warn when we get to the 'default' case so I'd say all you are changing is the handling of RS_ZK_SPLIT* – which is good since that is all this issue is concerned with?

        Good stuff

        Show
        stack added a comment - But for other states throwing the IllegalStateException will change the behaviour from the previous. How is that? Isn't it this patch that changes the IllegalStateException to instead do LOG.warn when we get to the 'default' case so I'd say all you are changing is the handling of RS_ZK_SPLIT* – which is good since that is all this issue is concerned with? Good stuff
        Hide
        Chinna Rao Lalam added a comment -

        Updated the patch with log message(DEBUG level) for RS_ZK_SPLIT* cases and thrown exception for the default case.

        Show
        Chinna Rao Lalam added a comment - Updated the patch with log message(DEBUG level) for RS_ZK_SPLIT* cases and thrown exception for the default case.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12526183/HBASE-5806_trunk_3.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 hadoop23. The patch compiles against the hadoop 0.23.x profile.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.regionserver.TestStore
        org.apache.hadoop.hbase.TestDrainingServer
        org.apache.hadoop.hbase.master.TestSplitLogManager

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1818//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1818//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1818//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12526183/HBASE-5806_trunk_3.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 hadoop23. The patch compiles against the hadoop 0.23.x profile. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.regionserver.TestStore org.apache.hadoop.hbase.TestDrainingServer org.apache.hadoop.hbase.master.TestSplitLogManager Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1818//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1818//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1818//console This message is automatically generated.
        Hide
        ramkrishna.s.vasudevan added a comment -

        Pls provide your comments on the latest patch.

        Show
        ramkrishna.s.vasudevan added a comment - Pls provide your comments on the latest patch.
        Hide
        Ted Yu added a comment -

        The 3 tests reported by Hadoop QA passed locally with latest patch.

        +1 from me.

        Show
        Ted Yu added a comment - The 3 tests reported by Hadoop QA passed locally with latest patch. +1 from me.
        Hide
        ramkrishna.s.vasudevan added a comment -

        Thanks Ted.
        @Stack
        Your comments on the patch?

        Show
        ramkrishna.s.vasudevan added a comment - Thanks Ted. @Stack Your comments on the patch?
        Hide
        stack added a comment -

        +1 on on v3.

        For the next time, not meant for this patch, Chinna, you could write the below...

        +      case RS_ZK_REGION_SPLITTING:
        +        LOG.debug("Processed region in state : " + et);
        +        break;
        +      case RS_ZK_REGION_SPLIT:
        +        LOG.debug("Processed region in state : " + et);
        +        break;
        

        as

        +      case RS_ZK_REGION_SPLITTING:
        +      case RS_ZK_REGION_SPLIT:
        +        LOG.debug("Processed region in state : " + et);
        +        break;
        

        ..but lets get v3 in. Good stuff.

        Show
        stack added a comment - +1 on on v3. For the next time, not meant for this patch, Chinna, you could write the below... + case RS_ZK_REGION_SPLITTING: + LOG.debug( "Processed region in state : " + et); + break ; + case RS_ZK_REGION_SPLIT: + LOG.debug( "Processed region in state : " + et); + break ; as + case RS_ZK_REGION_SPLITTING: + case RS_ZK_REGION_SPLIT: + LOG.debug( "Processed region in state : " + et); + break ; ..but lets get v3 in. Good stuff.
        Hide
        ramkrishna.s.vasudevan added a comment -

        Thanks Stack. I will commit it tonight. (India time).

        Show
        ramkrishna.s.vasudevan added a comment - Thanks Stack. I will commit it tonight. (India time).
        Hide
        ramkrishna.s.vasudevan added a comment -

        Committed to trunk, 09.94 and 0.92.
        Thanks for the patch Chinna.
        Thanks for the review Stack and Ted.

        Show
        ramkrishna.s.vasudevan added a comment - Committed to trunk, 09.94 and 0.92. Thanks for the patch Chinna. Thanks for the review Stack and Ted.
        Hide
        Hudson added a comment -

        Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #4 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/4/)
        HBASE-5806 Handle split region related failures on master restart and RS restart (Chinna Rao) (Revision 1338325)

        Result = FAILURE
        ramkrishna :
        Files :

        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/LocalHBaseCluster.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
        • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransactionOnCluster.java
        Show
        Hudson added a comment - Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #4 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/4/ ) HBASE-5806 Handle split region related failures on master restart and RS restart (Chinna Rao) (Revision 1338325) Result = FAILURE ramkrishna : Files : /hbase/trunk/src/main/java/org/apache/hadoop/hbase/LocalHBaseCluster.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransactionOnCluster.java
        Hide
        Hudson added a comment -

        Integrated in HBase-0.92-security #107 (See https://builds.apache.org/job/HBase-0.92-security/107/)
        HBASE-5806 Handle split region related failures on master restart and RS restart(Chinna rao) (Revision 1338335)

        Result = FAILURE
        ramkrishna :
        Files :

        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/LocalHBaseCluster.java
        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
        • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransactionOnCluster.java
        Show
        Hudson added a comment - Integrated in HBase-0.92-security #107 (See https://builds.apache.org/job/HBase-0.92-security/107/ ) HBASE-5806 Handle split region related failures on master restart and RS restart(Chinna rao) (Revision 1338335) Result = FAILURE ramkrishna : Files : /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/LocalHBaseCluster.java /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/HMaster.java /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransactionOnCluster.java
        Hide
        Hudson added a comment -

        Integrated in HBase-0.94-security #27 (See https://builds.apache.org/job/HBase-0.94-security/27/)
        HBASE-5806 Handle split region related failures on master restart and RS restart(Chinna rao) (Revision 1338331)

        Result = SUCCESS
        ramkrishna :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/LocalHBaseCluster.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransactionOnCluster.java
        Show
        Hudson added a comment - Integrated in HBase-0.94-security #27 (See https://builds.apache.org/job/HBase-0.94-security/27/ ) HBASE-5806 Handle split region related failures on master restart and RS restart(Chinna rao) (Revision 1338331) Result = SUCCESS ramkrishna : Files : /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/LocalHBaseCluster.java /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/HMaster.java /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransactionOnCluster.java

          People

          • Assignee:
            Chinna Rao Lalam
            Reporter:
            ramkrishna.s.vasudevan
          • Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development