HBASE-6060

Regions in OPENING state from failed regionservers take a long time to recover

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.95.0
    • Component/s: master, regionserver
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      We have seen a pattern in tests where regions are stuck in the OPENING state for a very long time when the region server that is opening the region fails. My understanding of the process:

      • The master calls the rs to open the region. If the rs is offline, a new plan is generated (a new rs is chosen). RegionState is set to PENDING_OPEN (only in master memory; zk still shows OFFLINE). See HRegionServer.openRegion() and HMaster.assign().
      • The RegionServer starts opening the region and changes the state in the znode. But that znode is not ephemeral (see ZkAssign).
      • The rs transitions the zk node from OFFLINE to OPENING. See OpenRegionHandler.process().
      • The rs then opens the region and changes the znode from OPENING to OPENED.
      • When the rs is killed between the OPENING and OPENED states, zk shows the OPENING state and the master just waits for the rs to change the region state; but since the rs is down, that won't happen.
      • There is an AssignmentManager.TimeoutMonitor, which guards against exactly these kinds of conditions. It periodically checks (every 10 sec by default) the regions in transition to see whether they have timed out (hbase.master.assignment.timeoutmonitor.timeout). The default timeout is 30 min, which explains what you and I are seeing.
      • The ServerShutdownHandler in the master does not reassign regions in the OPENING state, although it handles other states.
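The failure window described above can be sketched as a simple timeout check. This is a minimal standalone sketch, not the actual TimeoutMonitor code; the class and method names (`RitTimeoutSketch`, `timedOut`) are made up, and only the 10-second period and 30-minute default are taken from the description above.

```java
// Minimal sketch of the RIT timeout check described above. RitTimeoutSketch
// and timedOut() are hypothetical names, not the real HBase implementation.
public class RitTimeoutSketch {
    // Defaults mentioned in the description.
    static final long CHECK_PERIOD_MS = 10_000L;              // monitor runs every 10 sec
    static final long DEFAULT_TIMEOUT_MS = 30L * 60L * 1000L; // hbase.master.assignment.timeoutmonitor.timeout

    // A region in transition is considered timed out once it has sat in the
    // same state (e.g. OPENING) longer than the configured timeout.
    static boolean timedOut(long stateTimestampMs, long nowMs, long timeoutMs) {
        return nowMs - stateTimestampMs > timeoutMs;
    }

    public static void main(String[] args) {
        long enteredOpening = 0L;
        // Just under the 30-minute default: the region stays stuck in OPENING.
        System.out.println(timedOut(enteredOpening, DEFAULT_TIMEOUT_MS - 1, DEFAULT_TIMEOUT_MS)); // false
        // Past the default: only now would the monitor reassign the region.
        System.out.println(timedOut(enteredOpening, DEFAULT_TIMEOUT_MS + 1, DEFAULT_TIMEOUT_MS)); // true
    }
}
```

The issue is the gap this makes visible: with a dead rs, nothing advances the znode past OPENING, so recovery waits the full 30 minutes before the timeout ever fires.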

      Lowering that threshold from the configuration is one option, but I still think we can do better.
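For reference, that threshold could be lowered in hbase-site.xml; the property name is the one cited above, while the 3-minute value is only an illustrative example, not a recommendation:

```xml
<!-- hbase-site.xml: lower the RIT timeout from the 30-minute default.
     180000 ms (3 minutes) is an example value only. -->
<property>
  <name>hbase.master.assignment.timeoutmonitor.timeout</name>
  <value>180000</value>
</property>
```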

      Will investigate more.

      1. 6060_alternative_suggestion.txt
        2 kB
        stack
      2. 6060_suggestion_based_off_v3.patch
        30 kB
        stack
      3. 6060_suggestion_toassign_rs_wentdown_beforerequest.patch
        10 kB
        rajeshbabu
      4. 6060_suggestion2_based_off_v3.patch
        30 kB
        stack
      5. 6060-94-v3.patch
        27 kB
        Ted Yu
      6. 6060-94-v4_1.patch
        28 kB
        ramkrishna.s.vasudevan
      7. 6060-94-v4_1.patch
        28 kB
        ramkrishna.s.vasudevan
      8. 6060-94-v4.patch
        28 kB
        Ted Yu
      9. 6060-trunk_2.patch
        27 kB
        rajeshbabu
      10. 6060-trunk_3.patch
        26 kB
        rajeshbabu
      11. 6060-trunk.patch
        25 kB
        ramkrishna.s.vasudevan
      12. 6060-trunk.patch
        25 kB
        ramkrishna.s.vasudevan
      13. HBASE-6060_latest.patch
        59 kB
        ramkrishna.s.vasudevan
      14. HBASE-6060_latest.patch
        59 kB
        ramkrishna.s.vasudevan
      15. HBASE-6060_latest.patch
        57 kB
        ramkrishna.s.vasudevan
      16. HBASE-6060_trunk_5.patch
        22 kB
        rajeshbabu
      17. HBASE-6060-92.patch
        14 kB
        rajeshbabu
      18. HBASE-6060-94.patch
        26 kB
        rajeshbabu
      19. HBASE-6060-trunk_4.patch
        14 kB
        rajeshbabu
      20. trunk-6060_v2.patch
        7 kB
        Jimmy Xiang
      21. trunk-6060_v3.3.patch
        22 kB
        Jimmy Xiang
      22. trunk-6060.patch
        4 kB
        Jimmy Xiang


          Activity

          Andrew Purtell added a comment -

          The TimeoutMonitor timeout was increased to 30 minutes in HBASE-4126.

          Enis Soztutar added a comment -

          Thanks Andrew for the pointer. Agreed that lowering the timeout can have deeper impacts. We should fix the issue properly instead.

          ramkrishna.s.vasudevan added a comment -

          @Enis
          We are working on a patch for that. HBASE-5396 tries to address a similar issue, but it is not in trunk; it is only in 0.90, and it caused HBASE-5816 there.
          We are trying to work on a better patch, and will come up with one in a day or two.

          ramkrishna.s.vasudevan added a comment - - edited

          Today we will come up with a patch for this. We are already working on HBASE-5396 and HBASE-5816.

          Enis Soztutar added a comment -

          @Ramkrishna, that is great. I have also noticed regions in CLOSING to stay in RIT as well, and strangely enough, showing the master as their assigned server. Do you think that it can be related?

          ramkrishna.s.vasudevan added a comment -

          @Enis

          I have also noticed regions in CLOSING to stay in RIT as well,

          I have not specifically seen this. We will check this also.
          Pls take a look at HBASE-6016 and HBASE-5927, just in case you feel they are related.

          rajeshbabu added a comment - - edited

          This patch is also a backport of HBASE-5396, but it is more exhaustive and also tries to address HBASE-5816.
          HBASE-6147 has been raised to solve other assign-related issues that come from SSH and joincluster. Pls review and provide your comments.

          Show
          rajeshbabu added a comment - - edited This patch is also a backport of HBASE-5396 . But this is more exhaustive and also tries to address HBASE-5816 . HBASE-6147 has been raised to solve other assign related issues that comes from SSH and joincluster. Pls review and provide your comments.
          Hide
          Ted Yu added a comment -
          +      if(!plan.canUsePlan()){
          +        return;
          

          Please insert a space after if. It would be helpful to add LOG.debug() before returning.

          +  public void usePlan(boolean usePlan) {
          +    this.usePlan = usePlan;
          +  }
          

          I would name the boolean 'usable'. The setter can be named setUsable().
          A bigger question is:

          +          if (newPlan) {
          +            randomPlan.usePlan(false);
          +            this.regionPlans.remove(randomPlan.getRegionName());
          +          } else {
          +            existingPlan.usePlan(false);
          +            this.regionPlans.remove(existingPlan.getRegionName());
          +          }
          

          why cannot we return null above so that we don't need to add the boolean member to RegionPlan ?
          At least we shouldn't return an unusable randomPlan.

          Ted Yu added a comment -

          I ran the tests in TestAssignmentManager and they passed.

               synchronized (this.regionPlans) {
          +      regionsOnDeadServer = new RegionsOnDeadServer();
          +      regionsFromRegionPlansForServer = new ConcurrentSkipListSet<HRegionInfo>();
          +      this.deadServerRegionsFromRegionPlan.put(sn, regionsOnDeadServer);
          

          Can the first two assignments be placed outside synchronized block ?
          Before making the deadServerRegionsFromRegionPlan.put() call, I think we should check that sn isn't currently in deadServerRegionsFromRegionPlan.
          For isRegionOnline(HRegionInfo hri):

          +        return true;
          +      } else {
          +        // Remove the assignment mapping for sn.
          +        Set<HRegionInfo> hriSet = this.servers.get(sn);
          +        if (hriSet != null) {
          +          hriSet.remove(hri);
          +        }
          

          The else keyword isn't needed.
          What if hriSet contains other regions apart from hri, should they be removed as well ?

          ramkrishna.s.vasudevan added a comment -

          @Ted

          why cannot we return null above so that we don't need to add the boolean member to RegionPlan ?

          The reason here is that in the case of a null region plan we have different behaviour: we consider it a case where we don't have any live RS, and hence set a flag so that the timeout monitor can be skipped. Hence we need to differentiate the null behaviour from this one.

          if (plan == null) {
                  LOG.debug("Unable to determine a plan to assign " + state);
                  this.timeoutMonitor.setAllRegionServersOffline(true);
                  return; // Should get reassigned later when RIT times out.
                }
          

          I was not sure which name to give usePlan. Maybe 'usable' is better.

          Before making the deadServerRegionsFromRegionPlan.put() call
          

          I think since SSH is called per server it should be ok.
          I will check on other comments before changing it. Thanks for your detailed review.

          Ted Yu added a comment -

          Thanks for working on this issue.

          I will review the next version in more detail

          Ted Yu added a comment -

          Thinking more about the usable RegionPlan flag, we don't really need it.
          We can introduce an 'unusable' RegionPlan singleton which signifies the fact that it is not to be used.
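The sentinel idea could look roughly like the following sketch. Plan here is a stand-in class, not the real org.apache.hadoop.hbase.master.RegionPlan, and shouldAssign is a made-up caller-side check; this illustrates the proposal, not the committed code.

```java
// Illustrative sketch of the sentinel proposal: shared "unusable" and
// "no servers" instances that callers compare by reference, instead of
// carrying a boolean flag on every plan. Plan stands in for RegionPlan.
public class PlanSentinelSketch {
    static class Plan {
        final String regionName;
        Plan(String regionName) { this.regionName = regionName; }
    }

    // Singletons signifying special outcomes of plan computation.
    static final Plan UNUSABLE_PLAN = new Plan("<unusable>");
    static final Plan NO_SERVERS_TO_ASSIGN = new Plan("<no-servers>");

    // A caller checks identity rather than a flag, and can still tell the
    // "plan was invalidated" case apart from the "no live RS" case.
    static boolean shouldAssign(Plan plan) {
        return plan != null && plan != UNUSABLE_PLAN && plan != NO_SERVERS_TO_ASSIGN;
    }

    public static void main(String[] args) {
        System.out.println(shouldAssign(new Plan("region-1"))); // true
        System.out.println(shouldAssign(UNUSABLE_PLAN));        // false
        System.out.println(shouldAssign(NO_SERVERS_TO_ASSIGN)); // false
    }
}
```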

          Ted Yu added a comment -

          Patch v3 illustrates my proposal.

          I also created a singleton for the null RegionPlan that signifies there is no server to assign region.

          TestAssignmentManager passes.

          rajeshbabu added a comment -

          @Ted
          One small change in the v3 patch:

                RegionPlan plan = getRegionPlan(state, forceNewPlan);
                if (plan == null) {
                  LOG.debug("Unable to determine a plan to assign " + state);
                  this.timeoutMonitor.setAllRegionServersOffline(true);
                  return; // Should get reassigned later when RIT times out.
                }
          
          

          In this place also, instead of the null check, we need to use

          plan == RegionPlan.NO_SERVERS_TO_ASSIGN
          

          Thanks.

          Ted Yu added a comment -

          Patch v4 addresses Rajesh's comment and some of my own comments.

          TestAssignmentManager passes.

          Running test suite.

          ramkrishna.s.vasudevan added a comment -

          +1 on v4. Hope test suite passes.

          Ted Yu added a comment -

          Test suite passed.

          ramkrishna.s.vasudevan added a comment -

          Attached patches for trunk and 0.94. In 0.94 I just added one info log on top of the previous patch. Will submit to hadoopQA.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12530682/6060-trunk.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to cause Findbugs (version 1.3.9) to fail.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.client.TestMultiParallel
          org.apache.hadoop.hbase.replication.TestReplication
          org.apache.hadoop.hbase.regionserver.TestHRegionOnCluster
          org.apache.hadoop.hbase.master.TestRollingRestart

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2091//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2091//console

          This message is automatically generated.

          ramkrishna.s.vasudevan added a comment -

          Updated patch.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12530688/6060-trunk.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to cause Findbugs (version 1.3.9) to fail.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.io.hfile.TestForceCacheImportantBlocks
          org.apache.hadoop.hbase.regionserver.TestServerCustomProtocol

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2092//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2092//console

          This message is automatically generated.

          Ted Yu added a comment -

          When I tried to produce a patch for the 0.92 branch, I found that HBASE-5546 wasn't integrated into 0.92.

          @Ram:
          Do you think HBASE-5546 and this JIRA should go to 0.92 ?

          rajeshbabu added a comment -

          @Ram
          Few changes in TestAssignmentManager.
          ->In the mocked assignment manager we are overriding the assign method, and if it is invoked we set assignInvoked to true.
          But in the bulk assign case, if the list of regions to be assigned is empty then we should not set assignInvoked to true, because in SSH we always call assign even if regions is empty.

              @Override
              public void assign(java.util.List<HRegionInfo> regions, java.util.List<ServerName> servers) 
              {
                if (!regions.isEmpty()) {
                  assignInvoked = true;
                } else {
                  assignInvoked = false;
                }
              };
          

          ->processServerShutdownHandler should behave based on the servername we are passing; changed as in the 94 patch.

              ServerShutdownHandler handler = null;
              if (sn != null) {
                handler = new ServerShutdownHandler(this.server, services, deadServers, sn, false);
              } else {
                handler = new ServerShutdownHandler(this.server, services, deadServers, SERVERNAME_A, false);
              }
          

          Uploaded patch for trunk with these changes.
          TestAssignmentManager passes.

          ramkrishna.s.vasudevan added a comment -

          @Ted
          I feel this JIRA is needed for 0.92 as it is related to the recovery time. But HBASE-5546 may not be so important; maybe if someone needs it they can just backport it, I feel.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12530693/6060-trunk_2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to cause Findbugs (version 1.3.9) to fail.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.regionserver.TestAtomicOperation
          org.apache.hadoop.hbase.master.TestSplitLogManager

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2093//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2093//console

          This message is automatically generated.

          Ted Yu added a comment -

          @Ram:
          That is fine.

          @Rajesh:
          Can you prepare patch for 0.92 (and run test suite) ? In the future, publishing trunk patch on review board would allow reviewers to easily spot the differences across patches.

          rajeshbabu added a comment -

          @Ted

          Ok. I will prepare and run test suite.

          Ted Yu added a comment -

          I ran the two tests flagged by Hadoop QA and they passed.

          ramkrishna.s.vasudevan added a comment -

          @Rajesh
          Thanks for that testcase change. I forgot to do that; I remember we discussed that change internally.

          ramkrishna.s.vasudevan added a comment -

          @Ted
          In trunk, if we take up this patch we need to solve HBASE-6012; without that the master will get aborted. But I feel the current patch in HBASE-6012 needs more testing.

          ramkrishna.s.vasudevan added a comment -

          I also feel that if this patch is fine we can commit it to trunk as well, and work on HBASE-6012 and HBASE-6147 to solve the scenarios mentioned in the above comment.

          Ted Yu added a comment -

          Agreed.
          We can integrate once patch for 0.92 is ready.

          rajeshbabu added a comment -

          Patch for 0.92. Running test suite.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12530816/HBASE-6060-92.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2098//console

          This message is automatically generated.

          stack added a comment -

          Lads, signaling using plan types seems perverse. Why do we need UNUSABLE_PLAN? How is that different from the absence of a plan? And why do we need NO_SERVERS_TO_ASSIGN? Can't we just check for there being servers before we assign?

          Do we have to make the new internal data structure accessible to others publicly? I'm talking about the method 'getDeadServerRegionsFromRegionPlan'.

          ...

          More later.

          stack added a comment -

          I see the question of the difference between a null plan and UNUSABLE_PLAN was raised already. A null plan means no servers, apparently. Does that have to be so? That doesn't seem right either.

          Nice tests.

          rajeshbabu added a comment -

          @stack
          Thanks for review.

          Do we have to make the new internal data structure accessible to others publically? I'm talking about the method: 'getDeadServerRegionsFromRegionPlan'.

          deadServerRegionsFromRegionPlan is not accessible to SSH.
          I made getDeadServerRegionsFromRegionPlan public to clear the entries for a server from deadServerRegionsFromRegionPlan after SSH finishes processing:

              } finally {
                if (regionsFromRegionPlansForServer != null) {
                  regionsFromRegionPlansForServer.clear();
                }
                if (regionsInTransition != null) {
                  regionsInTransition.clear();
                }
                this.services.getAssignmentManager().getDeadServerRegionsFromRegionPlan()
                    .remove(this.serverName);
                this.deadServers.finish(serverName);
              }
          
          

          Please correct me if I'm wrong.

          rajeshbabu added a comment -

          @Ted
          Test suite passes for 0.92.

          stack added a comment -

          Rajeshbabu, if you are just doing clear and remove, then maybe expose a clear and a remove only, rather than give out the Map for anyone to do anything to. Also, any chance of reworking this patch so we don't have 'special' region plans to signal different processing? Thanks.
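
          The encapsulation being asked for here could look roughly like the following minimal sketch (class and method names are illustrative stand-ins, not the actual HBase code): keep the map private and expose only the narrow remove/record operations SSH actually needs.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class AssignmentManagerSketch {
  // internal bookkeeping; never handed out directly
  private final Map<String, Set<String>> deadServerRegionsFromRegionPlan =
      new ConcurrentHashMap<>();

  // record the regions whose plans targeted a server that died
  void recordDeadServerRegions(String serverName, Set<String> regions) {
    deadServerRegionsFromRegionPlan.put(serverName, regions);
  }

  // narrow accessor: SSH calls this after it finishes processing a server
  void removeDeadServerRegions(String serverName) {
    deadServerRegionsFromRegionPlan.remove(serverName);
  }

  boolean hasDeadServerRegions(String serverName) {
    return deadServerRegionsFromRegionPlan.containsKey(serverName);
  }
}
```

          With this shape, ServerShutdownHandler's finally block would call removeDeadServerRegions(serverName) rather than mutating a Map it was handed.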

          ramkrishna.s.vasudevan added a comment -

          Also, any chance of reworking this patch so we don't have 'special' regionplans to signal different processing

          NO_SERVERS_TO_ASSIGN - I think we can remove this and use null as it was previously.
          But to signify whether to use the plan or not, it's better we use some region plan, maybe REGION_PLAN_INUSE. The reason is that if we work on HBASE-6147 in the future, such states can be useful. (This is what we feel currently, but we have not started working on that.)
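
          The sentinel-instance idea under discussion can be sketched as follows (a hypothetical stand-in, not the real RegionPlan class): one shared "special" plan whose identity distinguishes "a plan exists but must not be used" from null ("no plan at all").

```java
class RegionPlanSketch {
  // singleton sentinel: identity, not contents, carries the meaning
  static final RegionPlanSketch REGION_PLAN_INUSE = new RegionPlanSketch(null);

  private final String destination;

  RegionPlanSketch(String destination) {
    this.destination = destination;
  }

  // identity comparison works because the sentinel is a shared singleton
  boolean isUsable() {
    return this != REGION_PLAN_INUSE;
  }

  String getDestination() {
    return destination;
  }
}
```

          A caller can then distinguish three cases: null (no plan), the sentinel (plan exists but is claimed elsewhere), and an ordinary usable plan.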

          rajeshbabu added a comment -

          @Stack

          then maybe expose a clear and remove only rather than give out the Map

          Done

          NO_SERVERS_TO_ASSIGN has been removed. But we have one case, REGION_PLAN_ALREADY_INUSE, to differentiate null from the special case. It may be better to have it if in the future we try to solve double assignment scenarios w.r.t. SSH and AM. Please provide your comments. Is there any other way you feel would be better?
          If you are ok, I can rebase the patches for 0.92 and 0.94.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12530962/6060-trunk_3.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 3 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.master.TestSplitLogManager
          org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2107//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2107//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2107//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2107//console

          This message is automatically generated.

          Ted Yu added a comment -
          +          if (newPlan) {
          +            this.regionPlans.remove(randomPlan.getRegionName());
          +            LOG
          +            .info("Server shutdown handler already in progress for the region "
          +                + randomPlan.getRegionName());
          +            randomPlan = RegionPlan.REGION_PLAN_ALREADY_INUSE;
          

          It would be confusing to label a new plan 'ALREADY_INUSE'.

          +  // the following singleton signifies that the plan is not usable
          +  static final RegionPlan REGION_PLAN_ALREADY_INUSE = new RegionPlan(null, null, null);
          

          I think UNUSABLE_REGION_PLAN would be a better name.

          stack added a comment -

          First, thanks for working on this awkward issue. It's a good find and needs to be fixed.

          Thanks too for removing NO_SERVERS_TO_ASSIGN.

          I'm trying to get clear on why we need the REGION_PLAN_INUSE at all. The idea is that the call to getRegionPlan needs to return something which says "do not assign this region just yet" and somehow, returning null is not enough because that means something else? It means that there are no servers available to assign the region to? Why can we not broaden what null means so it means no servers or don't assign just yet?

          FYI, this will confuse: '+ if (this == REGION_PLAN_ALREADY_INUSE) return "";'. It should probably print out something here. An empty string will make people think the toString doesn't work.
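
          The toString fix being suggested could look like this sketch (an illustrative class, not the real RegionPlan): print a recognizable token for the sentinel instead of an empty string.

```java
class PlanToStringSketch {
  // singleton sentinel plan, identified by reference equality
  static final PlanToStringSketch UNUSABLE_REGION_PLAN =
      new PlanToStringSketch("ignored");

  private final String destination;

  PlanToStringSketch(String destination) {
    this.destination = destination;
  }

  @Override
  public String toString() {
    if (this == UNUSABLE_REGION_PLAN) {
      // visible and greppable in logs, unlike returning ""
      return "UNUSABLE_REGION_PLAN";
    }
    return "plan to " + destination;
  }
}
```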

          ramkrishna.s.vasudevan added a comment -
          final String encodedName = state.getRegion().getEncodedName();
              final List<ServerName> destServers =
                serverManager.createDestinationServersList(serverToExclude);
          
              if (destServers.isEmpty()){
                LOG.warn("Can't move the region " + encodedName +
                  ", there is no destination server available.");
                return null;
              }
          

          I thought we could extract this out before calling getRegionPlan, so that if we get a null we know there are no servers to assign.
          After this we should use destServers and pass it to getRegionPlan. Then if getRegionPlan returns null we are sure it means don't assign the region yet.
          But now anyone who needs to use getRegionPlan must first make the call above, check the result, and then call getRegionPlan passing destServers (which again can be null).
          If you see the patch attached earlier, we tried to use a boolean var in RegionPlan to know whether we can use the plan or not.
          Can we make the destination null if we want to say don't assign the region yet?
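
          The proposed split of responsibilities can be sketched like this (names are illustrative, not the real AssignmentManager API): the caller computes the candidate server list first, so a null return from getRegionPlan unambiguously means "do not assign yet" rather than "no servers available".

```java
import java.util.List;
import java.util.Set;

class PlanLookupSketch {
  // precondition: caller has already checked destServers is non-empty
  static String getRegionPlan(String region, List<String> destServers,
      Set<String> regionsOwnedBySsh) {
    if (destServers == null || destServers.isEmpty()) {
      throw new IllegalStateException(
          "caller must check for destination servers before calling");
    }
    if (regionsOwnedBySsh.contains(region)) {
      // the only remaining meaning of null: don't assign this region yet
      return null;
    }
    return destServers.get(0); // trivial stand-in for real plan selection
  }
}
```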

          stack added a comment -

          Why move this out of getRegionPlan? Don't we want getRegionPlan to return actionable plans? If there are no servers to assign, then a plan is not possible, so just return null? Why is it important to distinguish the case of no servers? Because we do not have a mechanism for parking regions to assign until a server comes up? That seems like something we need? Can the region hang out in zk and be retried? That's the timeout monitor's job? But we don't want the timeout monitor and SSH racing against each other? Can't we use zk znode sequence numbers to prevent this? Just throwing out some ideas, Ram.

          chunhui shen added a comment -

          Could we fix the problem as follows?
          AssignmentManager#processServerShutdown

          public List<RegionState> processServerShutdown(final ServerName sn){
          ...
          
          synchronized (regionsInTransition) {
                for (RegionState regionState : this.regionsInTransition.values()) {
                  if (deadRegions.remove(regionState.getRegion())) {
                    rits.add(regionState);
                  }
                  // make this region in RIT time out at once
                  if (sn.equals(regionState.getServerName())
                      && (regionState.isOpening() || regionState.isPendingOpen())) {
                    LOG.debug(sn + " is already dead and make rs=" + regionState
                        + " time out in RIT");
                    regionState.updateTimestamp(0L);
                  }
                }
              }
              return rits;
          }
          
          

          I have only read the description; correct me if I'm wrong, thanks.

          stack added a comment -

          @Chunhui That is inventive. It has merit in that we only do the processing in the one place and no new special states are introduced. Does it work?

          ramkrishna.s.vasudevan added a comment -

          @Stack and @Chunhui
          As a whole, if you see this patch it is a rebase of HBASE-5396. If you look at HBASE-5396 (now in 0.90) there are no special states. But the fix there caused HBASE-5816.
          Now Chunhui's suggestion will make the timeout work, but the problem in HBASE-5816 will easily occur, i.e. two assignments will go in parallel. And every time you hit this problem there is a high chance the master will abort.

              if (!hijack && !state.isClosed() && !state.isOffline()) {
                String msg = "Unexpected state : " + state + " .. Cannot transit it to OFFLINE.";
                this.master.abort(msg, new IllegalStateException(msg));
                return -1;
              }
          

          Why does this double assignment happen? Because
          in the assign() code there is a retry mechanism if the first assign fails.

          LOG.warn("Failed assignment of " +
                    state.getRegion().getRegionNameAsString() + " to " +
                    plan.getDestination() + ", trying to assign elsewhere instead; " +
                    "retry=" + i, t);
                  // Clean out plan we failed execute and one that doesn't look like it'll
                  // succeed anyways; we need a new plan!
                  // Transition back to OFFLINE
                  state.update(RegionState.State.OFFLINE);
                  // Force a new plan and reassign.  Will return null if no servers.
                  if (getRegionPlan(state, plan.getDestination(), true) == null) {
                    this.timeoutMonitor.setAllRegionServersOffline(true);
                    LOG.warn("Unable to find a viable location to assign region " +
                      state.getRegion().getRegionNameAsString());
                    return;
                  }
          

          Now because of the retry we start an assignment, and SSH will also try to do an assignment (either with the solution in the patch or with Chunhui's suggestion). This will cause the master abort to happen. The same was reported in HBASE-5816 by Maryann. This is the reason we were trying to address both problems as part of this patch with some workarounds. We also try to use such a mechanism so that in the future we can avoid double assignments.

          Can't we use zk znode sequence numbers to prevent this

          I am not very clear on this. I think before setting a node to OFFLINE we already have code checking the version number of the znode inside createOrForceNodeOffline().

          stack added a comment -

          @Ram I looked at Chunhui's suggestion and read it as a fix in SSH that avoided our doing assigns in SSH, punting instead to the timeout monitor and letting it assign. I now see after your explanation above that this is a change in AM#processServerShutdown. Pardon me. Thanks for the reminder on what's going on here and also for the reminder on hbase-5816.

          The issue in hbase-5816 seems to be the one to fix first. Making the timeout monitor run more promptly will always make us bump up against hbase-5816 (our current config just mitigates how many threads can run at the one time manipulating regions. It's not a fix by any means).

          Now because of the retry we start an assignment and SSH also will try to do an assignement(Either with the soln in patch or with Chunhui's suggestion). Now this will cause the master abort to happen.

          If we are to continue to have multiple threads doing assign, then we need to have each claim ownership of the region first by writing zk and then verifying the seqid each time they move the region state. If they fail to grab the znode, then someone else is operating on it and they should give up.
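
          The ownership pattern described here can be illustrated with an in-memory stand-in (this sketch is not ZooKeeper itself; it mimics the compare-and-set behavior of ZooKeeper's versioned setData): each actor records the version it read, and every transition supplies that expected version; a mismatch means another thread took over and the caller should give up.

```java
class VersionedZnodeSketch {
  private int version = 0;
  private String state = "OFFLINE";

  synchronized int readVersion() {
    return version;
  }

  // compare-and-set transition; fails if the caller's view is stale
  synchronized boolean transition(String newState, int expectedVersion) {
    if (version != expectedVersion) {
      return false; // someone else moved the region; give up
    }
    state = newState;
    version++;
    return true;
  }

  synchronized String state() {
    return state;
  }
}
```

          A thread that loses the race sees transition() return false and backs off instead of aborting the master with an unexpected-state error.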

          Are you thinking that the patches posted earlier address this issue and hbase-5816?

          ramkrishna.s.vasudevan added a comment -

          Are you thinking that the patches posted earlier address this issue and hbase-5816?

          Yes. It addresses HBASE-5396, HBASE-6060 and HBASE-5816.

          If we are to continue to have multiple threads doing assign, then we need to have each claim ownership of the region first by writing zk and then verifying the seqid each time they move the region state. If they fail to grab the znode, then someone else is operating on it and they should give up.

          This is already happening as per the current code. Before the master tries to write the data on the znode there is a version check. Even when the RS tries to update the znode there is a version check. But on top of this there are also some in-memory states in the master which are being updated by various threads. That can also lead to some abnormalities, like

          if (!hijack && !state.isClosed() && !state.isOffline()) {
                String msg = "Unexpected state : " + state + " .. Cannot transit it to OFFLINE.";
                this.master.abort(msg, new IllegalStateException(msg));
                return -1;
              }
          

          leading to issues like HBASE-5816. One more thing: if you look at the patches attached here, the logic manipulating the region plan happens inside the synchronized(regionPlans) block, which helps ensure only one assign thread succeeds, either AM or SSH.

          stack added a comment -

          On the znode check, yes, the rs is doing the checks before it does any transition. This is not the case in the master when SSH or TimeoutMonitor assume they own a region (else we would not be running into IllegalStateExceptions).

          Agree, the updating of in-memory states also needs review.

          Synchronizing on regionPlans while making a change will shut out other threads while the region plan is being manipulated, but after we leave the synchronization, is having a region plan named REGION_PLAN_ALREADY_INUSE the only way we have of blocking another thread trying to operate on a region? Can we not check RIT, or can we not look at the RegionPlan to see if it's in the process of being assigned (we could add another field to RegionPlan with where it is in the assignment process?)
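
          The alternative floated here, a plan that carries its own assignment phase instead of being compared against a sentinel instance, might look like this sketch (hypothetical names, not the real RegionPlan):

```java
class PhasedRegionPlanSketch {
  enum Phase { PENDING, IN_PROGRESS, DONE }

  private Phase phase = Phase.PENDING;

  // check-and-act under the lock: only the first caller wins the
  // right to execute this plan; losers simply back off
  synchronized boolean tryStartAssign() {
    if (phase != Phase.PENDING) {
      return false;
    }
    phase = Phase.IN_PROGRESS;
    return true;
  }

  synchronized void finish() {
    phase = Phase.DONE;
  }

  synchronized Phase phase() {
    return phase;
  }
}
```

          With a phase field, AM and SSH would each call tryStartAssign() and only one of them would proceed, without any special shared plan instance.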

          I'll re-review your last patch in the morning with the above in mind. Thanks for bringing me along.

          ramkrishna.s.vasudevan added a comment -

          Thanks Stack.
          Just to give more details so that you can review with these points also in mind.
          When SSH starts processing

                  this.services.getAssignmentManager().
                    processServerShutdown(this.serverName);
          
               synchronized (this.regionPlans) {
          +      this.deadServerRegionsFromRegionPlan.put(sn, regionsOnDeadServer);
                 for (Iterator <Map.Entry<String, RegionPlan>> i =
                     this.regionPlans.entrySet().iterator(); i.hasNext();) {
                   Map.Entry<String, RegionPlan> e = i.next();
          @@ -3210,9 +3265,11 @@
                   // The name will be null if the region is planned for a random assign.
                   if (otherSn != null && otherSn.equals(sn)) {
                     // Use iterator's remove else we'll get CME
          +          regionsFromRegionPlansForServer.add(e.getValue().getRegionInfo());
                     i.remove();
                   }
                 }
          +      regionsOnDeadServer.setRegionsFromRegionPlansForServer(regionsFromRegionPlansForServer);
          

          If SSH first sees that the region plan's destination was the server that went down, as per the patch we add the region plan into the new data structure.

          At this time getRegionPlan will be blocked, as it is synchronized.
          Now, when getRegionPlan proceeds after this step, we again synchronize on regionPlans

          synchronized (this.regionPlans) {
          .....
          .....
          +      if (serverToExclude != null) {
          +        RegionsOnDeadServer regionsOnDeadServer = this.deadServerRegionsFromRegionPlan
          +            .get(serverToExclude);
          +        if (regionsOnDeadServer != null
          +            && regionsOnDeadServer.getRegionsFromRegionPlansForServer().
          +            contains(state.getRegion())) {
          +          if (newPlan) {
          +            this.regionPlans.remove(randomPlan.getRegionName());
          +            LOG
          +            .info("Server shutdown handler already in progress for the region "
          +                + randomPlan.getRegionName());
          +            randomPlan = RegionPlan.REGION_PLAN_ALREADY_INUSE;
          +          } else {
          +            this.regionPlans.remove(existingPlan.getRegionName());
          +            LOG
          +            .info("Server shutdown handler already in progress for the region "
          +                + existingPlan.getRegionName());
          +            existingPlan = RegionPlan.REGION_PLAN_ALREADY_INUSE;
          +          }
          +        }
          +      }
          

          Now getRegionPlan will see that SSH has already populated the data structure, so it will signal that this plan should not be used, and hence the assign does not happen.
          If instead getRegionPlan runs first, it will create a new region plan and add it to the plan map; the destination is guaranteed not to be the dead server, because that server will no longer be in the online-server list.
          If SSH's processServerShutdown starts after that, its iteration over the region plans will not find the dead server as a destination, so SSH need not bother about it.
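The ordering argument above boils down to SSH and getRegionPlan taking the same lock over a shared structure. Here is a minimal sketch of that coordination pattern; the class and method names are hypothetical stand-ins, not the actual HBase patch code:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: SSH and getRegionPlan coordinate through one lock.
class PlanCoordinator {
    // region name -> planned destination server
    private final Map<String, String> regionPlans = new HashMap<>();
    // servers whose shutdown is currently being processed by SSH
    private final Set<String> deadServersBeingProcessed = new HashSet<>();

    // Called by the ServerShutdownHandler before it starts reassigning.
    void markServerDead(String server) {
        synchronized (regionPlans) {
            deadServersBeingProcessed.add(server);
            // Drop plans targeting the dead server; SSH will reassign those regions.
            regionPlans.values().removeIf(dest -> dest.equals(server));
        }
    }

    // Called by assign(); returns null when SSH owns the reassignment,
    // mirroring the "plan already in use" signal in the discussion.
    String getRegionPlan(String region, String candidateServer) {
        synchronized (regionPlans) {
            if (deadServersBeingProcessed.contains(candidateServer)) {
                return null; // skip: shutdown handling is in progress elsewhere
            }
            regionPlans.put(region, candidateServer);
            return candidateServer;
        }
    }
}
```

Because both paths synchronize on the same monitor, whichever runs first is fully visible to the other, which is exactly the property the two orderings above rely on.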

          (we could add another field to RegionPlan with where it is in the assignment process?)

          In our first patch we used a boolean flag to indicate whether to use the plan or not. We later changed it to special states.

          chunhui shen added a comment -

          @ram
          Thanks for explaining the details about double assignment.

          Is there any possibility of having only one thread do the assign/unassign work?
          I mean, we would allow only one thread to handle RIT, and other threads would only add regions to RIT.

          ramkrishna.s.vasudevan added a comment -

          @Chunhui
          It would be good to have such a thing where only one thread acts on RIT, but the current code is not like that. I hope I got your question right? Correct me if I'm wrong.

          ramkrishna.s.vasudevan added a comment -

          @Stack
          Can you take a look at this if you have some time?

          stack added a comment -

          Yes. Been trying to get to it all day but looks like it'll be tomorrow now. Sorry for delay Ram.

          ramkrishna.s.vasudevan added a comment -

          No problem Stack. I know its late there already.

          stack added a comment -

          I think I understand this patch now.

          In AM, we introduce a new Map, a Map of dead servers to their current set of region plans and regions-in-transition.

          This Map is populated in AM when SSH calls AM to say process a server that just died (processServerShutdown).

          If we try to assign a region in another thread at this time, a check by AM into this new Map will cause us to return a 'DON'T DO ANYTHING' flag, because the assignment is being done elsewhere (in SSH eventually – and we return a flag rather than null because null triggers a different kind of special handling).

          If SSH is running, it is what should prevail at assigning regions, rather than any balance or user assignment that may have been made concurrently.

          When it finishes, SSH clears the dead server from the AM Map.

          Does this sound right?

          I have some feedback on details of this patch but I would first like to get high-level comments out of the way.

          1. Is this patch even handling the root cause of this JIRA, i.e. dealing w/ OPENING znodes that were made against the dead server? In AM#processServerShutdown, we iterate the list of regions we get back from AM#this.servers and from this set we remove RIT. But won't this original set of regions be missing regions that are OPENING, or even OPENED but not yet handled? It's only after the OPENED region has updated its znode, and the znode has been removed by the OPEN handler, that we add a region to AM#this.servers. Or is the OPENING region handled elsewhere?

          2. This patch distributes state, and the updating of that state, across AM and SSH. It makes it tough to follow what is going on (at least, it makes it tough for lesser mortals like myself to figure out). The new Map is populated as a side-effect of the call to AM#processServerShutdown. The dead server is cleared explicitly from the AM Map at the tail of SSH on its way out (reading SSH only, you'd be clueless why this was being done). During the processing of SSH, we are updating a data structure that is held by this Map over in AM. Is there anything we can do to better encapsulate this new state management? What if we changed the method in AM from processServerShutdown to startProcessingServerShutdown and closed it off with a finishProcessingServerShutdown? Do we have to keep around a list of region plans? Can't we just keep a list of dead servers, so that any attempt at getting a region plan related to a dead server, or a server being processed as dead, is put off because SSH will get to it?
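The start/finish bracketing proposed in point 2 could look roughly like the following. This is a hypothetical sketch: the class names and the body of process() are illustrative, only the start/finish method names come from the suggestion above:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the start/finish bracketing suggested above.
class AssignmentCoordinator {
    private final Set<String> serversBeingProcessed = ConcurrentHashMap.newKeySet();

    void startProcessingServerShutdown(String server) {
        serversBeingProcessed.add(server);
    }

    void finishProcessingServerShutdown(String server) {
        serversBeingProcessed.remove(server);
    }

    // getRegionPlan-style callers would consult this before choosing a destination.
    boolean isBeingProcessed(String server) {
        return serversBeingProcessed.contains(server);
    }
}

class ServerShutdownHandlerSketch {
    // try/finally guarantees the dead server is cleared even if
    // reassignment throws, so it cannot stay excluded forever.
    static void process(String serverName, AssignmentCoordinator am) {
        am.startProcessingServerShutdown(serverName);
        try {
            // ... reassign regions that were on, or headed to, serverName ...
        } finally {
            am.finishProcessingServerShutdown(serverName);
        }
    }
}
```

With both sides of the state change visible in SSH itself, a reader of SSH no longer needs to know that AM clears the Map as a side effect.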

          Here are some minor items while they are on my mind:

          1. On REGION_PLAN_ALREADY_INUSE, why not return null, and if we get a null back from getRegionPlan in assign, check whether serverManager.createDestinationServersList(serverToExclude) is empty; if it is, update the timeout monitor. This'd be cleaner than overriding the RegionPlan data structure to do more than just carry a plan?
          2. On the tail of SSH, we do this.services.getAssignmentManager().removeDeadServerRegions(this.serverName); Just before that we clear regionsFromRegionPlansForServer and regionsInTransition. The clearings are unnecessary?

          Good stuff Ram and Rajeshbabu. I think this patch can work after a little cleanup. I'll have some more feedback this evening. Thought I'd send this in meantime so you have something to chew on when you get up this morning.

          stack added a comment -

          My other high level comment is:

          3. This patch will not fix a concurrent balance that goes to move a region and a concurrent user-orchestrated move?

          Enis Soztutar added a comment -

          This issue, and the other related issues Ram has recently fixed, make me very nervous about all the state combinations distributed between zk / meta / rs-memory and master-memory. After this is done, do you think we can come up with a simpler design? I do not have any particular idea, so just spitballing here.

          stack added a comment -

          Suggestion.

          ramkrishna.s.vasudevan added a comment -

          a Map of dead servers to their current set of region plans and regions-in-transition.

          We are only maintaining the regions and the RITs, not the region plans.

          Does this sound right?

          Yes, exactly. All of this decision-making is based on the current regionPlan.

          Is this patch even handling the root cause of this JIRA: i.e. dealing w/ OPENING znodes that were made against the dead server?

          Yes.

          In AM#processServerShutdown, we iterate the list of regions we get back from AM#this.servers and from this set we will remove RIT. But won't this original set of regions be missing regions that are OPENING or even OPENED but not yet handled because its only after the OPENED region has updated its znode and the znode has been removed by the OPEN handler do we add a region to AM#this.servers? Or is the OPENING region handled elsewhere?

          To answer the first part of this question: say region R1 was initially on RS A and is now being moved to RS B. If RS A goes down, the existing SSH flow for RS A gets the regions from AM#this.servers and removes them from this.regions. Because something is already in RIT it will not assign the region, so the assignment proceeds smoothly to RS B.

          Now take the case where RS B goes down. In the existing code these regions were never assigned anywhere and were left to the TM to assign. That is where this patch comes in: it tries to assign them based on the region plan. When SSH sees that the dead server is the destination of a region plan, it adds the region to the new data structure and starts the assignment based on that. This region will certainly not be in AM#this.servers, since RS B never became its owner.

          To answer the second question, about OPENED regions: again this should not be a problem. There are two cases.
          1> OpenedRegionHandler has completed. -> this.servers is already updated, and the region plan is cleared. When SSH starts, it will see the region in AM#this.servers with nothing in RIT, so SSH will find this region in its meta scan and assign it to a new RS. (The patch is not needed here.)
          2> OpenedRegionHandler has not yet completed. -> The region plan is still available, but this.servers and this.regions are not yet updated. As per the patch:

            if (rit != null && !rit.isClosing() && !rit.isPendingClose() && !rit.isSplitting()
          +                && !regionsFromRegionPlansForServer.contains(rit.getRegion())) {
          

          The new data structure contains that region, hence we do not skip the assignment and go ahead with it. The region got added there because the region plan has the dead server as its destination.

          Is there anything we can do to better encapsulate this new state management?

          Yes, we will try to do that.

          Can't we just keep list of dead servers and any attempt at getting a region plan related to the dead server or server being processed as dead is put off because SSH will get to it?

          We are already doing this. But we cannot guarantee at what point the RS went down: after forming the region plan, or before. If it was before forming the region plan,
          we are already excluding the dead servers.

          Please correct me if I am wrong, or if you have any doubts.

          ramkrishna.s.vasudevan added a comment -

          This patch will not fix a concurrent balance that goes to move a region and a concurrent user-orchestrated move?

          Currently it does not handle this. But if we are sure that region assignment goes smoothly after this patch, we can surely add a mechanism to inform the user that a plan is already in progress, and wait for it to complete before initiating the user-triggered move or assign.
          @Enis

          very nervous about all the state combinations distributed between zk / meta / rs-memory and master-memory.

          Can you tell me any problems that you foresee? Happy to address them here too. In fact we have not introduced any new states; we are just using the maps and ZooKeeper states already available. The RegionPlan state we introduced here is something internal to tell the difference, which we will address in our next patch per Stack's suggestion.
          Your comments and feedback are always welcome.

          ramkrishna.s.vasudevan added a comment -

          @Stack
          Our comments crossed. Once I refreshed I saw you have attached a patch. Let me take a look at it. Thanks.

          stack added a comment -

          @Enis Yeah, it's a big problem.

          Ram and Rajeshbabu, I just uploaded a spin on your patch. It's a bit more basic.

          It gets rid of REGION_PLAN_ALREADY_INUSE and returns null out of getRegionPlan instead.

          If we get back a null plan, we update the no-servers flag when there are no assignable servers; otherwise we presume the region assignment is being handled elsewhere (in SSH or in the timeout monitor).

          In AM, I changed the processServerShutdown to instead be processServerShutdownStart with a corresponding processServerShutdownFinish. When the first is called, we add the servername being processed to a list. When the latter is called, we remove it. Getting a regionplan, we now exclude servers being processed from the set of assignable servers. This simplifies getRegionPlan.

          processServerShutdownStart returns the Set of outstandingRegionPlans – plans that were related to the server that is being processed by shutdown handler – and the intersection of regions-in-transition with the set of regions the server was carrying. The patch then does as you were doing minus the removing of region from the Set that was kept back in AM.

          I've not verified this passes tests yet. It still needs more work. Just putting it up as more feedback on your posted patch.

          Regarding Enis' comment, I tried poking around some and it's hard to follow state transitions across memory, zk, and then across retries and forced repossessions. Any ideas for how we'd simplify all of this? I started to pull out a RegionsInTransition class. It would be a 'database' backed by zk. There would be no more in-memory state, just what's out in zk. You'd go to this new RIT db to get current state and to ask it to make transitions. Just a thought. I haven't dug in enough.
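That single-source-of-truth RIT store could be sketched as below. This is purely hypothetical (the interface, names, and the in-memory stand-in for zk are all illustrative, not HBase code); the point is that every state transition goes through one store with compare-and-swap semantics instead of being mirrored in several memories:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a single-source-of-truth RIT store.
interface RegionsInTransitionStore {
    String getState(String region);                        // null if not in transition
    boolean transition(String region, String from, String to);
}

// In-memory stand-in for the zk-backed store the comment proposes.
class InMemoryRitStore implements RegionsInTransitionStore {
    private final Map<String, String> states = new ConcurrentHashMap<>();

    public String getState(String region) {
        return states.get(region);
    }

    // Compare-and-swap style transition: succeeds only from the expected
    // prior state, so two callers cannot both win the same transition.
    public boolean transition(String region, String from, String to) {
        if (from == null) {
            return states.putIfAbsent(region, to) == null;
        }
        return states.replace(region, from, to);
    }
}
```

With a store like this, a stale actor (say, a retry loop still believing the region is OFFLINE) simply loses the CAS, instead of the master having to abort on an unexpected state.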

          stack added a comment -

          We are only maintaining regions and the RITS. not the regionplans

          Yes. I should have said that.

          Thanks for the explanation of OPENING. I now understand. Do you think we could get your explanation into comments in the code in any way to help those that come after?

          ... But we cannot guarentee at what point the RS had gone down, is it after forming the regionPlan or before forming the regionplan. IF it is before forming the regionplan

          we are already excluding the dead servers.

          Hmm... ok. That means that in the patch I uploaded, the new Set of servers being processed by the shutdown handler is redundant with the deadServers Set that the ServerManager keeps up. I can simplify my patch further (if you think the way getRegionPlan works in my patch is ok).

          stack added a comment -

          v2 is more basic still.

          It undoes the AM#processServerShutdownStart and AM#processServerShutdownFinish. This is not needed since servers that are in the process of being handled by ServerShutdownHandler will be excluded from the list of assignable servers returned out of the ServerManager.
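The exclusion described above amounts to filtering the candidate destination list. A minimal sketch of that filter, loosely modeled on the ServerManager#createDestinationServersList call mentioned earlier (the class name and signature here are hypothetical):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: destination candidates are the online servers minus
// those being handled by the ServerShutdownHandler and minus an explicit
// server to exclude (e.g. the one an assign just failed against).
class DestinationServers {
    static List<String> createDestinationServersList(Collection<String> onlineServers,
            Set<String> deadOrBeingProcessed, String serverToExclude) {
        List<String> candidates = new ArrayList<>();
        for (String s : onlineServers) {
            if (deadOrBeingProcessed.contains(s)) continue; // SSH will handle these
            if (s.equals(serverToExclude)) continue;        // just failed here
            candidates.add(s);
        }
        return candidates;
    }
}
```

If the filtered list comes back empty, the caller can fall back to the timeout monitor rather than retrying against a server that is going away.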

          See what you think, Ram. If you think this more basic patch is viable, I'll go ahead and make it work w/ your unit tests.

          The remaining items are the isNullPlan and the return of a Pair out of AM#processServerShutdown. This stuff could be improved. I'd also wind in your comments from this issue since they help with what is going on.

          ramkrishna.s.vasudevan added a comment -

          @Stack
          Just to reiterate, this JIRA is a backport of HBASE-5396. The current patch that you gave is similar to HBASE-5396, and it will solve the problem of assigning a region that had started opening on an RS when that RS went down.
          But if we don't have any common sharing between AM and SSH

          RegionPlan regionPlan = getRegionPlan(regionState, sn, true);
          

          In the first assign call we form a region plan. Say it so happened that we got the plan and issued sendRegionOpen(), and by that time the RS had gone down. Now as per your patch (HBASE-5396) the SSH will trigger and start a new assignment. At the same time:

          LOG.warn("Failed assignment of " +
            state.getRegion().getRegionNameAsString() + " to " +
            plan.getDestination() + ", trying to assign elsewhere instead; " +
            "retry=" + i, t);
          // Clean out plan we failed execute and one that doesn't look like it'll
          // succeed anyways; we need a new plan!
          // Transition back to OFFLINE
          state.update(RegionState.State.OFFLINE);
          // Force a new plan and reassign.  Will return null if no servers.
          RegionPlan newPlan = getRegionPlan(state, plan.getDestination(), true);
          if (isNullPlan(newPlan, state, plan.getDestination())) {
            // Whether no servers to assign to or region assignment is being handled
            // elsewhere, skip out of this retry loop.
            break;
          

          Excluding this server, we will call assign once more because of the retry logic, and that is where there is a chance of hitting the master abort problem. In the current patch that you gave, getRegionPlan will create the above problem.

          This is where we need a common structure, and also a new region plan state, to indicate through AM whether to make another assign call or not.

              if (!hijack && !state.isClosed() && !state.isOffline()) {
                String msg = "Unexpected state : " + state + " .. Cannot transit it to OFFLINE.";
                this.master.abort(msg, new IllegalStateException(msg));
                return -1;
              }
          

          Now, the master abort problem was reported in HBASE-5816 (because HBASE-5396 is committed to 0.90). What we are doing now is solving this as a whole, so that we don't end up with any double assignment or any master abort problem.

          One more problem in the patch: 'outstandingRegionPlans' should be cleared here:

          } else if (addressFromAM != null && !addressFromAM.equals(this.serverName)) {
          +              LOG.debug("Skip assigning region " + e.getKey().getRegionNameAsString()
          +                  + " because it has been opened in " + addressFromAM.getServerName());
          +            }
          

          Stack, I feel that sharing something common is necessary to avoid this problem.
          Just to add: in HBASE-6147 I have given one patch which, when combined with the patch here, will solve many other issues also.

          Do let me know if you still have doubts on this.

          stack added a comment -

          Excluding this server we will call one more assign because of retry logic. And that is where there is a chance of getting the master abort problem. In the current patch that u gave getRegionPlan will create the above problem.

          Because we will create a new plan, and though the new plan will not land on the server that is being processed by the shutdown handler, the assign trying to update the znode will bump into the IllegalStateException because SSH may at the same time be trying to assign? If so, that is true. That is a problem.

          But Ram, the problem w/ your patch is that getRegionPlan is being called with forceNewPlan=true and you are NOT forcing a new plan, right? That is going to confuse no end. We should break up getRegionPlan into one method that gets the region plan and another that decides whether to proceed w/ the assign.
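A rough sketch of that split (hypothetical names, not the actual AssignmentManager API): one method only fetches or creates the plan, and a separate check decides whether single-assign should still proceed, so forceNewPlan is not overloaded to carry that meaning.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PlanSplitSketch {
  // Hypothetical plan states; HANDLED_ELSEWHERE would be set by SSH to tell
  // the single-assign retry loop to stand down.
  enum PlanState { NEW, IN_PROGRESS, HANDLED_ELSEWHERE }

  static class RegionPlan {
    final String region;
    final String destination;
    volatile PlanState state = PlanState.NEW;
    RegionPlan(String region, String destination) {
      this.region = region;
      this.destination = destination;
    }
  }

  private final Map<String, RegionPlan> plans = new ConcurrentHashMap<>();

  // Step 1: plan getting only -- no hidden decisions about retrying.
  RegionPlan getPlan(String region, String destination) {
    return plans.computeIfAbsent(region, r -> new RegionPlan(r, destination));
  }

  // Step 2: a separate method answering "should single-assign proceed?".
  boolean shouldProceed(RegionPlan plan) {
    return plan.state != PlanState.HANDLED_ELSEWHERE;
  }
}
```

With this shape, SSH could flip a plan to HANDLED_ELSEWHERE in the shared map, and the retry loop would check shouldProceed() before touching the znode again.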

          One more problem in patch is 'outstandingRegionPlans' should be cleared here...

          Yes. That is to prevent it being reassigned by SSH. Good one.

          I appreciate your bringing me along. Will look more when get to work.... will look at hbase-6147 too.

          stack added a comment -

          Just some notes after looking more at this stuff.

          We have two different assigns; one that assigns a single region and another that assigns multiple regions in one go. The first takes out a lock on the RegionState before it runs and is careful about using existing assignment plans if they exist. It'll prevent concurrent assigns, but it does not prevent HBASE-5816, where two assigns of the same region around the same time will cause a master exit because the second assign will find zk in an "illegal" state.

          The second assign does not respect existing plans, nor does it synchronize on RegionState. This second assign type is used for bulk assigning. SSH now does a bulk assign, since HBASE-5914, a recent change, once it has figured out the regions that need assigning. The patch here is careful about making sure the bulk assign is favored. It tries to undo any ongoing assigns that are happening via the first method.
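The asymmetry described above can be modeled in a few lines (hypothetical classes, not the real AssignmentManager): the single assign synchronizes on the RegionState and respects an existing plan, while the bulk assign does neither.

```java
import java.util.List;

public class AssignModel {
  static class RegionState {
    final String region;
    String plan;  // destination server from an existing plan, if any
    RegionState(String region) { this.region = region; }
  }

  // Single assign: locks the RegionState and respects an existing plan.
  void assign(RegionState state, String defaultServer) {
    synchronized (state) {
      if (state.plan == null) {
        state.plan = defaultServer;
      }
      // ... send the open rpc to state.plan ...
    }
  }

  // Bulk assign (the SSH path since HBASE-5914): no per-region lock and
  // no respect for existing plans -- it simply overwrites them.
  void bulkAssign(List<RegionState> states, String server) {
    for (RegionState s : states) {
      s.plan = server;
      // ... batch the open rpcs ...
    }
  }
}
```

Two actors running these two paths against the same region around the same time is exactly the HBASE-5816 shape: the second one finds the state (or the znode) already moved by the first.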

          stack added a comment -

          What about this, Ram? It's about four lines. I haven't tried it yet. Would need clean up. Just seeing what you think. It just treats OPENING regions that are on the dead server as it does CLOSING, CLOSED, etc., so they get assigned as part of the SSH bulk assign.

          Should not cause an HBASE-5816 because that issue came from our including in the bulk assign regions that could have been in OFFLINE/PENDING_OPEN RegionState; i.e. regions being handled by the 'normal', single assign.

          rajeshbabu added a comment -

          @Stack

          It just treats OPENING regions that are on the dead server as it does CLOSING, CLOSED, etc., so it gets assigned as part of the SSH bulk assign.

          We need to consider regions in OPEN state also (along with OPENING) for assignment.

          Should not cause a HBASE-5816 because that was our including in the bulk assign regions that could have been in OFFLINE/PENDING_OPEN RegionState; i.e. regions being handled by the 'normal', single assign.

          You mean to say skip assignment of the regions in OFFLINE or PENDING_OPEN while processing server shutdown, because they will be assigned as part of the retry?
          If this is the case, there is one problem:
          -> Let's suppose the region server went down after spawning OpenRegionHandler and before transitioning to OPENING; then it is SSH's responsibility to assign the regions in OFFLINE/PENDING_OPEN.

          Alternative 2 itself may not solve HBASE-6060 because

                    hris = MetaReader.getServerUserRegions(this.server.getCatalogTracker(),
                      this.serverName);
          

          during region opening, if the destination server details are not yet updated in META, then the region won't be present in hris. In that case we are not calling assign for the regions which are opening (region info not present in hris) on the region server, and we are left waiting for the timeout monitor to trigger assignment.

          Please correct me if I am wrong.

          stack added a comment -

          We need to consider regions in OPEN state also(along with OPENING) for assignment.

          Yes. Makes sense.

          you mean to say skip assignment of the regions in OFFLINE or PENDING_OPEN while processing server shutdown, because they will be assigned as part of retry?

          Yes.

          Lets suppose Region server went down after spawning OpenRegionHandler and before transitioning to OPENING then its SSH responsibility to assign regions in OFFLINE/PENDING_OPEN.

          This is a good one. I'd argue this is a hole in our transition states. The region is OFFLINE but the regionserver is supposed to have control. I'd suggest that a regionserver not return from open region until it has moved the znode from PENDING_OPEN to OPENING (need to look more into what would be involved making this happen – and what to do on bulk assign, though it seems we drop a bunch of the handling already when bulk assigning). Shouldn't a region belong to the master or the regionserver, without a gray area in-between, while PENDING_OPEN is going on?

          during opening region if destination server details not updated in META then the region wont present in hris...

          This is also a good one. I found it earlier this afternoon making your tests work w/ the alternate patch (There is another 'hole' that I found in the case where there are no regions on the server that died; in this case we'll skip out early and not find associated OPENING or OPEN regions – my patch is much larger than four lines now (smile)).

          What do you think of this alternate solution, Rajeshbabu? Assuming I close the above holes, do you think it is easier to reason about? We don't have to keep state distributed across AM and SSH (I think it's more clear what's going on), we don't have to use a special RegionPlan to signal the single-region assign method to give up on its assign (I'd think we'd want regions assigned by the single-assign call rather than the bulk assign since the single assign does more checking), and we are not built atop the current state of region plans, which seems a little bit of an indirect foundation given we are trying to keep up RegionState in regionsInTransition.

          Thanks for the thorough review.

          ramkrishna.s.vasudevan added a comment -

          @Rajesh
          Thanks for taking care of the reviews.
          @Stack

          I'd think we'd want regions assigned by the single-assign call rather than the bulk assign since the single assign does more checking)

          Yes, I agree. But the gain with bulk assignment is that it's faster, which is what we wanted in HBASE-5914.

          Shouldn't region belong to master or regionserver without a gray area in-between while PENDING_OPEN is going on?

          Stack, in this case the region first belongs to the master. Only after the RS changes the znode to OPENING does the region belong to the RS. This still happens as per the current code. In fact, this behaviour of using the owner of the node is the logic behind how we do processRIT on a master restart.

          else if (data.getOrigin() != null &&
                      !serverManager.isServerOnline(data.getOrigin())) {
                    // to handle cases where offline node is created but sendRegionOpen
                    // RPC is not yet sent
                    addToRITandCallClose(regionInfo, RegionState.State.OFFLINE, data);
          

          But what if the RS goes down before sending the RPC call to RS and that is where the retry logic comes, which i feel is really needed.

          rajeshbabu added a comment -

          @Ram
          Attached patch. Just a suggestion: skip the regions in OFFLINE/PENDING_OPEN when the RS went down before the open request was sent.

          But need to handle this case:

          Lets suppose Region server went down after spawning OpenRegionHandler and before transitioning to OPENING then its SSH responsibility to assign regions in OFFLINE/PENDING_OPEN.

          In this patch, we skip regions to be assigned in case the RS went down before the request:

                if ((region.getState() == RegionState.State.OFFLINE)
                    && (region.getState() == RegionState.State.PENDING_OPEN)) {
                  regionPlans.remove(region.getRegion());
                }
          
          stack added a comment -

          But the gain with bulk assignment is its faster which we wanted in HBASE-5914.

          Yes. We are keeping bulk assigns; it's just that the suggested patch favors the single-assign, letting it do retries, whereas the original patch wants to do everything via bulk assign, canceling any ongoing single-assigns.

          Only after the RS changes to OPENING the znode of the regions belongs to the RS.

          Yes, this is the case for znode states. The problem is the PENDING_OPEN state, an in-memory state kept in RegionState. It exists from the sending of the open rpc to the regionserver until the OPENING callback comes into the master saying the regionserver has moved the znode from OFFLINE to OPENING. This is the messy state.

          Infact this behaviour of using the owner of the node is the logic behind how we do processRIT on a master restart.

          The context is different I think; there is no ongoing retry of single-assign here. It's a cleaner case.

          But what if the RS goes down before sending the RPC call to RS and that is where the retry logic comes, which i feel is really needed.

          Do you mean the master sending the RPC?

          stack added a comment -

          @rajeshbabu So, you are allowing the single-assign to continue its retries? Regions that are OFFLINE or PENDING_OPEN are not bulk assigned? (Yes, there is still the case of the RS going down before it moves the region to OPENING.)

          rajeshbabu added a comment -

          @Stack

          you are allowing the single-assign to continue its retries?

          Yes.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12531590/6060_suggestion_toassign_rs_wentdown_beforerequest.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 5 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in .

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2134//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2134//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2134//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2134//console

          This message is automatically generated.

          Ted Yu added a comment -

          For 6060_suggestion_toassign_rs_wentdown_beforerequest.patch:

          Can you give the following variable a better name?

          +    Set<HRegionInfo> regionPlans = new ConcurrentSkipListSet<HRegionInfo>();
          

          The set doesn't hold region plans. The following javadoc needs to be adjusted accordingly.

          +   * @return Pair that has all regionplans that pertain to this dead server and a list that has
          
          +      if ((region.getState() == RegionState.State.OFFLINE)
          +          && (region.getState() == RegionState.State.PENDING_OPEN)) {
          

          A region cannot be in both states at the same time; '||' should be used instead of '&&'.
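A standalone illustration of the point (plain Java enums, not the HBase classes): with '&&' the condition can never be true, since one value cannot equal two distinct enum constants, so no region would ever be skipped.

```java
public class StateCheckDemo {
  enum State { OFFLINE, PENDING_OPEN, OPENING, OPEN }

  // The form from the patch: always false for every possible state.
  static boolean buggyCheck(State s) {
    return (s == State.OFFLINE) && (s == State.PENDING_OPEN);
  }

  // The corrected form: true if the region is in either state.
  static boolean fixedCheck(State s) {
    return (s == State.OFFLINE) || (s == State.PENDING_OPEN);
  }
}
```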

          +        deadRegions = new TreeSet<HRegionInfo>(assignedRegions);
          

          Since the initialization of deadRegions above is in a different code block from the following:

                 if (deadRegions.remove(region.getRegion())) {
          

          Running testSSHWhenSourceRSandDestRSInRegionPlanGoneDown (from v3) would lead to NPE w.r.t. deadRegions

          After fixing the above, testSSHWhenSourceRSandDestRSInRegionPlanGoneDown still fails.

          ramkrishna.s.vasudevan added a comment -

          @Ted
          Either Rajesh or I will investigate the test failures tomorrow once we reach the office.

          stack added a comment -

          Rajesh says:

          Lets suppose Region server went down after spawning OpenRegionHandler and before transitioning to OPENING then its SSH responsibility to assign regions in OFFLINE/PENDING_OPEN.

          I say:
          Shouldn't region belong to master or regionserver without a gray area in-between while PENDING_OPEN is going on?

          Ram says:

          Stack, in this case first the region belongs to master. Only after the RS changes to OPENING the znode of the regions belongs to the RS.

          So, Rajesh identifies a hole, I claim the hole is murky and nonspecific, and Ram claims there is no hole really.

          Below I argue that there is a hole and a small change cleans up RegionState states making SSH processing more clean.

          Ram, what you say is true if you are looking at znode states only. If you are looking at RegionState, the in-memory reflection of what a region's state is according to the master, then what PENDING_OPEN covers, a state that has no corresponding znode state, is unclear.

          I want to rely on what's in RegionState when figuring out what SSH should process (Rajesh's latest patch seems to want to walk this path too).

          PENDING_OPEN currently spans the master sending the open rpc. It is set before we do the rpc invocation, so if the regionserver goes down while a region's state is PENDING_OPEN, should the region be handled by SSH or will it get retried by the single-assign method? I can't tell for sure. If the regionserver went down while the rpc was outstanding, the single-assign will retry. It will actually set the RegionState back to OFFLINE temporarily – which makes it even harder to figure out what's going on when looking from another thread. PENDING_OPEN as-is is worse than useless.

          How about this, Ram and Rajesh:

          +++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
          @@ -1715,14 +1715,18 @@ public class AssignmentManager extends ZooKeeperListener {
                 try {
                   LOG.debug("Assigning region " + state.getRegion().getRegionNameAsString() +
                     " to " + plan.getDestination().toString());
          -        // Transition RegionState to PENDING_OPEN
          -        state.update(RegionState.State.PENDING_OPEN, System.currentTimeMillis(),
          -            plan.getDestination());
                   // Send OPEN RPC. This can fail if the server on other end is is not up.
                   // Pass the version that was obtained while setting the node to OFFLINE.
                   RegionOpeningState regionOpenState = serverManager.sendRegionOpen(plan
                       .getDestination(), state.getRegion(), versionOfOfflineNode);
          -        if (regionOpenState == RegionOpeningState.ALREADY_OPENED) {
          +        if (regionOpenState.equals(RegionOpeningState.OPENED)) {
          +          // Transition RegionState to PENDING_OPEN. It covers the period between the send of the
          +          // rpc and our getting the callback setting the region state to OPENING.  This is
          +          // in-memory only change.   Out in zk the znode is OFFLINE and we are waiting on
          +          // regionserver to assume ownership by moving it to OPENING.
          +          state.update(RegionState.State.PENDING_OPEN, System.currentTimeMillis(),
          +            plan.getDestination());
          +        } else if (regionOpenState == RegionOpeningState.ALREADY_OPENED) {
                     // Remove region from in-memory transition and unassigned node from ZK
                     // While trying to enable the table the regions of the table were
                     // already enabled.
          

          Here we set the region to PENDING_OPEN AFTER we send the open rpc. Now we know that a region that is PENDING_OPEN will not be retried by the single-assign, and the state is clear; it's the period after the open rpc but before we get the znode callback that sets the RegionState to OPENING.

          Over in SSH, I can safely add PENDING_OPEN regions to the set of those to bulk assign if they are against the dead server currently being processed.

          What do you fellas think?

          I need to look at OFFLINE states too, to see if they will always get retried by single-assign. If so, we can leave these out of the SSH recovery.
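The proposed ordering can be sketched in miniature (hypothetical names, not the actual AssignmentManager code; the real method also deals with znode versions, locking, and exceptions):

```java
// Simplified model of the proposed ordering: the master only moves a region
// to PENDING_OPEN after the open rpc has succeeded, so a PENDING_OPEN region
// is unambiguously one the regionserver has accepted.
enum RegionState { OFFLINE, PENDING_OPEN, OPENING, OPENED }

public class AssignSketch {
    // stands in for serverManager.sendRegionOpen(); returns false if the rs is down
    static boolean sendRegionOpen(boolean rsUp) { return rsUp; }

    static RegionState assign(RegionState state, boolean rsUp) {
        // state is OFFLINE when we enter; the rpc is sent first
        if (sendRegionOpen(rsUp)) {
            // only now do we record PENDING_OPEN (in-memory only; the znode stays OFFLINE)
            return RegionState.PENDING_OPEN;
        }
        // rpc failed: stay OFFLINE so the single-assign retry loop picks it up
        return RegionState.OFFLINE;
    }

    public static void main(String[] args) {
        System.out.println(assign(RegionState.OFFLINE, true));   // PENDING_OPEN
        System.out.println(assign(RegionState.OFFLINE, false));  // OFFLINE
    }
}
```

With this ordering, SSH can treat PENDING_OPEN against a dead server as a region that must be reassigned, while a failed rpc leaves the region in OFFLINE for the retry loop.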

          ramkrishna.s.vasudevan added a comment -

          @Stack
          A few comments:
          First of all, we need to make some more changes if we want to make the above change in handleRegion:

                    // Should see OPENING after we have asked it to OPEN or additional
                    // times after already being in state of OPENING
                    if (regionState == null ||
                        (!regionState.isPendingOpen() && !regionState.isOpening())) {
                      LOG.warn("Received OPENING for region " +
                          prettyPrintedRegionName +
                          " from server " + data.getOrigin() + " but region was in " +
                          " the state " + regionState + " and not " +
                          "in expected PENDING_OPEN or OPENING states");
          

          Second: do you mean the above change of setting PENDING_OPEN after the RPC call, along with the changes in the earlier patches, like the new items 'ritintersection' and 'outstandingRegionPlans'?
          One concern here: if, before we update the state to PENDING_OPEN on the master side, the OpenRegionHandler has already transitioned the node to OPENING, then the in-memory state will also change to OPENING once we correct the code I mentioned above. In that case the OPENING state would be rewritten to PENDING_OPEN? So we may need to add a check to see whether the RegionState has already changed.
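The check Ram suggests could look roughly like this (a hypothetical guard with made-up names, not actual HBase code):

```java
// Guard for the race described above: if the zk callback has already advanced
// the in-memory state to OPENING, the post-rpc update must not rewind it back
// to PENDING_OPEN.
public class GuardedUpdate {
    enum State { OFFLINE, PENDING_OPEN, OPENING }

    static State toPendingOpenIfNotAdvanced(State current) {
        synchronized (GuardedUpdate.class) { // stands in for the lock on RegionState
            // only move forward from OFFLINE; never rewind a state the rs already set
            return current == State.OFFLINE ? State.PENDING_OPEN : current;
        }
    }

    public static void main(String[] args) {
        System.out.println(toPendingOpenIfNotAdvanced(State.OFFLINE)); // PENDING_OPEN
        System.out.println(toPendingOpenIfNotAdvanced(State.OPENING)); // OPENING
    }
}
```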

          The problem with Rajesh's patch (6060_suggestion_toassign_rs_wentdown_beforerequest.patch) is:

          +      if ((region.getState() == RegionState.State.OFFLINE)
          +          && (region.getState() == RegionState.State.PENDING_OPEN)) {
          +        regionPlans.remove(region.getRegion());
          +      }
          

          What if the RS went down before sending the RPC? The SSH collected the regionPlan, but before it could remove the collected regionplan in the above code, if the master completed the retry assignment and the OpenRegionHandler has changed the node to OPENING and the state is now OPENING, then we will again try to assign through SSH.

          The above problem can also happen with the change that you mentioned, moving PENDING_OPEN after the RPC.

          So can we take a copy of RIT before forming the RegionPlan and work based on that? Will update the change that we are suggesting in some time.

          stack added a comment -

          What do I have to change in that section of code? It seems fine as is. Or are you thinking we need to allow it being in OFFLINE state?

          Yes, agree that before setting PENDING_OPEN, we need to ensure it has not moved to OPENING. That should be easy enough (in fact, it's not possible now that I think about it, since there is a lock on RegionState that surrounds the single-assign – need to make sure handleRegion respects it).

          What do you think of this approach Ram? The one where we rely on RegionState rather than on RegionPlan?

          I won't have the issue you see in Rajesh's patch, right? I don't intend to collect regions from outstanding regionplans. If the RS goes down before sending the RPC, then the single-assign will fail and retry, but it won't be included in the SSH set?

          ramkrishna.s.vasudevan added a comment -

          I don't intend to collect regions from outstanding regionplans

          I think we need this (like the one in the earlier patches). Or else, if the region has been moved to OPENING and the destination RS goes down, how do we try to process it? It will not be in META either, as it is still not opened in the destination RS. So a logic similar to Rajesh's patch will be needed, but that along with your suggestion needs some minor tweaks.
          @Rajesh
          Correct me if I am wrong. Is this what we discussed today?
          @Stack
          We had a patch but could not upload it from the office due to some other tasks. Maybe we can upload one by tomorrow morning once we reach the office.
          Please correct me if I am wrong. (We discussed a lot of scenarios today too; I may be missing some things here.)

          stack added a comment -

          Or else if the region has been moved to OPENING and the destination RS goes down how do we try to process them.

          The RegionState will have the server that reported the region in OPENING. We don't need to get it from RegionPlan (in AM#processServerShutdown, as part of the run through all RITs, we need to look for RegionStates that are against the server that has just died).

          Looking forward to your patch tomorrow. I'll upload one too (working on tests today).
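The RIT scan described above might be sketched like this (hypothetical, simplified types; the real AM keeps full RegionState objects keyed by region name):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the SSH-side scan: walk the regions in transition and collect
// those whose RegionState is pinned to the server that just died.
public class DeadServerScan {
    // minimal stand-in for the master's in-memory RegionState entries
    static class RS {
        final String region, state, server;
        RS(String region, String state, String server) {
            this.region = region; this.state = state; this.server = server;
        }
    }

    static List<String> regionsToReassign(List<RS> rits, String deadServer) {
        List<String> out = new ArrayList<>();
        for (RS rs : rits) {
            // OPENING or PENDING_OPEN against the dead server: the master owns recovery
            if (deadServer.equals(rs.server)
                && ("OPENING".equals(rs.state) || "PENDING_OPEN".equals(rs.state))) {
                out.add(rs.region);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<RS> rits = new ArrayList<>();
        rits.add(new RS("r1", "OPENING", "rs1"));
        rits.add(new RS("r2", "PENDING_OPEN", "rs2"));
        rits.add(new RS("r3", "OPENING", "rs2"));
        System.out.println(regionsToReassign(rits, "rs2")); // [r2, r3]
    }
}
```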

          ramkrishna.s.vasudevan added a comment -

          There is one more problem in the latest patch. Uploading this just to sync up on whether this is what you are trying, Stack.

          Now, after calling openRegion, suppose the OpenRegionHandler is spawned and just after that the RS goes down, while PENDING_OPEN is still not yet updated by the time SSH starts processing. Then, as per the updated latest patch, we will not be able to call assign. No testcase runs were done with this patch; it needs modifications.

          stack added a comment -

          Now after calling openregion if the OpenRegionHandler is spawned and just after that the RS goes down.And still the PENDING_OPEN is not yet updated and by the time SSH starts processing. Then as per the updated latest patch we will not be able to call assign.

          Can you explain more, Ram? The rpc open has been sent (successfully), the RS goes down, but we have not set it to PENDING_OPEN – and at this time the SSH runs? So it's the window between the open rpc return and the setting of PENDING_OPEN – in this window, if the SSH runs to completion we'll have an unassigned region, one that will have to be picked up by the timeout monitor? Is that what you are suggesting?

          Thanks Ram (Sorry that this stuff is taking so long...)

          ramkrishna.s.vasudevan added a comment -

          @Stack
          Absolutely correct. Maybe you have some other idea for that?

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12531775/HBASE-6060-trunk_4.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 5 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster
          org.apache.hadoop.hbase.util.TestFSUtils
          org.apache.hadoop.hbase.catalog.TestMetaReaderEditor

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2145//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2145//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2145//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2145//console

          This message is automatically generated.

          stack added a comment -

          @Ram So, the rpc returns successfully. The very next step is to set RegionState to PENDING_OPEN. Somehow, in between these two points, a timeout of the RS znode will expire, a verification of ROOT and META will take place, a full scan of META (a new RPC) will happen, and all before we get to setting the PENDING_OPEN state? Seems unlikely. If it ever happens, I suppose this would be one for the timeout monitor to clean up? Or do you have a scenario where this case is more likely?

          ramkrishna.s.vasudevan added a comment -

          This is the open point that we see in moving PENDING_OPEN after the rpc. As far as we brainstormed, we did not find any other gaps in this. But our initial patch that shares state across AM and SSH solves all problems.
          Do you have any new patch, Stack? Maybe we need to work on testcases if the latest patch is fine with you. Thanks a lot for spending your time on this defect with your feedback and nice new ideas.

          rajeshbabu added a comment -

          Patch with some test cases and corrections. TestAssignmentManager passes.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12531867/HBASE-6060_trunk_5.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 5 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.regionserver.TestAtomicOperation
          org.apache.hadoop.hbase.master.TestDistributedLogSplitting
          org.apache.hadoop.hbase.master.TestSplitLogManager

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2154//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2154//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2154//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2154//console

          This message is automatically generated.

          stack added a comment -

          @Ram Thinking on it, the way to plug the hole you and Rajesh have identified is to have the RS update the znode from OFFLINE to OPENING before returning from the open rpc call.

          The way I see it, we've found problems in our RegionState. As it is now, we cannot ask RegionState to reliably give the state of a region. It's so bad, you fellas went and made a solution for figuring where a region is at using another system altogether, the RegionPlan state, which strikes me as a little odd; because the system we should be relying on is broke – RegionState – you fellas look at the ghost traces of region moves, the impression left in RegionPlan. Is this a fair characterization?

          Don't you think we should fix RegionState while we're in here?

          I've been working on a patch but was going to do it piecemeal, fix hbase-6199 first?

          ramkrishna.s.vasudevan added a comment -

          Is this a fair characterization?

          RegionState is something the master uses to see the current progress of the assignment. But RegionPlan is the one thing that tells me where the assignment is actually going on.
          If the master wants to start a new assignment, it is again the RegionPlan that gets changed. So we thought of using that as an indicator. But I agree two systems, AM and SSH, take decisions based on that.

          One more thing on the RegionState in RIT is that it is reactive: it gets updated based on changes in the RS. But the RegionPlan is the one decided by the master.
          Every time assignment starts, the RegionState in RIT goes through a set of steps, and maybe that is why we are not sure what step the RIT is in and who made that change.

          We also thought of one approach: can we remove the retry logic itself in assign? Always use SSH to trigger assignment, as in TRUNK the expiry of a server is detected almost immediately now. Maybe there are some gaps we have not yet found.

          I should accept one thing: I did not try to change the RegionState in RIT and was looking to use the RegionPlan only.

          RS update the znode from OFFLINE to OPENING before returning from the open rpc call.

          Then this step has to be moved out of OpenRegionHandler?

          stack added a comment -

          But i agree two systems AM and SSH take decision based on that..

          Sorry, I don't follow? Are you saying we should rely on RegionState? Or on RegionPlan? (Plan is what we 'want' to happen; RegionState should be where regions actually 'are'. I'd think we'd want to make decisions based on current state rather than planned state?)

          One more thing on RegionState in RIT is that it is reactive....Every time Assignment starts RegionState in RIT goes thro a set of steps and may be that is why we are not sure as in what step the RIT is and who made that change.

          In SSH, we need the answer to one basic question only; i.e. who owns the region, master or regionserver. Unless you have a better idea, I think setting the znode to OPENING before returning from the open rpc is necessary to plug the gray area you fellas identified.

          We also tried thought of one approach, like can we remove the retry logic itself in assign?

          It seems like the retry has been there a long time (originally we just recursed calling assign on exception, but in HBASE-3263 we changed it to a bounded loop). The retry seems ok since we'll try a different server if we fail on the first plan. We also like this single-assign method because it is more rigorous about state management than bulk-assign. It's only a problem if there's a concurrent assign when we are unsure who is responsible for the region.
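The bounded retry can be sketched like so (hypothetical names; the real assign regenerates a RegionPlan each attempt, excluding the server that just failed):

```java
// Sketch of the bounded single-assign retry loop (HBASE-3263 turned the old
// recursion into a loop like this): each attempt may target a different server.
public class BoundedAssign {
    static int attemptsUsed;

    // stands in for the open rpc; pretend only "rs-good" is up
    static boolean tryOpenOn(String server) {
        attemptsUsed++;
        return server.equals("rs-good");
    }

    static boolean assign(String[] candidateServers, int maxAttempts) {
        for (int i = 0; i < maxAttempts && i < candidateServers.length; i++) {
            if (tryOpenOn(candidateServers[i])) return true; // opened
            // failed: next iteration is effectively a new plan (a different server)
        }
        return false; // gave up; the timeout monitor or SSH must recover
    }

    public static void main(String[] args) {
        System.out.println(assign(new String[]{"rs-dead", "rs-good"}, 10)); // true
        System.out.println(attemptsUsed); // 2
    }
}
```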

          chunhui shen added a comment -

          The latest patch -> HBASE-6060_trunk_5.patch is easy to understand, nice!
          But I have one doubt:

           // We have sent the open rpc to the region server.  We hold this state until regionserver
          +      // assumes control of the region by setting znode to OPENING from OFFLINE (This is an
          +      // in-memory state only with no corresponding znode state
          +      PENDING_OPEN,
          +
          +      // Set by zk callback when regionserver moves znode from OFFLINE to OPENING.  Regionserver now
          +      // has control of the region.
          +      OPENING,
          

          Since we change the sense of PENDING_OPEN and OPENING,
          should we make the corresponding change?

          e.g. do
          state.update(RegionState.State.PENDING_OPEN, System.currentTimeMillis(),
          plan.getDestination());
          only after doing
          serverManager.sendRegionOpen

          Pardon me if wrong, thanks.
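          The reordering being suggested can be sketched as follows. This is a hedged, illustrative toy, not HBase code: `MiniMaster`, its nested `State` enum, and the `Predicate`-based stand-in for `ServerManager.sendRegionOpen` are all hypothetical names used only to show "mark PENDING_OPEN only after the open rpc actually went out".

```java
// Toy sketch of the ordering suggestion: the master's in-memory state
// should only claim PENDING_OPEN once sendRegionOpen has succeeded, so
// the state never records an rpc that was never delivered.
public class MiniMaster {
  public enum State { OFFLINE, PENDING_OPEN, OPENING, OPENED }

  private State state = State.OFFLINE;

  // Stand-in for ServerManager.sendRegionOpen; returns false if the rpc
  // could not be delivered (e.g. the target regionserver is already dead).
  private final java.util.function.Predicate<String> sendRegionOpen;

  public MiniMaster(java.util.function.Predicate<String> sendRegionOpen) {
    this.sendRegionOpen = sendRegionOpen;
  }

  public State assign(String destination) {
    if (sendRegionOpen.test(destination)) {
      // Only now do we know the regionserver received the open rpc.
      state = State.PENDING_OPEN;
    }
    return state;
  }
}
```

With the old ordering, a dead destination server would leave the master believing the region was PENDING_OPEN; here it simply stays OFFLINE and can be replanned.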

          rajeshbabu added a comment -

          @chunhui
          Sorry I forgot to add them in HBASE-6060_trunk_5.patch but available in HBASE-6060_trunk_4.patch.
          Still there is one open problem with this solution. You can find it here:
          https://issues.apache.org/jira/browse/HBASE-6060?focusedCommentId=13293382&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13293382

          Thanks.

          ramkrishna.s.vasudevan added a comment -

          @Stack
          So for the latest version of the patch, can we move the step where the node is changed from OFFLINE to OPENING and leave the remaining part in the OpenRegionHandler?

          stack added a comment -

          @Ram Yes.

          I put something for discussion up in RB: https://reviews.apache.org/r/5781/ If you and/or Rajesh get a chance, see what you think.

          There are a million holes still as I see it in particular around bulk assign, etc. (See the TODO list).

          Ted Yu added a comment -

          I only see a Review button at the above URL - maybe it is my review board account that is malfunctioning.

          Lars Hofhansl added a comment -

          It seems we are a bit away from a fix. Can this go into 0.94.2?

          stack added a comment -

          @Lars Yes.

          @Ted My fault. See https://reviews.apache.org/r/5796/ instead.

          rajeshbabu added a comment -

          @Stack
          In HRegionServer.openRegion(), if the zk transition from OFFLINE to OPENING fails, we need to remove the region from regionsInTransitionInRS of that RS, as in OpenRegionHandler. Otherwise we end up with RegionAlreadyInTransitionException if we call assign on the same region (in bulk assign we will call assign in case of FAILED_OPENING).
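          The cleanup being asked for can be illustrated with a minimal sketch. Assumptions are labeled: `MiniRegionServer`, the boolean `zkTransitionSucceeds` parameter, and the exception message are stand-ins invented here; only the set name `regionsInTransitionInRS` and the failure mode (RegionAlreadyInTransitionException on re-assign) come from the discussion above.

```java
// Hedged sketch: if the OFFLINE -> OPENING znode transition fails inside
// openRegion, the regionserver must forget the region in its in-memory
// regionsInTransitionInRS set, or a later assign of the same region will
// be rejected as already-in-transition.
import java.util.HashSet;
import java.util.Set;

public class MiniRegionServer {
  private final Set<String> regionsInTransitionInRS = new HashSet<>();

  /** Returns true if the region open got as far as the OPENING znode state. */
  public boolean openRegion(String region, boolean zkTransitionSucceeds) {
    if (!regionsInTransitionInRS.add(region)) {
      throw new IllegalStateException("RegionAlreadyInTransition: " + region);
    }
    if (!zkTransitionSucceeds) {
      // Critical cleanup: without this remove(), a retried assign of the
      // same region would hit the already-in-transition check above.
      regionsInTransitionInRS.remove(region);
      return false;
    }
    return true;
  }

  public boolean inTransition(String region) {
    return regionsInTransitionInRS.contains(region);
  }
}
```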

          stack added a comment -

          @Ram Good one. Will refresh the patch up in RB. Otherwise, do you think this approach will work? What are your thoughts on the bulk assign problem? Or on the variety of ways in which we can force a region OFFLINE?

          stack added a comment -

          I added update up on RB to address Ted and Ram comments.

          Ted Yu added a comment -

          There seems to be compilation error for Stack's patch v2:

          [ERROR] /Users/zhihyu/trunk-hbase/hbase-server/src/test/java/org/apache/hadoop/hbase/master/MockRegionServer.java:[101,0] org.apache.hadoop.hbase.master.MockRegionServer is not abstract and does not override abstract method containsKeyInRegionsInTransition(org.apache.hadoop.hbase.HRegionInfo) in org.apache.hadoop.hbase.regionserver.RegionServerServices
          [ERROR] 
          [ERROR] /Users/zhihyu/trunk-hbase/hbase-server/src/test/java/org/apache/hadoop/hbase/master/MockRegionServer.java:[326,2] method does not override or implement a method from a supertype
          [ERROR] 
          [ERROR] /Users/zhihyu/trunk-hbase/hbase-server/src/test/java/org/apache/hadoop/hbase/util/MockRegionServerServices.java:[45,7] org.apache.hadoop.hbase.util.MockRegionServerServices is not abstract and does not override abstract method containsKeyInRegionsInTransition(org.apache.hadoop.hbase.HRegionInfo) in org.apache.hadoop.hbase.regionserver.RegionServerServices
          [ERROR] 
          [ERROR] /Users/zhihyu/trunk-hbase/hbase-server/src/test/java/org/apache/hadoop/hbase/util/MockRegionServerServices.java:[92,2] method does not override or implement a method from a supertype
          
          stack added a comment -

          I can fix that.

          But I just thought of something: the way I'm testing for a region in transition over on the RS (the stuff I added to address Ram's feedback) is wrong. I can actually bump aside an ongoing RIT. Will fix in my morning. Please do not review v2 yet, Ram.

          stack added a comment -

          v3 should be good. I posted it up on RB. Addresses Ted and Ram's issues.

          ramkrishna.s.vasudevan added a comment -

          @Stack
          I am still reviewing this. Tomorrow I will give all my comments once I reach the office. Just need to look at a few things carefully. Thanks, Stack.

          ramkrishna.s.vasudevan added a comment -

          @Stack
          Some comments on RB.

          stack added a comment -

          Responded Ram.

          ramkrishna.s.vasudevan added a comment -

          @Stack
          Ok with the approach, Stack. A few cases may still be hiding out there.
          Nice of you, Stack.

          ramkrishna.s.vasudevan added a comment -

          Do we need to test the perf because now the transition from OFFLINE to OPENING is done in the main thread?

          stack added a comment -

          Do we need to test the perf because now the transition from OFFLINE to OPENING is done in the main thread?

          Yeah, it's going to take longer for the rpc to complete, though we are doing the same amount of 'work' opening a region. I'm not worried about the single-region open case; I'd think we will add a few ticks. But bulk opens will be interesting. I've not made the change there yet. Those try to be async, so maybe it'll be ok there... will see.

          What do you think?

          stack added a comment -

          This won't be done for 0.94.1 is my guess. Moving it out.

          Jimmy Xiang added a comment -

          @Rajesh, could you update your patch since HBASE-6272 is in, and put the diff on RB?

          stack added a comment -

          @Jimmy I did the last patch. I think you should review it for 'merit'. If you think it's worth it, I'll update it to fit HBASE-6272.

          Jimmy Xiang added a comment -

          @Stack, no problem, doesn't have to.

          Jimmy Xiang added a comment -

          Patch 5 looks good to me. In addition to the regions assigned to the dead server, it also re-assigns those regions in transition with destination server = the dead server.
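          The selection rule being described can be sketched as a small filter. This is a hedged illustration only: `ShutdownReassigner`, the map shapes, and the method name `regionsToReassign` are invented here; the actual patch works against AssignmentManager's internal structures, not plain maps.

```java
// Rough sketch of the shutdown-handler selection rule: on server death,
// reassign both (a) regions currently assigned to the dead server and
// (b) regions in transition whose in-flight plan targets the dead server.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ShutdownReassigner {
  /**
   * @param assigned        region -> server currently hosting it
   * @param ritDestinations region -> destination server of its in-flight plan
   */
  public static List<String> regionsToReassign(
      Map<String, String> assigned, Map<String, String> ritDestinations,
      String deadServer) {
    List<String> result = new ArrayList<>();
    for (Map.Entry<String, String> e : assigned.entrySet()) {
      if (deadServer.equals(e.getValue())) result.add(e.getKey());
    }
    // The extra case this issue is about: in-transition regions headed
    // for the dead server would otherwise wait on the timeout monitor.
    for (Map.Entry<String, String> e : ritDestinations.entrySet()) {
      if (deadServer.equals(e.getValue())) result.add(e.getKey());
    }
    return result;
  }
}
```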

          ramkrishna.s.vasudevan added a comment -

          @Jimmy/@Stack
          If you can give me a couple of days I can give a patch. Actually, after discussing with Rajesh, patch 5 has some things to be taken from Stack's patch up in RB. And Stack's patch is not the final one, as per the review comments over in RB.

          ramkrishna.s.vasudevan added a comment -

          I am working on an updated patch. Should be able to submit it by today or tomorrow.

          ramkrishna.s.vasudevan added a comment -

          Latest patch. This is a combined one, merging what is available in RB with the other patches submitted.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12540918/HBASE-6060_latest.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 15 new or modified tests.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The applied patch generated 5 javac compiler warnings (more than the trunk's current 4 warnings).

          -1 findbugs. The patch appears to introduce 8 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.regionserver.handler.TestOpenRegionHandler
          org.apache.hadoop.hbase.regionserver.handler.TestCloseRegionHandler
          org.apache.hadoop.hbase.master.TestDistributedLogSplitting
          org.apache.hadoop.hbase.regionserver.TestSplitLogWorker

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2568//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2568//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2568//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2568//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2568//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2568//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2568//console

          This message is automatically generated.

          ramkrishna.s.vasudevan added a comment -

          Some testcases are failing. Let me check and update the patch.

          ramkrishna.s.vasudevan added a comment -

          Updated patch. The failed testcases now pass.

          ramkrishna.s.vasudevan added a comment -

          New review board request created. https://reviews.apache.org/r/6636/

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12541080/HBASE-6060_latest.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 21 new or modified tests.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The applied patch generated 5 javac compiler warnings (more than the trunk's current 4 warnings).

          -1 findbugs. The patch appears to introduce 8 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.master.TestRollingRestart
          org.apache.hadoop.hbase.master.TestDistributedLogSplitting
          org.apache.hadoop.hbase.regionserver.handler.TestOpenRegionHandler

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2589//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2589//console

          This message is automatically generated.

          ramkrishna.s.vasudevan added a comment -

          Sorry for not updating TestOpenRegionHandler in both testcases; I had done it in only one. Sorry about that.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12541239/HBASE-6060_latest.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 21 new or modified tests.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The applied patch generated 5 javac compiler warnings (more than the trunk's current 4 warnings).

          -1 findbugs. The patch appears to introduce 8 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.master.TestDistributedLogSplitting
          org.apache.hadoop.hbase.regionserver.TestServerCustomProtocol

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2598//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2598//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2598//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2598//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2598//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2598//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2598//console

          This message is automatically generated.

          ramkrishna.s.vasudevan added a comment -

          Please review the latest patch on RB and provide your comments.

          Lars Hofhansl added a comment -

          The last patch on RB is 7 weeks old. Is that still the one to review?

          Lars Hofhansl added a comment -

          Ah. NM, there is a new review request now.

          Lars Hofhansl added a comment -

          Still more discussion than I am comfortable with for the upcoming 0.94.2.

          ramkrishna.s.vasudevan added a comment -

          @Lars
          The latest one on RB is https://reviews.apache.org/r/6636/. It was created 2 weeks back, and 2 diffs have already been added to it.
          Anyway, this is for trunk. If Stack and Jimmy are ok with it, I can surely give you a patch for 0.94.

          ramkrishna.s.vasudevan added a comment -

          @Stack
          Does Jon's recent mail thread on rewriting the AM altogether imply that the scenario mentioned here will also be covered?

          stack added a comment -

          @Ram He's not at 'total rewrite' yet (smile). What you think of that thread Ram?

          ramkrishna.s.vasudevan added a comment -

          @Stack, Jon's comments make sense. A lot of thought is needed though.
          A better FSM may be needed. We can surely try it out.

          Lars Hofhansl added a comment -

          Looking at the patches and trying to read through the comments here, I have to admit... I have NO idea what is going on.
          I think in 0.96 Jimmy is ripping apart AM, which would hopefully cover this.

          Should we attempt this at all on 0.94, or just let it go? Please let me know.

          Jimmy Xiang added a comment -

          I think we can let it go for 0.94 since timeout monitor can handle it and there is no better way to fix it, because the region state in 0.94 is not so reliable.

          For 0.96, this one is not covered yet. It still relies on timeout monitor. Let me cook up a patch for 0.96 now.

          Jimmy Xiang added a comment -

          I uploaded a simple patch for 0.96: trunk-6060.patch. Could you please review?

          stack added a comment -

          How does this patch address this issue Jimmy?

          Way back I was trying to harden who owns the region in an earlier incarnation of this patch. We had gray areas where region could be PENDING_OPEN and it was unclear if region was owned by the master or owned by the regionserver on failure, though repair was different depending on who had region control. Fixing region states is for another JIRA?

          Ted Yu added a comment -
          -    this.stamp.set(System.currentTimeMillis());
          +    setTimestamp(System.currentTimeMillis());
          

          EnvironmentEdgeManager should be used.

          It would be nice to include test(s) from Rajesh's patch(s).

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12551016/trunk-6060.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          -1 javadoc. The javadoc tool appears to have generated 85 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 3 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in .

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3158//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3158//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3158//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3158//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3158//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3158//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3158//console

          This message is automatically generated.

          Jimmy Xiang added a comment -

@Stack, here is my understanding of the problem. Master calls a rs to open a region. Now, in master memory, the region is in pending_open state with this rs' server name. Now the rs dies. When SSH starts, it goes to meta to find all the regions on this rs, minus those regions already in transition, then assigns the remaining regions. If the pending_open region (it could be opening too, depending on timing) was on this region server before, SSH will take care of it. Otherwise, if it was on a different region server, SSH will not pick it up. In this patch, I just time out the region transition so that the timeout monitor can change the state and re-assign it, instead of waiting for a long time (now 20 minutes by default).

I'd like to make sure the region states in master memory are reliable. Otherwise, they are of not much use. So I think master always has region control.
In 0.96, I think region states are very reliable now. Of course, there could be bugs I am not aware of yet.

          @Ted, good point. I will include the test. For EnvironmentEdgeManager, I will leave it to another jira.
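The timeout idea Jimmy describes can be sketched in isolation. This is a hypothetical illustration (the class and method names below are illustrative, not the real AssignmentManager/SSH API): on server death, rewind the RegionState's transition timestamp so the TimeoutMonitor's next periodic pass treats the region as expired and re-assigns it, instead of waiting out the full timeout.

```java
import java.util.concurrent.TimeUnit;

public class TimeoutSketch {
    enum State { PENDING_OPEN, OPENING, OPEN }

    static class RegionState {
        State state;
        long stamp;          // time of the last state transition, in millis
        String serverName;   // server the region was sent to

        RegionState(State state, long stamp, String serverName) {
            this.state = state;
            this.stamp = stamp;
            this.serverName = serverName;
        }
    }

    // 30 minutes, the default hbase.master.assignment.timeoutmonitor.timeout
    // mentioned in this issue's description.
    static final long TIMEOUT_MS = TimeUnit.MINUTES.toMillis(30);

    // What one TimeoutMonitor pass checks for each region in transition.
    static boolean isTimedOut(RegionState rs, long now) {
        return now - rs.stamp > TIMEOUT_MS;
    }

    // What SSH could do on server death: expire the transition immediately.
    static void expireTransition(RegionState rs, long now) {
        rs.stamp = now - TIMEOUT_MS - 1;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        RegionState stuck = new RegionState(State.OPENING, now, "rs1,60020,1");

        System.out.println("before: " + isTimedOut(stuck, now));  // false
        expireTransition(stuck, now);  // the hosting server died
        System.out.println("after:  " + isTimedOut(stuck, now));  // true
    }
}
```

The region state itself is untouched; only the stamp changes, so the normal TimeoutMonitor path does the actual re-assignment.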

          Jimmy Xiang added a comment -

          Added a test, and uploaded v2: trunk-6060_v2.patch.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12551096/trunk-6060_v2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          -1 javadoc. The javadoc tool appears to have generated 85 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 3 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.client.TestFromClientSideWithCoprocessor
          org.apache.hadoop.hbase.io.hfile.TestForceCacheImportantBlocks

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3165//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3165//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3165//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3165//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3165//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3165//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3165//console

          This message is automatically generated.

          rajeshbabu added a comment -

          @Jimmy,
Patch looks good and it properly reassigns the regions transitioning (in PENDING_OPEN/OPENING state) to the dead server by TM, and also avoids double assignments.
But the regions which are in OPEN state may not be assigned.

                    RegionState rit = services.getAssignmentManager()
                      .getRegionStates().getRegionTransitionState(e.getKey());
                    if (processDeadRegion(e.getKey(), e.getValue(),
                        this.services.getAssignmentManager(),
                        this.server.getCatalogTracker())) {
                      ServerName addressFromAM = this.services.getAssignmentManager()
                        .getRegionStates().getRegionServerOfRegion(e.getKey());
                      if (rit != null && !rit.isClosing() && !rit.isPendingClose() && !rit.isSplitting()) {
                        // Skip regions that were in transition unless CLOSING or
                        // PENDING_CLOSE
                        LOG.info("Skip assigning region " + rit.toString());
          

If the region state is OPEN and the server in the RIT is the same as the dead server, we can add it to toAssignRegions to assign.

-> Before forcing a region state to offline, we check the current state, and if it is other than CLOSED or OFFLINE (forceNewPlan is true in case of an assign call from TM), we call unassign on the region.
If the server in the RIT is dead, we delete the znode if it is in M_ZK_REGION_CLOSING or RS_ZK_REGION_CLOSED, but in the PENDING_OPEN/OPENING/OPEN case the znode state won't be either. Maybe we can avoid two network calls to ZooKeeper here by having a region state check.

              if (!serverManager.isServerOnline(server)) {
                // delete the node. if no node exists need not bother.
                deleteClosingOrClosedNode(region);
                regionOffline(region);
                return;
              }
          

          Please correct me if wrong. Great work Jimmy.
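The skip/reassign decision under discussion can be sketched as follows. This is an illustrative simplification (not the real ServerShutdownHandler code): a region in transition is re-assigned only when its transition is tied to the dead server, while CLOSING/PENDING_CLOSE/SPLITTING regions are left to their own handlers.

```java
public class SshSketch {
    enum State { PENDING_OPEN, OPENING, OPEN, CLOSING, PENDING_CLOSE, SPLITTING }

    // ritServer: server recorded in the region-in-transition state;
    // deadServer: the server SSH is cleaning up after.
    static boolean shouldReassign(State ritState, String ritServer, String deadServer) {
        switch (ritState) {
            case CLOSING:
            case PENDING_CLOSE:
            case SPLITTING:
                // Already handled by the close/split paths; skip here.
                return false;
            default:
                // PENDING_OPEN / OPENING / OPEN: re-assign only if the
                // transition targets (or the region sits on) the dead server.
                return deadServer.equals(ritServer);
        }
    }

    public static void main(String[] args) {
        String dead = "rs1,60020,1";
        System.out.println(shouldReassign(State.OPENING, dead, dead));          // true
        System.out.println(shouldReassign(State.OPEN, dead, dead));             // true
        System.out.println(shouldReassign(State.OPENING, "rs2,60020,1", dead)); // false
        System.out.println(shouldReassign(State.CLOSING, dead, dead));          // false
    }
}
```

Including OPEN in the default branch is exactly the gap Rajesh points out: without it, a region already marked OPEN on the dead server would be skipped.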

          ramkrishna.s.vasudevan added a comment -

          Good review Rajesh.
          Yes the OPEN thing seems to be handled. Also we have the code guarding

          else if (addressFromAM != null
                          && !addressFromAM.equals(this.serverName)) {
                        LOG.debug("Skip assigning region "
                              + e.getKey().getRegionNameAsString()
                              + " because it has been opened in "
                              + addressFromAM.getServerName());
          

          if the server name got changed just after OPEN got processed. This should be fine.

          Jimmy Xiang added a comment -

          Good review. Thanks. I posted a new patch on RB: https://reviews.apache.org/r/7767/

In the new patch, I re-organized the SSH code a little bit, and handled regions in OPEN state too.
It also handles regions in OPENING/PENDING_OPEN/OFFLINE state that were open on this region
server before. This could happen when the cluster starts up.

          Lars Hofhansl added a comment -

          What's the feeling w.r.t. 0.92 and 0.94?

          Jimmy Xiang added a comment -

          @Rajesh, are you ok with the latest patch on RB for trunk?

          @Lars, I am fine with letting it go with 0.92 and 0.94. My patch won't work well in 0.92/0.94 since it depends on other patches which are too big to backport.

          Lars Hofhansl added a comment -

          Removed from 0.92 and 0.94.

          rajeshbabu added a comment -

          @Jimmy
          Its good. +1

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12551735/trunk-6060_v3.3.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          -1 javadoc. The javadoc tool appears to have generated 85 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 3 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.client.TestFromClientSideWithCoprocessor

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3210//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3210//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3210//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3210//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3210//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3210//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3210//console

          This message is automatically generated.

          Jimmy Xiang added a comment -

          Integrated into trunk. Thanks all.

          Hudson added a comment -

          Integrated in HBase-TRUNK #3506 (See https://builds.apache.org/job/HBase-TRUNK/3506/)
          HBASE-6060 Regions's in OPENING state from failed regionservers takes a long time to recover (Revision 1404759)

          Result = FAILURE
          jxiang :
          Files :

          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionState.java
          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionStates.java
          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
          • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
          Hudson added a comment -

          Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #247 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/247/)
          HBASE-6060 Regions's in OPENING state from failed regionservers takes a long time to recover (Revision 1404759)

          Result = FAILURE
          jxiang :
          Files :

          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionState.java
          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionStates.java
          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
          • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
          Sergey Shelukhin added a comment -

          I can easily repro this on 0.94.

          Sergey Shelukhin added a comment -

          oh, didn't read all the comments, nevermind

          Hudson added a comment -

          Integrated in HBase-0.94 #839 (See https://builds.apache.org/job/HBase-0.94/839/)
          HBASE-7521 fix HBASE-6060 (regions stuck in opening state) in 0.94 (Sergey and Rajeshbabu) (Revision 1445350)

          Result = FAILURE
          tedyu :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/RegionServerServices.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/Mocking.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestZKBasedOpenCloseRegion.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestCloseRegionHandler.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestOpenRegionHandler.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/util/MockRegionServerServices.java
          Hudson added a comment -

          Integrated in HBase-0.94-security #110 (See https://builds.apache.org/job/HBase-0.94-security/110/)
          HBASE-7521 fix HBASE-6060 (regions stuck in opening state) in 0.94 (Sergey and Rajeshbabu) (Revision 1445350)

          Result = SUCCESS
          tedyu :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/RegionServerServices.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/Mocking.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestZKBasedOpenCloseRegion.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestCloseRegionHandler.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestOpenRegionHandler.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/util/MockRegionServerServices.java
          Hudson added a comment -

          Integrated in HBase-0.94-security-on-Hadoop-23 #12 (See https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/12/)
          HBASE-7521 fix HBASE-6060 (regions stuck in opening state) in 0.94 (Sergey and Rajeshbabu) (Revision 1445350)

          Result = FAILURE
          tedyu :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/RegionServerServices.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/Mocking.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestZKBasedOpenCloseRegion.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestCloseRegionHandler.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestOpenRegionHandler.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/util/MockRegionServerServices.java
          stack added a comment -

          Marking closed.


          People

          • Assignee: Jimmy Xiang
          • Reporter: Enis Soztutar
          • Votes: 0
          • Watchers: 21