HBase
  1. HBase
  2. HBASE-7521

fix HBASE-6060 (regions stuck in opening state) in 0.94

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.94.6
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Discussion in HBASE-6060 implies that the fix there does not work on 0.94. Still, we may want to fix the issue in 0.94 (via some different fix) because the regions stuck in opening for ridiculous amounts of time is not a good thing to have.

      1. HBASE-7521-v5.patch
        62 kB
        Sergey Shelukhin
      2. HBASE-7521-v1.patch
        12 kB
        Sergey Shelukhin
      3. HBASE-7521-v0.patch
        6 kB
        Sergey Shelukhin
      4. HBASE-7521-original-patch-ported-v3.patch
        58 kB
        Sergey Shelukhin
      5. HBASE-7521-original-patch-ported-v2.patch
        58 kB
        Sergey Shelukhin
      6. HBASE-7521-original-patch-ported-v1.patch
        58 kB
        Sergey Shelukhin
      7. HBASE-7521-original-patch-ported-v0.patch
        58 kB
        Sergey Shelukhin
      8. 7521-original-patch-ported-v4.patch
        59 kB
        Ted Yu

        Activity

        Hide
        Sergey Shelukhin added a comment -

        I have a patch, let me run it on cluster and post if it works...

        Show
        Sergey Shelukhin added a comment - I have a patch, let me run it on cluster and post if it works...
        Hide
        Sergey Shelukhin added a comment -

        Here's the patch. In artifically created scenario (move region to server and kill it) it recovers the region successfully.

        Show
        Sergey Shelukhin added a comment - Here's the patch. In artifically created scenario (move region to server and kill it) it recovers the region successfully.
        Hide
        Sergey Shelukhin added a comment -

        Nicolas Liochon Hi. Do I understand correctly that you are the AssignmentManager expert?

        Show
        Sergey Shelukhin added a comment - Nicolas Liochon Hi. Do I understand correctly that you are the AssignmentManager expert?
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12564045/HBASE-7521-v0.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3954//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12564045/HBASE-7521-v0.patch against trunk revision . +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 patch . The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3954//console This message is automatically generated.
        Hide
        Ted Yu added a comment -

        Can you add tests ?
        Take a look at testAssignThroughBalancerOrFailedOpenAndSSHInParallel and testSSHWhenSourceRSandDestRSInRegionPlanGoneDown in https://issues.apache.org/jira/secure/attachment/12530962/6060-trunk_3.patch

        Show
        Ted Yu added a comment - Can you add tests ? Take a look at testAssignThroughBalancerOrFailedOpenAndSSHInParallel and testSSHWhenSourceRSandDestRSInRegionPlanGoneDown in https://issues.apache.org/jira/secure/attachment/12530962/6060-trunk_3.patch
        Hide
        Sergey Shelukhin added a comment -

        Copied one of the tests in modified form.
        The other test seems to be for a different bug specific to 0.94 (HBASE-5816) that I do not address here.
        The patch only hijacks regions in pending-open/opening, so as far as I see it should be no less safe than having the timeout callback hijack them (because balancer will not balance these regions as they are not assigned to any server, there's no potential for conflict with it).

        Show
        Sergey Shelukhin added a comment - Copied one of the tests in modified form. The other test seems to be for a different bug specific to 0.94 ( HBASE-5816 ) that I do not address here. The patch only hijacks regions in pending-open/opening, so as far as I see it should be no less safe than having the timeout callback hijack them (because balancer will not balance these regions as they are not assigned to any server, there's no potential for conflict with it).
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12564078/HBASE-7521-v1.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3955//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12564078/HBASE-7521-v1.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 3 new or modified tests. -1 patch . The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3955//console This message is automatically generated.
        Hide
        ramkrishna.s.vasudevan added a comment -

        @Sergey
        Before going thro the patch, just wanted to tell one thing that, the initial patches in HBASE-6060 was something which could be directly applied on 0.94. But due to discussions about the usage of RegionPlans and RITs, decided to go with some new approach for the trunk version. Anyway i will go thro this patch today and let you know. Thanks Sergey.
        And surely this needs to be fixed in 0.94 and 0.92. Otherwise this is a major open area in AM in these branches.

        Show
        ramkrishna.s.vasudevan added a comment - @Sergey Before going thro the patch, just wanted to tell one thing that, the initial patches in HBASE-6060 was something which could be directly applied on 0.94. But due to discussions about the usage of RegionPlans and RITs, decided to go with some new approach for the trunk version. Anyway i will go thro this patch today and let you know. Thanks Sergey. And surely this needs to be fixed in 0.94 and 0.92. Otherwise this is a major open area in AM in these branches.
        Hide
        ramkrishna.s.vasudevan added a comment -

        @Sergey
        I had a glance at the patch. I think the SSH and the retry logic in assign on seeing an RS is down will lead to race conditions which is handled in the patches in HBASE-6060.
        RAJESHBABU CHINTAGUNTLA
        Wanna take a look at this? I know you have a version of this running in your cluster for 0.94.

        Show
        ramkrishna.s.vasudevan added a comment - @Sergey I had a glance at the patch. I think the SSH and the retry logic in assign on seeing an RS is down will lead to race conditions which is handled in the patches in HBASE-6060 . RAJESHBABU CHINTAGUNTLA Wanna take a look at this? I know you have a version of this running in your cluster for 0.94.
        Hide
        Sergey Shelukhin added a comment -

        Can you please elaborate on race conditions? Do you mean HBASE-5816?
        As far as I can see this patch preserves existing race conditions but doesn't add new ones
        Although, my experience with AM is limited, even more so in 94.

        We can try to rebase latest 094 patch from HBASE-6060 instead...

        Show
        Sergey Shelukhin added a comment - Can you please elaborate on race conditions? Do you mean HBASE-5816 ? As far as I can see this patch preserves existing race conditions but doesn't add new ones Although, my experience with AM is limited, even more so in 94. We can try to rebase latest 094 patch from HBASE-6060 instead...
        Hide
        rajeshbabu added a comment -

        Patch works fine for OPENING(HBASE-6060 patches also works fine in this case) but there are issues with PENDING_OPEN.

              for (RegionState rit : ritsNotYetOnServer) {
                if (rit.isPendingOpen() || rit.isOpening()) {
                  LOG.info("Hijacking and reassigning " + rit.getRegion().getRegionNameAsString() +
                    " that was on " + serverName + " in " + rit.getState() + " state.");
                  this.services.getAssignmentManager().assign(rit.getRegion(), true, true, true);
                }
              }
        

        Here if we see region in PENDING_OPEN on dead server we are assigning the region.
        In case of single region assign if we see a server is dead we will retry assign to some other region server.
        Main race condition can happen below as in HBASE-5816

            if (!hijack && !state.isClosed() && !state.isOffline()) {
              if (!regionAlreadyInTransitionException ) {
                String msg = "Unexpected state : " + state + " .. Cannot transit it to OFFLINE.";
                this.master.abort(msg, new IllegalStateException(msg));
                return -1;
              } 
              LOG.debug("Unexpected state : " + state
                  + " but retrying to assign because RegionAlreadyInTransitionException.");
            }
        

        One more thing is there is a possibility of double assignment also.

        In case of PENDING_OPEN we are not able to decide whether to retry or skip retry(as thinking SSH can handle) becuase there are multiple cases
        -> If RS went down after setting state to PENDING_OPEN, then SSH can assign the region(as for the patch)
        at the same time single assign also retry to assign because send open RPC will fail with connection refused exception
        Lets suppose if we skip the assign retry in case of connection refused exception then there is one problem with this approach.
        scenario is :
        1) got region plan
        2) destination server went down - ssh also processed (here we will see the region in offline state and skip assignment).
        3) change state to PENDING_OPEN
        4) then send open RPC fail with the connection refused exception and we will skip assign.

        -> If we skip the assign in SSH also problem only
        scenario is :
        1) got region plan and region in PENDING_OPEN
        2) just spawned OpenRegionHandler but didnt transition to OPENING
        3) single assign came out as thinking RS is up and it will take care
        4) RS went down
        5) now SSH also will skip the region assignment

        Show
        rajeshbabu added a comment - Patch works fine for OPENING( HBASE-6060 patches also works fine in this case) but there are issues with PENDING_OPEN. for (RegionState rit : ritsNotYetOnServer) { if (rit.isPendingOpen() || rit.isOpening()) { LOG.info( "Hijacking and reassigning " + rit.getRegion().getRegionNameAsString() + " that was on " + serverName + " in " + rit.getState() + " state." ); this .services.getAssignmentManager().assign(rit.getRegion(), true , true , true ); } } Here if we see region in PENDING_OPEN on dead server we are assigning the region. In case of single region assign if we see a server is dead we will retry assign to some other region server. Main race condition can happen below as in HBASE-5816 if (!hijack && !state.isClosed() && !state.isOffline()) { if (!regionAlreadyInTransitionException ) { String msg = "Unexpected state : " + state + " .. Cannot transit it to OFFLINE." ; this .master.abort(msg, new IllegalStateException(msg)); return -1; } LOG.debug( "Unexpected state : " + state + " but retrying to assign because RegionAlreadyInTransitionException." ); } One more thing is there is a possibility of double assignment also. In case of PENDING_OPEN we are not able to decide whether to retry or skip retry(as thinking SSH can handle) becuase there are multiple cases -> If RS went down after setting state to PENDING_OPEN, then SSH can assign the region(as for the patch) at the same time single assign also retry to assign because send open RPC will fail with connection refused exception Lets suppose if we skip the assign retry in case of connection refused exception then there is one problem with this approach. scenario is : 1) got region plan 2) destination server went down - ssh also processed (here we will see the region in offline state and skip assignment). 3) change state to PENDING_OPEN 4) then send open RPC fail with the connection refused exception and we will skip assign. -> If we skip the assign in SSH also problem only scenario is : 1) got region plan and region in PENDING_OPEN 2) just spawned OpenRegionHandler but didnt transition to OPENING 3) single assign came out as thinking RS is up and it will take care 4) RS went down 5) now SSH also will skip the region assignment
        Hide
        ramkrishna.s.vasudevan added a comment -

        Thanks Rajesh.
        @Sergey,
        The above condition of assign retry and SSH also trying to assign is what i was mentioning. Anyway Rajesh has explained with more details.

        Show
        ramkrishna.s.vasudevan added a comment - Thanks Rajesh. @Sergey, The above condition of assign retry and SSH also trying to assign is what i was mentioning. Anyway Rajesh has explained with more details.
        Hide
        Sergey Shelukhin added a comment -

        Hmm, I see. For the former, it seems like hijack flag should take care of that as in case with timeout. I'll look into the 2nd one.

        Show
        Sergey Shelukhin added a comment - Hmm, I see. For the former, it seems like hijack flag should take care of that as in case with timeout. I'll look into the 2nd one.
        Hide
        Sergey Shelukhin added a comment -

        Hi. I've read the discussion on 6060... can participants please comment if it makes sense to backport final patch (from August) to 0.94?
        The discussion is very long and covers many scenarios, and patch was not committed.
        Does it fix the original issue/is it still relevant?

        Show
        Sergey Shelukhin added a comment - Hi. I've read the discussion on 6060... can participants please comment if it makes sense to backport final patch (from August) to 0.94? The discussion is very long and covers many scenarios, and patch was not committed. Does it fix the original issue/is it still relevant?
        Hide
        Sergey Shelukhin added a comment -

        Alternatively I can add multi-assign prevention to cover the 2nd case/similar race conditions.

        Show
        Sergey Shelukhin added a comment - Alternatively I can add multi-assign prevention to cover the 2nd case/similar race conditions.
        Hide
        Sergey Shelukhin added a comment -

        Ping?

        Show
        Sergey Shelukhin added a comment - Ping?
        Hide
        Devaraj Das added a comment -

        Hey Sergey Shelukhin, can you please post the relevant patch from HBASE-6060 here to start the discussion from (and make sure it applies, etc.). Thanks!

        Show
        Devaraj Das added a comment - Hey Sergey Shelukhin , can you please post the relevant patch from HBASE-6060 here to start the discussion from (and make sure it applies, etc.). Thanks!
        Hide
        Sergey Shelukhin added a comment -

        Will try to rebase latest trunk patch from August. I got it around halfway done, looks like a lot changed around AM at that time, will see if it's viable.

        Show
        Sergey Shelukhin added a comment - Will try to rebase latest trunk patch from August. I got it around halfway done, looks like a lot changed around AM at that time, will see if it's viable.
        Hide
        Sergey Shelukhin added a comment -

        Ported the original patch. Some parts needed to be changed but mostly mechanical. Small and some medium (assignment manager, open/close) tests that I tried appear to pass. I will run all tests and report results here

        Show
        Sergey Shelukhin added a comment - Ported the original patch. Some parts needed to be changed but mostly mechanical. Small and some medium (assignment manager, open/close) tests that I tried appear to pass. I will run all tests and report results here
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12565619/HBASE-7521-original-patch-ported-v0.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 18 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4097//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12565619/HBASE-7521-original-patch-ported-v0.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 18 new or modified tests. -1 patch . The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4097//console This message is automatically generated.
        Hide
        Sergey Shelukhin added a comment -

        mvn tests pass... I'll try to verify the fix using integration test.

        Show
        Sergey Shelukhin added a comment - mvn tests pass... I'll try to verify the fix using integration test.
        Hide
        Sergey Shelukhin added a comment -

        Integration test that I tweaked for the purpose appears to pass and picks up the stuck regions, judging by refreshing the master page repeatedly. Code review anyone? Thanks.

        Show
        Sergey Shelukhin added a comment - Integration test that I tweaked for the purpose appears to pass and picks up the stuck regions, judging by refreshing the master page repeatedly. Code review anyone? Thanks.
        Hide
        ramkrishna.s.vasudevan added a comment -

        rajeshbabu
        Could you take a look at this patch with the code base that you have. It will be easy to see if anything is missed out because Sergey has rebased the earlier patches that we had posted that time.
        I wil also have a look at this tomorrow during day time.

        Show
        ramkrishna.s.vasudevan added a comment - rajeshbabu Could you take a look at this patch with the code base that you have. It will be easy to see if anything is missed out because Sergey has rebased the earlier patches that we had posted that time. I wil also have a look at this tomorrow during day time.
        Hide
        ramkrishna.s.vasudevan added a comment -

        @Sergey
        Went thro the patch. It seems fine wrt to the changes taken up from the old patch.
        Just after looking into the patch now i feel may be we can rename some variables like regionsFromRIT, regionsIntersection. Is it possible to rename so that the name suggests what they are?
        Overall looks fine.
        One question is did you see what is the impact on now making the RS to transition the node to OPENING? Thanks Sergey

        Show
        ramkrishna.s.vasudevan added a comment - @Sergey Went thro the patch. It seems fine wrt to the changes taken up from the old patch. Just after looking into the patch now i feel may be we can rename some variables like regionsFromRIT, regionsIntersection. Is it possible to rename so that the name suggests what they are? Overall looks fine. One question is did you see what is the impact on now making the RS to transition the node to OPENING? Thanks Sergey
        Hide
        Sergey Shelukhin added a comment -

        rename some variables like regionsFromRIT, regionsIntersection.

        sure.

        what is the impact on now making the RS to transition the node to OPENING? Thanks Sergey

        Can you please clarify?

        Show
        Sergey Shelukhin added a comment - rename some variables like regionsFromRIT, regionsIntersection. sure. what is the impact on now making the RS to transition the node to OPENING? Thanks Sergey Can you please clarify?
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12566837/HBASE-7521-original-patch-ported-v1.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 18 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4219//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12566837/HBASE-7521-original-patch-ported-v1.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 18 new or modified tests. -1 patch . The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4219//console This message is automatically generated.
        Hide
        Ted Yu added a comment -
        +   * @return Pair that has all regionplans that pertain to this dead server and a list that has
        ...
        +  public Pair<Set<HRegionInfo>, List<RegionState>> processServerShutdown(final ServerName sn) {
        

        Better use term other than regionplan because the Set contains HRegionInfo's.

        +   * @param deadServerRITs Regions that were in transition against the dead server
        +   * @return List of regions to assign or null if aborting.
        +   * @throws IOException
        +   */
        +  private List<HRegionInfo> getRegionsToAssign(final NavigableMap<HRegionInfo, Result> metaHRIs,
        +      final List<RegionState> deadServerRITs, Set<HRegionInfo> regionsFromRIT) throws IOException {
        

        Please add javadoc for the new param - regionsFromRIT.

        +    // Collect regions to bulk assign here.
        

        The regions identified by getRegionsToAssign() are not bulk assigned - see line 298.
        The comments below the above should be modified as well.

        Show
        Ted Yu added a comment - + * @ return Pair that has all regionplans that pertain to this dead server and a list that has ... + public Pair<Set<HRegionInfo>, List<RegionState>> processServerShutdown( final ServerName sn) { Better use term other than regionplan because the Set contains HRegionInfo's. + * @param deadServerRITs Regions that were in transition against the dead server + * @ return List of regions to assign or null if aborting. + * @ throws IOException + */ + private List<HRegionInfo> getRegionsToAssign( final NavigableMap<HRegionInfo, Result> metaHRIs, + final List<RegionState> deadServerRITs, Set<HRegionInfo> regionsFromRIT) throws IOException { Please add javadoc for the new param - regionsFromRIT. + // Collect regions to bulk assign here. The regions identified by getRegionsToAssign() are not bulk assigned - see line 298. The comments below the above should be modified as well.
        Hide
        ramkrishna.s.vasudevan added a comment -

        what is the impact on now making the RS to transition the node to OPENING? Thanks Sergey

        Now the main RPC thread for asking the RS to open the region is doing an additional work of transitioning then node to OPENING and then calling the OpenRegionHandlerThread.
        This should have an impact right?
        But the above change is needed if the RS goes down before changing the state to PENDING_OPEN as said in HBASE-6060.

        Show
        ramkrishna.s.vasudevan added a comment - what is the impact on now making the RS to transition the node to OPENING? Thanks Sergey Now the main RPC thread for asking the RS to open the region is doing an additional work of transitioning then node to OPENING and then calling the OpenRegionHandlerThread. This should have an impact right? But the above change is needed if the RS goes down before changing the state to PENDING_OPEN as said in HBASE-6060 .
        Hide
        Sergey Shelukhin added a comment -

        Sorry, what I don't understand is what you mean by impact? I changes the logic... you mean in a sense of racing with some other code?

        Show
        Sergey Shelukhin added a comment - Sorry, what I don't understand is what you mean by impact? I changes the logic... you mean in a sense of racing with some other code?
        Hide
        ramkrishna.s.vasudevan added a comment -

        Impact in the sense
        Performance wise if any.
        Any possibility of race conditions. I remember we were able to cover all. But did not do intense testing with this code change as we had a slightly modified internal version which is there in HBASE-6060 as earlier patches.

        Show
        ramkrishna.s.vasudevan added a comment - Impact in the sense Performance wise if any. Any possibility of race conditions. I remember we were able to cover all. But did not do intense testing with this code change as we had a slightly modified internal version which is there in HBASE-6060 as earlier patches.
        Hide
        Lars Hofhansl added a comment -

        Sergey Shelukhin So you're proposing the ported patches?

        Show
        Lars Hofhansl added a comment - Sergey Shelukhin So you're proposing the ported patches?
        Hide
        Lars Hofhansl added a comment -

        That looks like a hard to verify change (in terms of correctness).
        How are we making sure that this not introducing new problems?

        Seems to me this is a better candidate for the next train (0.94.6)
        But I'm not the decider... If you guys want this in 0.94.5 I think we need to build some confidence that this change is correct.
        At the very least someone should run the entire test suite a few times over night with this patch.

        Show
        Lars Hofhansl added a comment - That looks like a hard to verify change (in terms of correctness). How are we making sure that this not introducing new problems? Seems to me this is a better candidate for the next train (0.94.6) But I'm not the decider... If you guys want this in 0.94.5 I think we need to build some confidence that this change is correct. At the very least someone should run the entire test suite a few times over night with this patch.
        Hide
        ramkrishna.s.vasudevan added a comment -

        @Lars
        Personally i feel this fix should go as soon as possible. Infact this itself is quite a delay knowing that we have a whole in AM in this case.
        So i feel we can get this in this version.
        As suggested surely good testing is needed. Also i need to add that am not in a situation to do it also.
        Just saying what i feel.

        Show
        ramkrishna.s.vasudevan added a comment - @Lars Personally i feel this fix should go as soon as possible. Infact this itself is quite a delay knowing that we have a whole in AM in this case. So i feel we can get this in this version. As suggested surely good testing is needed. Also i need to add that am not in a situation to do it also. Just saying what i feel.
        Hide
        Lars Hofhansl added a comment -

        Enis Soztutar and Sergey Shelukhin, do you guys have some bandwidth to run this through your grindmill?
        ramkrishna.s.vasudevan You guys have run a version similar to this (earlier version of HBASE-6060) in production?

        Show
        Lars Hofhansl added a comment - Enis Soztutar and Sergey Shelukhin , do you guys have some bandwidth to run this through your grindmill? ramkrishna.s.vasudevan You guys have run a version similar to this (earlier version of HBASE-6060 ) in production?
        Hide
        Enis Soztutar added a comment -

        Since this original issue HBASE-6060 was caught with the integration tests, it is easily reproducible. I think we can at least run those to see that it fixes the problem. Apart from that, I think we can spare some time on running the full tests, and the whole integration test suite tomorrow.

        Show
        Enis Soztutar added a comment - Since this original issue HBASE-6060 was caught with the integration tests, it is easily reproducible. I think we can at least run those to see that it fixes the problem. Apart from that, I think we can spare some time on running the full tests, and the whole integration test suite tomorrow.
        Hide
        Lars Hofhansl added a comment -

        Cool. Maybe I shoot for pulling this in tomorrow evening and cutting the first RC then. Thanks!

        Show
        Lars Hofhansl added a comment - Cool. Maybe I shoot for pulling this in tomorrow evening and cutting the first RC then. Thanks!
        Hide
        Sergey Shelukhin added a comment -

        I've run the integration tests to verify... I can run more

        Show
        Sergey Shelukhin added a comment - I've run the integration tests to verify... I can run more
        Hide
        Sergey Shelukhin added a comment -

        Addressed Ted's feedback, renamed more variables

        Show
        Sergey Shelukhin added a comment - Addressed Ted's feedback, renamed more variables
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12567220/HBASE-7521-original-patch-ported-v2.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 18 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4261//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12567220/HBASE-7521-original-patch-ported-v2.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 18 new or modified tests. -1 patch . The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4261//console This message is automatically generated.
        Hide
        ramkrishna.s.vasudevan added a comment -

        You guys have run a version similar to this (earlier version of HBASE-6060) in production?

        Yes but i should say it was my previous employer. It should be still running.

        Show
        ramkrishna.s.vasudevan added a comment - You guys have run a version similar to this (earlier version of HBASE-6060 ) in production? Yes but i should say it was my previous employer. It should be still running.
        Hide
        Enis Soztutar added a comment -

        I have tested 0.94-tip with and without the patch while running IntegrationTestDataIngestWithChaosMonkey, which is CM + LoadTestTool. The cluster was a 3 node EC2 running Hadoop1.0 from HDP. Without the patch, test succeeds in a 5 min run, however with the patch, my 2 attempts failed, with lots of ServerNotRunningYetException, plus Failed to write key: 3282. Also noticed that it took way too long to run the ingest (7500 inserts took ~15 min vs ~1:30 with the patch).

        Overall, the patch looks good from my limited understanding of the AM+SSH code base. Everytime I try to study it, I am running away horrified

        One question about the patch is:
        why don't we do the same thing in HRegionServer.closeRegion() and set the znode in the RPC handler?

        So, if we want to cut 0.94.5 tomorrow, I guess we can't include this without further testing.

        Show
        Enis Soztutar added a comment - I have tested 0.94-tip with and without the patch while running IntegrationTestDataIngestWithChaosMonkey, which is CM + LoadTestTool. The cluster was a 3 node EC2 running Hadoop1.0 from HDP. Without the patch, test succeeds in a 5 min run, however with the patch, my 2 attempts failed, with lots of ServerNotRunningYetException, plus Failed to write key: 3282. Also noticed that it took way too long to run the ingest (7500 inserts took ~15 min vs ~1:30 with the patch). Overall, the patch looks good from my limited understanding of the AM+SSH code base. Everytime I try to study it, I am running away horrified One question about the patch is: why don't we do the same thing in HRegionServer.closeRegion() and set the znode in the RPC handler? So, if we want to cut 0.94.5 tomorrow, I guess we can't include this without further testing.
        Hide
        Lars Hofhansl added a comment -

        Also noticed that it took way too long to run the ingest (7500 inserts took ~15 min vs ~1:30 with the patch).

        You mean it took 1:30 without the patch?

        Show
        Lars Hofhansl added a comment - Also noticed that it took way too long to run the ingest (7500 inserts took ~15 min vs ~1:30 with the patch). You mean it took 1:30 without the patch?
        Hide
        Lars Hofhansl added a comment -

        Moving to 0.94.6, we can always pull it back.

        Show
        Lars Hofhansl added a comment - Moving to 0.94.6, we can always pull it back.
        Hide
        Lars Hofhansl added a comment -

        And thanks for the testing, Enis!

        Show
        Lars Hofhansl added a comment - And thanks for the testing, Enis!
        Hide
        Sergey Shelukhin added a comment -

        What were the CM actions? Given the range of possibilities I wonder if it could be worse things done by CM on 2 last runs.
        When I was running a "consistent" test (...Targeted) I didn't observe this behavior

        Show
        Sergey Shelukhin added a comment - What were the CM actions? Given the range of possibilities I wonder if it could be worse things done by CM on 2 last runs. When I was running a "consistent" test (...Targeted) I didn't observe this behavior
        Hide
        Sergey Shelukhin added a comment -

        I am looking into this today btw. I think I may have found a bug.

        Show
        Sergey Shelukhin added a comment - I am looking into this today btw. I think I may have found a bug.
        Hide
        Sergey Shelukhin added a comment -

        The test passes for me with this small change (NPE in SSH).

        Show
        Sergey Shelukhin added a comment - The test passes for me with this small change (NPE in SSH).
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12567921/HBASE-7521-original-patch-ported-v3.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 18 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4325//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12567921/HBASE-7521-original-patch-ported-v3.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 18 new or modified tests. -1 patch . The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4325//console This message is automatically generated.
        Hide
        Enis Soztutar added a comment -

        Sergey Shelukhin, which test? You mean the unit tests or the integration test on the actual cluster?

        Show
        Enis Soztutar added a comment - Sergey Shelukhin , which test? You mean the unit tests or the integration test on the actual cluster?
        Hide
        Sergey Shelukhin added a comment -

        Integration test on EC2

        Show
        Sergey Shelukhin added a comment - Integration test on EC2
        Hide
        rajeshbabu added a comment -

        Patch looks good.

        Show
        rajeshbabu added a comment - Patch looks good.
        Hide
        Enis Soztutar added a comment -

        Let me test this one more time, then we can commit.

        Show
        Enis Soztutar added a comment - Let me test this one more time, then we can commit.
        Hide
        Sergey Shelukhin added a comment -

        Enis Soztutar Any update? Thanks!

        Show
        Sergey Shelukhin added a comment - Enis Soztutar Any update? Thanks!
        Hide
        Jimmy Xiang added a comment -

        The patch depends on the region states in the master memory (AM). The issue is that in 0.94, the region states in memory are not reliable, i.e., it could be wrong sometimes due to lack of proper synchronization. That's why we didn't try to fix HBASE-6060 in 0.94. However, if the community is ok with the patch, it is fine with me too.

        @Ram, could you take a look at the patch?

        Show
        Jimmy Xiang added a comment - The patch depends on the region states in the master memory (AM). The issue is that in 0.94, the region states in memory are not reliable, i.e., it could be wrong sometimes due to lack of proper synchronization. That's why we didn't try to fix HBASE-6060 in 0.94. However, if the community is ok with the patch, it is fine with me too. @Ram, could you take a look at the patch?
        Hide
        ramkrishna.s.vasudevan added a comment -
        if ((state.getServerName() != null) && state.getServerName().equals(sn)) {
        

        This new change has been added because when in OFFLINE state we don't populate the servername right?

        + ", which we are already trying to " + " ,which we are already trying to "
        

        This msg is coming twice.

        try {
              addRegionsInTransition(region, CLOSE);
            } catch (RegionAlreadyInTransitionException rate) {
              LOG.warn("Received close for region we are already opening or closing; "
                  + region.getEncodedName());
              return false;
            }
        

        Can we throw the exception instead of catching it as in openRegion()?

        Rest looks the same.
        @Jimmy
        Prior to your changes in Trunk wrt to AM things were messy. But i feel that if we get it in atleast the problem addressed by this JIRA is solved. There may be some issues hiding some where but the issue addressed here is pretty much big.
        Sergey's integration suite is a good one that tries to catch up many corner cases. I am +1 on getting this to 0.94.

        Show
        ramkrishna.s.vasudevan added a comment - if ((state.getServerName() != null ) && state.getServerName().equals(sn)) { This new change has been added because when in OFFLINE state we don't populate the servername right? + ", which we are already trying to " + " ,which we are already trying to " This msg is coming twice. try { addRegionsInTransition(region, CLOSE); } catch (RegionAlreadyInTransitionException rate) { LOG.warn( "Received close for region we are already opening or closing; " + region.getEncodedName()); return false ; } Can we throw the exception instead of catching it as in openRegion()? Rest looks the same. @Jimmy Prior to your changes in Trunk wrt to AM things were messy. But i feel that if we get it in atleast the problem addressed by this JIRA is solved. There may be some issues hiding some where but the issue addressed here is pretty much big. Sergey's integration suite is a good one that tries to catch up many corner cases. I am +1 on getting this to 0.94.
        Hide
        Sergey Shelukhin added a comment -

        This new change has been added because when in OFFLINE state we don't populate the servername right?

        Yes.

        Can we throw the exception instead of catching it as in openRegion()?

        Seems like a normal case (multiple close-s, or close during open)?

        Show
        Sergey Shelukhin added a comment - This new change has been added because when in OFFLINE state we don't populate the servername right? Yes. Can we throw the exception instead of catching it as in openRegion()? Seems like a normal case (multiple close-s, or close during open)?
        Hide
        Ted Yu added a comment -

        Patch v4 rebases on latest 0.94

        Addresses Ram's comment about duplicate message.

        Show
        Ted Yu added a comment - Patch v4 rebases on latest 0.94 Addresses Ram's comment about duplicate message.
        Hide
        Sergey Shelukhin added a comment -

        fixed test issue

        Show
        Sergey Shelukhin added a comment - fixed test issue
        Hide
        Jimmy Xiang added a comment -

        @Ram, sure. I am ok with getting this to 0.94. I was just in doubt this will fix all racing issues. For example, the following code (in #assign) is synchronized on the regionstate. However, AM#handleRegion is synchronized on regionsInTransition. Suppose serverManager.sendRegionOpen returns RegionOpeningState.OPENED, but OpenRegionHandler fails to open it and transitions the ZK node to RS_ZK_REGION_FAILED_OPEN, then handleRegion will transition the region state to Closed, right after that, this thread transitions the state back to PENDING_OPEN, which may cause some confusion further down the road.

        +        RegionOpeningState regionOpenState = serverManager.sendRegionOpen(plan.getDestination(),
        +            state.getRegion(), versionOfOfflineNode);
        +        if (regionOpenState == RegionOpeningState.OPENED) {
        +          // Transition RegionState to PENDING_OPEN
        +          if (state.isOffline() && !state.isOpening()) {
        +            state.update(RegionState.State.PENDING_OPEN,
        +                System.currentTimeMillis(), plan.getDestination());
        +          }
        
        Show
        Jimmy Xiang added a comment - @Ram, sure. I am ok with getting this to 0.94. I was just in doubt this will fix all racing issues. For example, the following code (in #assign) is synchronized on the regionstate. However, AM#handleRegion is synchronized on regionsInTransition. Suppose serverManager.sendRegionOpen returns RegionOpeningState.OPENED, but OpenRegionHandler fails to open it and transitions the ZK node to RS_ZK_REGION_FAILED_OPEN, then handleRegion will transition the region state to Closed, right after that, this thread transitions the state back to PENDING_OPEN, which may cause some confusion further down the road. + RegionOpeningState regionOpenState = serverManager.sendRegionOpen(plan.getDestination(), + state.getRegion(), versionOfOfflineNode); + if (regionOpenState == RegionOpeningState.OPENED) { + // Transition RegionState to PENDING_OPEN + if (state.isOffline() && !state.isOpening()) { + state.update(RegionState.State.PENDING_OPEN, + System .currentTimeMillis(), plan.getDestination()); + }
        Hide
        Ted Yu added a comment -

        Ran through 0.94 test suite:

        Tests run: 1272, Failures: 0, Errors: 0, Skipped: 13
        
        [INFO] ------------------------------------------------------------------------
        [INFO] BUILD SUCCESS
        [INFO] ------------------------------------------------------------------------
        [INFO] Total time: 46:17.736s
        [INFO] Finished at: Mon Feb 11 21:08:22 UTC 2013
        

        Plan to get the patch in by Friday.

        Show
        Ted Yu added a comment - Ran through 0.94 test suite: Tests run: 1272, Failures: 0, Errors: 0, Skipped: 13 [INFO] ------------------------------------------------------------------------ [INFO] BUILD SUCCESS [INFO] ------------------------------------------------------------------------ [INFO] Total time: 46:17.736s [INFO] Finished at: Mon Feb 11 21:08:22 UTC 2013 Plan to get the patch in by Friday.
        Hide
        Enis Soztutar added a comment -

        +1. I've been testing this patch for the last half hour or so with IntegrationTestDataIngestWithChaosMonkey, and hbck. It passes the tests most of the time, without running into the previous RIT's in OPENING state.

        Show
        Enis Soztutar added a comment - +1. I've been testing this patch for the last half hour or so with IntegrationTestDataIngestWithChaosMonkey, and hbck. It passes the tests most of the time, without running into the previous RIT's in OPENING state.
        Hide
        Ted Yu added a comment -

        So far Ram, Rajesh, Enis and I gave +1 on this issue.

        Plan to integrate tomorrow if there is no objection.

        Show
        Ted Yu added a comment - So far Ram, Rajesh, Enis and I gave +1 on this issue. Plan to integrate tomorrow if there is no objection.
        Hide
        Ted Yu added a comment -

        Lars Hofhansl:
        It would be nice to get your endorsement.

        Show
        Ted Yu added a comment - Lars Hofhansl : It would be nice to get your endorsement.
        Hide
        Lars Hofhansl added a comment -

        I'm -0 on this change. To me this looks like a pretty hairy change for 0.94.
        That said, if you guys are convinced this is right thing to do, please go ahead and integrate into 0.94.

        I assume this will not prevent a rolling upgrade from 0.94.5 to 0.94.6 (or from 0.92.x to 0.94.6).

        Show
        Lars Hofhansl added a comment - I'm -0 on this change. To me this looks like a pretty hairy change for 0.94. That said, if you guys are convinced this is right thing to do, please go ahead and integrate into 0.94. I assume this will not prevent a rolling upgrade from 0.94.5 to 0.94.6 (or from 0.92.x to 0.94.6).
        Hide
        ramkrishna.s.vasudevan added a comment -

        @Sergey
        Do you think any problem with rolling restart with this?
        I don't have experience with rolling restart. Also seeing the changes it does not introduce new states nor changes the order of the states.
        @Jimmy
        I agree that there may be some cases hidden. We still see in user mail list people saying Regions are in transition and they are not sure how to get around it.
        Atleast this patch solves the obvious cases if region in OPENING state.
        Jimmy, do you think this will affect rolling restart?
        @Stack
        What do you think before we integrate to 0.94?

        Show
        ramkrishna.s.vasudevan added a comment - @Sergey Do you think any problem with rolling restart with this? I don't have experience with rolling restart. Also seeing the changes it does not introduce new states nor changes the order of the states. @Jimmy I agree that there may be some cases hidden. We still see in user mail list people saying Regions are in transition and they are not sure how to get around it. Atleast this patch solves the obvious cases if region in OPENING state. Jimmy, do you think this will affect rolling restart? @Stack What do you think before we integrate to 0.94?
        Hide
        stack added a comment -

        What do I think? This is a big change in AM internals. I see Enis has tested it above but it only passes "most of the time". Any chance of some more testing before it gets committed?

        Show
        stack added a comment - What do I think? This is a big change in AM internals. I see Enis has tested it above but it only passes "most of the time". Any chance of some more testing before it gets committed?
        Hide
        Sergey Shelukhin added a comment -

        Why should there be problem with rolling restart in particular, as opposed to regular restart?
        I tested with regular restart and it appeared to be normal, although integration tests on 94 are not very stable to begin with.

        Alternatively, we can consider improving on the simpler patch I submitted (it fixes only specific issue in this JIRA, not the other one; but it's straightforward and as far as I was told there's only one (existing) race to fix to make it work, between retry and immediate reassignment).

        Show
        Sergey Shelukhin added a comment - Why should there be problem with rolling restart in particular, as opposed to regular restart? I tested with regular restart and it appeared to be normal, although integration tests on 94 are not very stable to begin with. Alternatively, we can consider improving on the simpler patch I submitted (it fixes only specific issue in this JIRA, not the other one; but it's straightforward and as far as I was told there's only one (existing) race to fix to make it work, between retry and immediate reassignment).
        Hide
        Jimmy Xiang added a comment -

        This patch removed a method from RegionServerServices. If some coprocessor happens to use it, it will be broken. It also added two new methods to this class. Other than that, I don't see other rolling restart/compatibility issue. However, the chance for a coprocessor to use these methods are very low.

        Show
        Jimmy Xiang added a comment - This patch removed a method from RegionServerServices. If some coprocessor happens to use it, it will be broken. It also added two new methods to this class. Other than that, I don't see other rolling restart/compatibility issue. However, the chance for a coprocessor to use these methods are very low.
        Hide
        Enis Soztutar added a comment -

        but it only passes "most of the time"

        Sorry I should have been more clear. The integration test actually does not pass all the time, because of other problems, not directly related to region assignment. We are running out of retries and giving up on write's. I should inspect more and create issues for that. Without the patch, the integration test does not even complete half the time because of the RITs in OPENING state.

        Show
        Enis Soztutar added a comment - but it only passes "most of the time" Sorry I should have been more clear. The integration test actually does not pass all the time, because of other problems, not directly related to region assignment. We are running out of retries and giving up on write's. I should inspect more and create issues for that. Without the patch, the integration test does not even complete half the time because of the RITs in OPENING state.
        Hide
        Ted Yu added a comment -

        This patch removed a method from RegionServerServices.

        I noticed this too.

        -  public Map<byte[], Boolean> getRegionsInTransitionInRS();
        +  public boolean removeFromRegionsInTransition(HRegionInfo hri);
        

        getRegionsInTransitionInRS() is supposed to be internal (from trunk):

        @InterfaceAudience.Private
        public interface RegionServerServices extends OnlineRegions {
        
        Show
        Ted Yu added a comment - This patch removed a method from RegionServerServices. I noticed this too. - public Map< byte [], Boolean > getRegionsInTransitionInRS(); + public boolean removeFromRegionsInTransition(HRegionInfo hri); getRegionsInTransitionInRS() is supposed to be internal (from trunk): @InterfaceAudience.Private public interface RegionServerServices extends OnlineRegions {
        Hide
        Lars Hofhansl added a comment -

        I specifically meant a rolling upgrade where some servers might still run 0.94.x (or even 0.92.x) and some will run 0.94.6.
        The coprocessor bit is fine I think (we didn't say we'll keep the internal APIs stable).

        Again, I'm not the decider. If you think this is an important fix and there reasonable confidence that this won't introduce any bad side-effects then please integrate.

        Show
        Lars Hofhansl added a comment - I specifically meant a rolling upgrade where some servers might still run 0.94.x (or even 0.92.x) and some will run 0.94.6. The coprocessor bit is fine I think (we didn't say we'll keep the internal APIs stable). Again, I'm not the decider. If you think this is an important fix and there reasonable confidence that this won't introduce any bad side-effects then please integrate.
        Hide
        Ted Yu added a comment -

        I would vote for this patch to go in.

        HBASE-6060 was opened about 9 months ago. We should address this issue.

        Show
        Ted Yu added a comment - I would vote for this patch to go in. HBASE-6060 was opened about 9 months ago. We should address this issue.
        Hide
        Lars Hofhansl added a comment -

        There are already enough +1's to commit.

        Show
        Lars Hofhansl added a comment - There are already enough +1's to commit.
        Hide
        Ted Yu added a comment -

        Integrated to 0.94

        Thanks for the patch, Sergey.

        Thanks for the reviews, Stack, Lars, Ram, Rajesh and Enis.

        Show
        Ted Yu added a comment - Integrated to 0.94 Thanks for the patch, Sergey. Thanks for the reviews, Stack, Lars, Ram, Rajesh and Enis.
        Hide
        Hudson added a comment -

        Integrated in HBase-0.94 #839 (See https://builds.apache.org/job/HBase-0.94/839/)
        HBASE-7521 fix HBASE-6060 (regions stuck in opening state) in 0.94 (Sergey and Rajeshbabu) (Revision 1445350)

        Result = FAILURE
        tedyu :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/RegionServerServices.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/Mocking.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestZKBasedOpenCloseRegion.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestCloseRegionHandler.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestOpenRegionHandler.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/util/MockRegionServerServices.java
        Show
        Hudson added a comment - Integrated in HBase-0.94 #839 (See https://builds.apache.org/job/HBase-0.94/839/ ) HBASE-7521 fix HBASE-6060 (regions stuck in opening state) in 0.94 (Sergey and Rajeshbabu) (Revision 1445350) Result = FAILURE tedyu : Files : /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/RegionServerServices.java /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/Mocking.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestZKBasedOpenCloseRegion.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestCloseRegionHandler.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestOpenRegionHandler.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/util/MockRegionServerServices.java
        Hide
        Hudson added a comment -

        Integrated in HBase-0.94-security #110 (See https://builds.apache.org/job/HBase-0.94-security/110/)
        HBASE-7521 fix HBASE-6060 (regions stuck in opening state) in 0.94 (Sergey and Rajeshbabu) (Revision 1445350)

        Result = SUCCESS
        tedyu :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/RegionServerServices.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/Mocking.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestZKBasedOpenCloseRegion.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestCloseRegionHandler.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestOpenRegionHandler.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/util/MockRegionServerServices.java
        Show
        Hudson added a comment - Integrated in HBase-0.94-security #110 (See https://builds.apache.org/job/HBase-0.94-security/110/ ) HBASE-7521 fix HBASE-6060 (regions stuck in opening state) in 0.94 (Sergey and Rajeshbabu) (Revision 1445350) Result = SUCCESS tedyu : Files : /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/RegionServerServices.java /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/Mocking.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestZKBasedOpenCloseRegion.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestCloseRegionHandler.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestOpenRegionHandler.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/util/MockRegionServerServices.java
        Hide
        Hudson added a comment -

        Integrated in HBase-0.94-security-on-Hadoop-23 #12 (See https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/12/)
        HBASE-7521 fix HBASE-6060 (regions stuck in opening state) in 0.94 (Sergey and Rajeshbabu) (Revision 1445350)

        Result = FAILURE
        tedyu :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/RegionServerServices.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/Mocking.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestZKBasedOpenCloseRegion.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestCloseRegionHandler.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestOpenRegionHandler.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/util/MockRegionServerServices.java
        Show
        Hudson added a comment - Integrated in HBase-0.94-security-on-Hadoop-23 #12 (See https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/12/ ) HBASE-7521 fix HBASE-6060 (regions stuck in opening state) in 0.94 (Sergey and Rajeshbabu) (Revision 1445350) Result = FAILURE tedyu : Files : /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/RegionServerServices.java /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/Mocking.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestZKBasedOpenCloseRegion.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestCloseRegionHandler.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestOpenRegionHandler.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/util/MockRegionServerServices.java
        Hide
        Hudson added a comment -

        Integrated in HBase-0.94 #898 (See https://builds.apache.org/job/HBase-0.94/898/)
        HBASE-8040 - Race condition in AM after HBASE-7521 (only 0.94) (Ram) (Revision 1456310)

        Result = FAILURE
        ramkrishna :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
        Show
        Hudson added a comment - Integrated in HBase-0.94 #898 (See https://builds.apache.org/job/HBase-0.94/898/ ) HBASE-8040 - Race condition in AM after HBASE-7521 (only 0.94) (Ram) (Revision 1456310) Result = FAILURE ramkrishna : Files : /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
        Hide
        Sergey Shelukhin added a comment -

        Hi. Quick check: what (if anything at all) prevents a race condition in this patch as is, if some server A dies, we start assigning the region from it to other server Band that server also dies (or, worse yet, it died before and now restarted, and we process the SSH just now).
        Wouldn't it cause double assign - first time assign of "resident" region from A, then "incoming" region of B?

        Show
        Sergey Shelukhin added a comment - Hi. Quick check: what (if anything at all) prevents a race condition in this patch as is, if some server A dies, we start assigning the region from it to other server Band that server also dies (or, worse yet, it died before and now restarted, and we process the SSH just now). Wouldn't it cause double assign - first time assign of "resident" region from A, then "incoming" region of B?
        Hide
        Hudson added a comment -

        Integrated in HBase-0.94-security #124 (See https://builds.apache.org/job/HBase-0.94-security/124/)
        HBASE-8040 - Race condition in AM after HBASE-7521 (only 0.94) (Ram) (Revision 1456310)

        Result = FAILURE
        ramkrishna :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
        Show
        Hudson added a comment - Integrated in HBase-0.94-security #124 (See https://builds.apache.org/job/HBase-0.94-security/124/ ) HBASE-8040 - Race condition in AM after HBASE-7521 (only 0.94) (Ram) (Revision 1456310) Result = FAILURE ramkrishna : Files : /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
        Hide
        Hudson added a comment -

        Integrated in HBase-0.94-security-on-Hadoop-23 #13 (See https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/13/)
        HBASE-8040 - Race condition in AM after HBASE-7521 (only 0.94) (Ram) (Revision 1456310)

        Result = FAILURE
        ramkrishna :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
        Show
        Hudson added a comment - Integrated in HBase-0.94-security-on-Hadoop-23 #13 (See https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/13/ ) HBASE-8040 - Race condition in AM after HBASE-7521 (only 0.94) (Ram) (Revision 1456310) Result = FAILURE ramkrishna : Files : /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java

          People

          • Assignee:
            Sergey Shelukhin
            Reporter:
            Sergey Shelukhin
          • Votes:
            0 Vote for this issue
            Watchers:
            13 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development