HBase
  1. HBase
  2. HBASE-5396

Handle the regions in regionPlans while processing ServerShutdownHandler

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 0.90.6
    • Fix Version/s: 0.92.3
    • Component/s: master
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      The regions plan to open on this server while ServerShutdownHandler is handling, just be removed from AM.regionPlans, and only left to TimeoutMonitor handle these regions. This need to optimize.

      1. HBASE-5396-90.patch
        9 kB
        Jieshan Bean
      2. HBASE-5396-90-final.patch
        10 kB
        Jieshan Bean
      3. HBASE-5396-90-forReview.patch
        9 kB
        Jieshan Bean
      4. HBASE-5396-90-V2.patch
        10 kB
        Jieshan Bean
      5. HBASE-5396-92.patch
        8 kB
        Jieshan Bean
      6. HBASE-5396-trunk.patch
        8 kB
        Jieshan Bean
      7. Logs-TestFor92.rar
        257 kB
        Jieshan Bean

        Issue Links

          Activity

          Hide
          Ted Yu added a comment -

          @Jieshan:
          Have you tested your patch in a 0.90 cluster ?

          +  public boolean isRegionOnline(HRegionInfo hri) {
          +    HServerInfo hsi = this.regions.get(hri);
          +    if (hsi != null && this.isServerOnline(hsi.getServerName())) {
          

          If hsi isn't online, should we remove region to server assignment mapping for hsi ?

          For RegionsWithDeadServer, would DeadServerWithRegions be a better name ?

          +    private Set<HRegionInfo> regionPlanOnThisServer = null;
          

          Since the member of the Set isn't RegionPlan, I suggest renaming the above field to regionsOnServer.

          Also, I think regionsOnServer and regionsInTransition should be initialized with empty HashSet. This would potentially avoid NPE in a few places.

          Show
          Ted Yu added a comment - @Jieshan: Have you tested your patch in a 0.90 cluster ? + public boolean isRegionOnline(HRegionInfo hri) { + HServerInfo hsi = this .regions.get(hri); + if (hsi != null && this .isServerOnline(hsi.getServerName())) { If hsi isn't online, should we remove region to server assignment mapping for hsi ? For RegionsWithDeadServer, would DeadServerWithRegions be a better name ? + private Set<HRegionInfo> regionPlanOnThisServer = null ; Since the member of the Set isn't RegionPlan, I suggest renaming the above field to regionsOnServer. Also, I think regionsOnServer and regionsInTransition should be initialized with empty HashSet. This would potentially avoid NPE in a few places.
          Hide
          Ted Yu added a comment -
          +                && regionPlanOnThisServer != null
          +                && !regionPlanOnThisServer.contains(rit.getRegion())) {
          

          Did you mean this ?

          +                && (regionsOnServer == null
          +                || !regionsOnServer.contains(rit.getRegion()))) {
          
          

          As I said above, if regionsOnServer was initialized to empty Set, the null check above can be omitted.

          Show
          Ted Yu added a comment - + && regionPlanOnThisServer != null + && !regionPlanOnThisServer.contains(rit.getRegion())) { Did you mean this ? + && (regionsOnServer == null + || !regionsOnServer.contains(rit.getRegion()))) { As I said above, if regionsOnServer was initialized to empty Set, the null check above can be omitted.
          Hide
          Jieshan Bean added a comment -

          Thanks, Ted. Only one doubt basing on your comments:

          Since the member of the Set isn't RegionPlan, I suggest renaming the above field to regionsOnServer.

          Since the Set came from the regionPlans. So can I keep this name as regionPlanOnThisServer? I think regionsOnServer will bring some misunderstanding. right?

          I have test this patch in 90 cluster for many times. And it works fine. I will upload the new patch today.

          Show
          Jieshan Bean added a comment - Thanks, Ted. Only one doubt basing on your comments: Since the member of the Set isn't RegionPlan, I suggest renaming the above field to regionsOnServer. Since the Set came from the regionPlans. So can I keep this name as regionPlanOnThisServer? I think regionsOnServer will bring some misunderstanding. right? I have test this patch in 90 cluster for many times. And it works fine. I will upload the new patch today.
          Hide
          Ted Yu added a comment -

          If you agree with renaming RegionsWithDeadServer to DeadServerWithRegions, I think regionsOnServer should be more suitable with DeadServerWithRegions. Basically it means region info's on the dead server.

          Show
          Ted Yu added a comment - If you agree with renaming RegionsWithDeadServer to DeadServerWithRegions, I think regionsOnServer should be more suitable with DeadServerWithRegions. Basically it means region info's on the dead server.
          Hide
          Jieshan Bean added a comment -

          ok..I rename it as "regionsOnServer". I will upload the new patch after tests. Thanks, Ted.

          Show
          Jieshan Bean added a comment - ok..I rename it as "regionsOnServer". I will upload the new patch after tests. Thanks, Ted.
          Hide
          stack added a comment -

          On the below:

          +  public boolean isRegionOnline(HRegionInfo hri) {
          +    HServerInfo hsi = this.regions.get(hri);
          +    if (hsi != null && this.isServerOnline(hsi.getServerName())) {
          +      return true;
          +    }
          +    return false;
          +  }
          

          Don't you have to take out a lock on this.regions before you access it? See the comment in this.servers.

          Also, could write the end of the method so:

          return hsi != null && this.isServerOnline(hsi.getServerName();
          

          Whats a RegionsWithDeadServer? Is it RegionsOnDeadServers?

          The below is called regionplan but its storing HRegionInfos?

          +    Set<HRegionInfo> regionPlanOnThisServer = new HashSet<HRegionInfo>();
          

          And then here, we are storing a Set of HRIs but method name talks of RegionPlans. Its a little hard to follow?

          Ditto here:

          +    private Set<HRegionInfo> regionPlanOnThisServer = null;
          

          and this...

          +    public Set<HRegionInfo> getRegionPlanOnThisServer() {
          

          This comment doesn't seem right?

          +   * Process result used by processServerShutdown.
          

          There is no processing done in this data structure.

          ... and save a few lines?

          Show
          stack added a comment - On the below: + public boolean isRegionOnline(HRegionInfo hri) { + HServerInfo hsi = this .regions.get(hri); + if (hsi != null && this .isServerOnline(hsi.getServerName())) { + return true ; + } + return false ; + } Don't you have to take out a lock on this.regions before you access it? See the comment in this.servers. Also, could write the end of the method so: return hsi != null && this .isServerOnline(hsi.getServerName(); Whats a RegionsWithDeadServer? Is it RegionsOnDeadServers? The below is called regionplan but its storing HRegionInfos? + Set<HRegionInfo> regionPlanOnThisServer = new HashSet<HRegionInfo>(); And then here, we are storing a Set of HRIs but method name talks of RegionPlans. Its a little hard to follow? Ditto here: + private Set<HRegionInfo> regionPlanOnThisServer = null ; and this... + public Set<HRegionInfo> getRegionPlanOnThisServer() { This comment doesn't seem right? + * Process result used by processServerShutdown. There is no processing done in this data structure. ... and save a few lines?
          Hide
          Jieshan Bean added a comment -

          This patch was modified basing on the above comments. I upload it just for review. Plz share your comment again if possible. Then I will start the tests.
          Thanks Stack & Ted.

          Show
          Jieshan Bean added a comment - This patch was modified basing on the above comments. I upload it just for review. Plz share your comment again if possible. Then I will start the tests. Thanks Stack & Ted.
          Hide
          Jieshan Bean added a comment -

          I renamed "regionPlanOnThisServer" to "regionOpeningOnServer", for emphasizing those regions were being processed when this HRegionServer been killed.
          Since the class name is "RegionsOnDeadServer", and "regionPlanOnThisServer" can be distinguished from it.

            public RegionsOnDeadServer processServerShutdown(final HServerInfo hsi) {
              RegionsOnDeadServer regionsOnDeadServer = new RegionsOnDeadServer();
              Set<HRegionInfo> regionsOpeningOnServer = new HashSet<HRegionInfo>();
          
          Show
          Jieshan Bean added a comment - I renamed "regionPlanOnThisServer" to "regionOpeningOnServer", for emphasizing those regions were being processed when this HRegionServer been killed. Since the class name is "RegionsOnDeadServer", and "regionPlanOnThisServer" can be distinguished from it. public RegionsOnDeadServer processServerShutdown(final HServerInfo hsi) { RegionsOnDeadServer regionsOnDeadServer = new RegionsOnDeadServer(); Set<HRegionInfo> regionsOpeningOnServer = new HashSet<HRegionInfo>();
          Hide
          Jieshan Bean added a comment -

          All tests passed. And I also tested it in real cluster.

          Show
          Jieshan Bean added a comment - All tests passed. And I also tested it in real cluster.
          Hide
          Ted Yu added a comment -
          +          // Store the related regions in regionPlans.
          +          regionsOpeningOnServer.add(e.getValue().getRegionInfo());
          

          In order to make field name consistent with the above javadoc, how about naming it regionsFromRegionPlansForServer so that its source is clear ?

          Also, please take a look at TRUNK and be prepared for TRUNK patch after 0.90 fix is finalized.

          Thanks for the hard work.

          Show
          Ted Yu added a comment - + // Store the related regions in regionPlans. + regionsOpeningOnServer.add(e.getValue().getRegionInfo()); In order to make field name consistent with the above javadoc, how about naming it regionsFromRegionPlansForServer so that its source is clear ? Also, please take a look at TRUNK and be prepared for TRUNK patch after 0.90 fix is finalized. Thanks for the hard work.
          Hide
          Jieshan Bean added a comment -

          Sometime, it's really hard to get a good name.
          I've changed it in "HBASE-5396-90-final.patch". Thanks Ted.
          I will make the patch for TRUNK today.

          Show
          Jieshan Bean added a comment - Sometime, it's really hard to get a good name. I've changed it in " HBASE-5396 -90-final.patch". Thanks Ted. I will make the patch for TRUNK today.
          Hide
          Ted Yu added a comment -

          +1 on patch for 90. Running test suite.

          Show
          Ted Yu added a comment - +1 on patch for 90. Running test suite.
          Hide
          Ted Yu added a comment -

          Test result for latest patch is positive.

          HBASE-5396-90-final.patch is good to go for 0.90 branch

          Show
          Ted Yu added a comment - Test result for latest patch is positive. HBASE-5396 -90-final.patch is good to go for 0.90 branch
          Hide
          ramkrishna.s.vasudevan added a comment -

          Committed to 0.90. Once Jieshan gives patch for trunk will commit to 0.92 and trunk.

          Show
          ramkrishna.s.vasudevan added a comment - Committed to 0.90. Once Jieshan gives patch for trunk will commit to 0.92 and trunk.
          Hide
          Jieshan Bean added a comment -

          I tested the patch for 92 by unit test and also in real cluster. This problem seems not represent in 92 and trunk version. I will give more tests and went through the code to check whether it's necessary for 92 and trunk.

          Show
          Jieshan Bean added a comment - I tested the patch for 92 by unit test and also in real cluster. This problem seems not represent in 92 and trunk version. I will give more tests and went through the code to check whether it's necessary for 92 and trunk.
          Hide
          stack added a comment -

          @Jieshan Thats interesting. Thanks for checking it out. Why do we not have the prob. in 0.92/trunk? Is code different?

          Show
          stack added a comment - @Jieshan Thats interesting. Thanks for checking it out. Why do we not have the prob. in 0.92/trunk? Is code different?
          Hide
          Jieshan Bean added a comment -

          That problem is still there in 92 and trunk..but difficult to reproduce..After more tests, I reproduced it and verified the patch for 92. I haven't tested the patch for trunk.

          Show
          Jieshan Bean added a comment - That problem is still there in 92 and trunk..but difficult to reproduce..After more tests, I reproduced it and verified the patch for 92. I haven't tested the patch for trunk.
          Hide
          Jieshan Bean added a comment -

          RegionServer Log:

          2012-02-20 23:24:49,432 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3359b077bc90018 Attempting to transition node 897ad476f426e58a36cae77c7302be1d from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
          2012-02-20 23:24:49,447 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3359b077bc90018 Successfully transitioned node 897ad476f426e58a36cae77c7302be1d from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
          2012-02-20 23:24:49,448 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Opening region: {NAME => 'jeason,001477777777,1329747764680.897ad476f426e58a36cae77c7302be1d.', STARTKEY => '001477777777', ENDKEY => '001488888888', ENCODED => 897ad476f426e58a36cae77c7302be1d,}
          2012-02-20 23:24:49,448 INFO org.apache.hadoop.hbase.regionserver.HRegion: Setting up tabledescriptor config now ...
          2012-02-20 23:24:49,448 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Instantiated jeason,001477777777,1329747764680.897ad476f426e58a36cae77c7302be1d.
          2012-02-20 23:24:49,455 INFO org.apache.hadoop.hbase.regionserver.HRegion: Onlined jeason,001477777777,1329747764680.897ad476f426e58a36cae77c7302be1d.; next sequenceid=1
          //This regionserver was killed just after the below log.
          2012-02-20 23:24:49,455 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3359b077bc90018 Attempting to transition node 897ad476f426e58a36cae77c7302be1d from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
          

          HMaster Log:

          2012-02-20 23:25:36,442 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region jeason,001477777777,1329747764680.897ad476f426e58a36cae77c7302be1d. to C3S32,20020,1329798176762
          2012-02-20 23:25:36,442 DEBUG org.apache.hadoop.hbase.master.ServerManager: New connection to C3S32,20020,1329798176762
          2012-02-20 23:25:36,462 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=C3S32,20020,1329798176762, region=561e68708199320894bfcfc533cae772
          2012-02-20 23:25:36,462 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for jeason,001377777777,1329747764680.561e68708199320894bfcfc533cae772. from C3S32,20020,1329798176762; deleting unassigned node
          2012-02-20 23:25:36,462 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x2359b0778ec000d Deleting existing unassigned node for 561e68708199320894bfcfc533cae772 that is in expected state RS_ZK_REGION_OPENED
          2012-02-20 23:25:36,463 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=C3S32,20020,1329798176762, region=3455f468b76dcf7cfe3881f9e156d0c8
          2012-02-20 23:25:36,468 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x2359b0778ec000d Successfully deleted unassigned node for region 561e68708199320894bfcfc533cae772 in expected state RS_ZK_REGION_OPENED
          2012-02-20 23:25:36,468 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region jeason,001377777777,1329747764680.561e68708199320894bfcfc533cae772. has been deleted.
          2012-02-20 23:25:36,468 INFO org.apache.hadoop.hbase.master.AssignmentManager: The master has opened the region jeason,001377777777,1329747764680.561e68708199320894bfcfc533cae772. that was online on C3S32,20020,1329798176762
          2012-02-20 23:25:36,472 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: 1 regions which planned to open on C3S31,20020,1329798177361 be re-assigned.
          
          Show
          Jieshan Bean added a comment - RegionServer Log: 2012-02-20 23:24:49,432 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3359b077bc90018 Attempting to transition node 897ad476f426e58a36cae77c7302be1d from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING 2012-02-20 23:24:49,447 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3359b077bc90018 Successfully transitioned node 897ad476f426e58a36cae77c7302be1d from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING 2012-02-20 23:24:49,448 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Opening region: {NAME => 'jeason,001477777777,1329747764680.897ad476f426e58a36cae77c7302be1d.', STARTKEY => '001477777777', ENDKEY => '001488888888', ENCODED => 897ad476f426e58a36cae77c7302be1d,} 2012-02-20 23:24:49,448 INFO org.apache.hadoop.hbase.regionserver.HRegion: Setting up tabledescriptor config now ... 2012-02-20 23:24:49,448 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Instantiated jeason,001477777777,1329747764680.897ad476f426e58a36cae77c7302be1d. 2012-02-20 23:24:49,455 INFO org.apache.hadoop.hbase.regionserver.HRegion: Onlined jeason,001477777777,1329747764680.897ad476f426e58a36cae77c7302be1d.; next sequenceid=1 //This regionserver was killed just after the below log. 2012-02-20 23:24:49,455 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3359b077bc90018 Attempting to transition node 897ad476f426e58a36cae77c7302be1d from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING HMaster Log: 2012-02-20 23:25:36,442 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region jeason,001477777777,1329747764680.897ad476f426e58a36cae77c7302be1d. to C3S32,20020,1329798176762 2012-02-20 23:25:36,442 DEBUG org.apache.hadoop.hbase.master.ServerManager: New connection to C3S32,20020,1329798176762 2012-02-20 23:25:36,462 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=C3S32,20020,1329798176762, region=561e68708199320894bfcfc533cae772 2012-02-20 23:25:36,462 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for jeason,001377777777,1329747764680.561e68708199320894bfcfc533cae772. from C3S32,20020,1329798176762; deleting unassigned node 2012-02-20 23:25:36,462 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x2359b0778ec000d Deleting existing unassigned node for 561e68708199320894bfcfc533cae772 that is in expected state RS_ZK_REGION_OPENED 2012-02-20 23:25:36,463 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=C3S32,20020,1329798176762, region=3455f468b76dcf7cfe3881f9e156d0c8 2012-02-20 23:25:36,468 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x2359b0778ec000d Successfully deleted unassigned node for region 561e68708199320894bfcfc533cae772 in expected state RS_ZK_REGION_OPENED 2012-02-20 23:25:36,468 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region jeason,001377777777,1329747764680.561e68708199320894bfcfc533cae772. has been deleted. 2012-02-20 23:25:36,468 INFO org.apache.hadoop.hbase.master.AssignmentManager: The master has opened the region jeason,001377777777,1329747764680.561e68708199320894bfcfc533cae772. that was online on C3S32,20020,1329798176762 2012-02-20 23:25:36,472 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: 1 regions which planned to open on C3S31,20020,1329798177361 be re-assigned.
          Hide
          Jieshan Bean added a comment -

          I'm running the unit tests for TRUNK.

          Show
          Jieshan Bean added a comment - I'm running the unit tests for TRUNK.
          Hide
          stack added a comment -

          If you found it in 0.92, thats good enough – its in TRUNK I'd say.

          Do you have more of the regionserver log? Why does it say its aborting? You don't have it in your log above (The logs above look 'normal'.. we need to bits that show it gone awry... thanks Jieshan).

          Show
          stack added a comment - If you found it in 0.92, thats good enough – its in TRUNK I'd say. Do you have more of the regionserver log? Why does it say its aborting? You don't have it in your log above (The logs above look 'normal'.. we need to bits that show it gone awry... thanks Jieshan).
          Hide
          Jieshan Bean added a comment -

          This is the complete logs. I added some logs to track. And add sleeping code in opening to enlarge the proability of this problem.

          Show
          Jieshan Bean added a comment - This is the complete logs. I added some logs to track. And add sleeping code in opening to enlarge the proability of this problem.
          Hide
          Jieshan Bean added a comment -

          All tests passed for TRUNK.

          Show
          Jieshan Bean added a comment - All tests passed for TRUNK.
          Hide
          stack added a comment -

          The log does not show the regionserver aborting. Should it? Or am I misunderstanding (I guess I'm not clear on what I should be looking for in this log. Please help me Jieshan. Sorry for being a bit slow).

          Show
          stack added a comment - The log does not show the regionserver aborting. Should it? Or am I misunderstanding (I guess I'm not clear on what I should be looking for in this log. Please help me Jieshan. Sorry for being a bit slow).
          Hide
          Jieshan Bean added a comment -

          @stack,
          RegionServer process was killed manually. Not aborted itself. This is the steps of my test:
          Suppose there's 3 nodes in cluster: A, B, C.
          1. Create a table with 1000 regions.
          2. Assign all regions to C.
          3. Kill C. During the ServerShutdownHandler processing, then kill A(At this time, some regions may been openning on this server.)

          See whether all the regions can be opened in time(And also check the regions which in regionPlans).

          Show
          Jieshan Bean added a comment - @stack, RegionServer process was killed manually. Not aborted itself. This is the steps of my test: Suppose there's 3 nodes in cluster: A, B, C. 1. Create a table with 1000 regions. 2. Assign all regions to C. 3. Kill C. During the ServerShutdownHandler processing, then kill A(At this time, some regions may been openning on this server.) See whether all the regions can be opened in time(And also check the regions which in regionPlans).
          Hide
          stack added a comment -

          So what am I looking to see in the log snippet above (and in the attached log?) Thanks Jieshan.

          Show
          stack added a comment - So what am I looking to see in the log snippet above (and in the attached log?) Thanks Jieshan.
          Hide
          Jieshan Bean added a comment -

          Only see the below and related logs. It means the region in regionPlans can be reassigned with this patch.

          2012-02-20 23:25:36,472 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: 1 regions which planned to open on C3S31,20020,1329798177361 be re-assigned.
          

          Thanks, Stack.

          Show
          Jieshan Bean added a comment - Only see the below and related logs. It means the region in regionPlans can be reassigned with this patch. 2012-02-20 23:25:36,472 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: 1 regions which planned to open on C3S31,20020,1329798177361 be re-assigned. Thanks, Stack.
          Hide
          ramkrishna.s.vasudevan added a comment -

          BTW,
          do we need to put this into 0.92 and above?

          Show
          ramkrishna.s.vasudevan added a comment - BTW, do we need to put this into 0.92 and above?

            People

            • Assignee:
              Jieshan Bean
              Reporter:
              Jieshan Bean
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:

                Development