HBASE-4400

.META. getting stuck if RS hosting it is dead and znode state is in RS_ZK_REGION_OPENED

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.90.5
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Start two region servers (RS1 and RS2).
      .META. is hosted by RS2, but RS2 goes down while this is still being processed.

      Now restart the master and RS1. The master gets the RS name from the znode, which is in state RS_ZK_REGION_OPENED. But since RS2 is still not online, the master is not able to process .META. at all. The logs are below:

      2011-09-14 16:43:51,949 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=linux76,60020,1315998828523, region=70236052/-ROOT-
      2011-09-14 16:43:51,968 INFO org.apache.hadoop.hbase.master.HMaster: -ROOT- assigned=1, rit=false, location=linux76:60020
      2011-09-14 16:43:51,970 INFO org.apache.hadoop.hbase.master.AssignmentManager: Processing region .META.,,1.1028785192 in state RS_ZK_REGION_OPENED
      2011-09-14 16:43:51,970 INFO org.apache.hadoop.hbase.master.AssignmentManager: Failed to find linux146,60020,1315998414623 in list of online servers; skipping registration of open of .META.,,1.1028785192
      2011-09-14 16:43:51,971 INFO org.apache.hadoop.hbase.master.AssignmentManager: Waiting on 1028785192/.META.
      2011-09-14 16:43:51,983 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=linux76,60020,1315998828523, region=70236052/-ROOT-
      2011-09-14 16:43:51,986 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for 70236052; deleting unassigned node
      2011-09-14 16:43:51,986 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x13267854032001d Deleting existing unassigned node for 70236052 that is in expected state RS_ZK_REGION_OPENED
      2011-09-14 16:43:51,998 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x13267854032001d Successfully deleted unassigned node for region 70236052 in expected state RS_ZK_REGION_OPENED
      2011-09-14 16:43:51,999 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region -ROOT-,,0.70236052 on linux76,60020,1315998828523
      2011-09-14 16:44:00,945 INFO org.apache.hadoop.hbase.master.ServerManager: Registering server=linux146,60020,1315998839724, regionCount=0, userLoad=false
      2011-09-14 16:46:20,003 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  .META.,,1.1028785192 state=OPEN, ts=0
      2011-09-14 16:46:20,004 ERROR org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for too long, we don't know where region was opened so can't do anything
      
              regionsInTransition.put(encodedRegionName, new RegionState(
                  regionInfo, RegionState.State.OPEN, data.getStamp()));
                ................
              } else {
                HServerInfo hsi = this.serverManager.getServerInfo(sn);
                if (hsi == null) {
                  LOG.info("Failed to find " + sn +
                    " in list of online servers; skipping registration of open of " +
                    regionInfo.getRegionNameAsString());
                } else {
                  new OpenedRegionHandler(master, this, regionInfo, hsi).process();
                }
              }
      

      So the timeout monitor is not able to do anything here:

                LOG.error("Region has been OPEN for too long, " +
                "we don't know where region was opened so can't do anything");
                synchronized(regionState) {
                  regionState.update(regionState.getState());
                }
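
The two excerpts above can be modelled in a small standalone sketch. This is only an illustration under simplifying assumptions: the class, fields, and method names below are hypothetical and are not HBase APIs. It shows why a region whose OPENED znode names an offline server never leaves transition when the timeout monitor only refreshes the timestamp.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified, hypothetical model of the 0.90.x behavior; not HBase code.
class StuckOpenedRegionSketch {
    final Set<String> onlineServers = new HashSet<>();
    // Region name -> timestamp of the last state update while in state OPEN.
    final Map<String, Long> regionsInTransition = new HashMap<>();

    // Models the branch that logs "Failed to find ... in list of online servers".
    void processOpenedZnode(String region, String server) {
        regionsInTransition.put(region, 0L);
        if (onlineServers.contains(server)) {
            // Stand-in for the opened-region handler: the region leaves transition.
            regionsInTransition.remove(region);
        }
        // Otherwise the open is skipped and the region stays in transition.
    }

    // Models the timeout monitor branch for OPEN: it only refreshes the stamp.
    void timeoutMonitorTick(long now) {
        regionsInTransition.replaceAll((region, stamp) -> now);
    }

    boolean isStuck(String region) {
        return regionsInTransition.containsKey(region);
    }

    public static void main(String[] args) {
        StuckOpenedRegionSketch master = new StuckOpenedRegionSketch();
        master.onlineServers.add("rs1");             // rs2 hosted .META. but is down
        master.processOpenedZnode(".META.", "rs2");
        for (long tick = 1; tick <= 3; tick++) {
            master.timeoutMonitorTick(tick);
        }
        System.out.println("still in transition: " + master.isStuck(".META."));
    }
}
```

In this model no number of monitor ticks removes the region from transition, which matches the "can't do anything" log message above.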
      
      1. HBASE-4400_0.90_1.patch
        7 kB
        ramkrishna.s.vasudevan
      2. HBASE-4400_trunk_1.patch
        7 kB
        ramkrishna.s.vasudevan
      3. HBASE-4400_0.90.patch
        5 kB
        ramkrishna.s.vasudevan
      4. HBASE-4400_trunk.patch
        5 kB
        ramkrishna.s.vasudevan

        Activity

        stack added a comment -

        Ted committed this yesterday

        ramkrishna.s.vasudevan added a comment -

        Thanks Ted and Stack.

        Hudson added a comment -

        Integrated in HBase-TRUNK #2228 (See https://builds.apache.org/job/HBase-TRUNK/2228/)
        HBASE-4400 fixed up the anonymous Abortable in createAndForceNodeToOpenedState()
        HBASE-4400 rename metaRegion to region in HBaseTestingUtility.createAndForceNodeToOpenedState()
        HBASE-4400 .META. getting stuck if RS hosting it is dead and znode state is in
        RS_ZK_REGION_OPENED (Ramkrishna)

        tedyu :
        Files :

        • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java

        tedyu :
        Files :

        • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java

        tedyu :
        Files :

        • /hbase/trunk/CHANGES.txt
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
        • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java
        Ted Yu added a comment -

        Minor change: I renamed metaRegion to region in createAndForceNodeToOpenedState()
        I also fixed up the anonymous Abortable in createAndForceNodeToOpenedState().

        Ted Yu added a comment -

        Integrated to TRUNK and branch.

        Thanks for the patches Ramkrishna.

        Thanks for the review Michael.

        stack added a comment -

        +1 Nice test.

        ramkrishna.s.vasudevan added a comment -

        Addressing Ted's comments: removed the unused variable and moved the creation of the OPENED node to HBaseTestingUtility.

        Ted Yu added a comment -

        +1 on patch.
        For TestMasterFailover#testShouldCheckMasterFailOverWhenMETAIsInOpenedState, the following is never read:

            int activeIndex = -1;
        

        We don't need the above variable.

        Around line 188, I think doing the following mimics the situation for this JIRA:

              if (null != metaRegion) {
                regionServer.abort("");
                break;
              }
        

        i.e. we only need to abort the region server carrying .META.
        I ran with the above change and the test passed.

        I like the way you transition .META. znode to RS_ZK_REGION_OPENED. I think it would be nice to extract lines 196 to 210 into a separate method so that other developers can utilize it later.

        I think the new method can be put into HBaseTestingUtility.

        Good job, Ramkrishna.

        ramkrishna.s.vasudevan added a comment -

        All test cases are passing.

        ramkrishna.s.vasudevan added a comment -

        Testcases are running in trunk.

        Ted Yu added a comment -

        I think the above approach is good.

        ramkrishna.s.vasudevan added a comment -

        I am attaching the patch. The test cases are still running. I have written a test case in the patch; run without the fix, it reproduces the problem in both trunk and 0.90.x.
        The problem is very clear in 0.90.x, but in trunk there is a small change in how things work when the master finds a region in transition in the OPENED state.
        In trunk:

        -        } else if (isOnDeadServer(regionInfo, deadServers) &&
        -            !serverManager.isServerOnline(sn)) {
        -          // If was on a dead server, then its not open any more; needs
        -          // handling.
                  // If was on a dead server, then its not open any more; needs
                  // handling.
                  forceOffline(regionInfo, data);
                } else {
                  new OpenedRegionHandler(master, this, regionInfo, sn).process();
                }
        

        Here, because the dead-server list is not yet populated while the META region is being processed, the else branch is executed and the catalog tracker is notified that the META region is on the dead server (although no META region is actually open there). The IPC connection to this dead server cannot be established, so assignmentManager.assignMeta() gets called and tries to assign the region, but the callback does not happen when the META node transitions from OFFLINE to OPENING and OPENED. (Why does the callback not happen? Is it because the MetaNodeTracker executed the nodeDeleted() API? Not sure.)

        In 0.90.x, as noted in my previous comments:

        forceOffline(regionInfo, data);
                } else {
                  HServerInfo hsi = this.serverManager.getServerInfo(sn);
                  if (hsi == null) {
                    LOG.info("Failed to find " + sn +
                      " in list of online servers; skipping registration of open of " +
                      regionInfo.getRegionNameAsString());
                  } else {
                    new OpenedRegionHandler(master, this, regionInfo, hsi).process();
                  }
        

        An additional check is present, which makes things worse: the META region in transition (RIT) cannot be processed in the OPENED state, and the system hangs forever.
        So for both versions my idea was to change the condition check from

        -        } else if (isOnDeadServer(regionInfo, deadServers) &&
        -            !serverManager.isServerOnline(sn)) {
        

        to

        +        } else if (!serverManager.isServerOnline(sn)
        +            && (isOnDeadServer(regionInfo, deadServers)
        +                || regionInfo.isMetaRegion() || regionInfo.isRootRegion())) {
        

        so that in both cases the META node can be forced to OFFLINE and a fresh assignment can be done.
        The test cases are running. I would like to know whether it is fine to do it this way.
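
The difference between the two conditions can be checked in isolation with a minimal sketch. The class and method names are hypothetical; the boolean parameters stand in for serverManager.isServerOnline(sn), isOnDeadServer(regionInfo, deadServers), regionInfo.isMetaRegion() and regionInfo.isRootRegion() from the snippets above.

```java
// Hypothetical standalone version of the old and proposed conditions; not HBase code.
class ForceOfflineCondition {
    // Old check: force offline only if the region is known to be on a dead server.
    static boolean oldCondition(boolean serverOnline, boolean onDeadServer) {
        return onDeadServer && !serverOnline;
    }

    // Proposed check: catalog regions on an offline server are also forced offline,
    // even when the dead-server list has not been populated yet.
    static boolean newCondition(boolean serverOnline, boolean onDeadServer,
                                boolean isMeta, boolean isRoot) {
        return !serverOnline && (onDeadServer || isMeta || isRoot);
    }

    public static void main(String[] args) {
        // .META. on a server that is not online, dead-server list still empty.
        System.out.println("old: " + oldCondition(false, false));
        System.out.println("new: " + newCondition(false, false, true, false));
    }
}
```

For the scenario in this issue the old condition evaluates to false and the proposed one to true, so only the proposed check reaches forceOffline(). User regions on an offline server that is not yet listed as dead are treated the same way as before.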

        ramkrishna.s.vasudevan added a comment -

        The above analysis is for 0.90.x.
        I will verify trunk and then see whether there is any problem there.

        Ted Yu added a comment -

        The above analysis makes sense.

        ramkrishna.s.vasudevan added a comment -

        The root cause of the problem here is in HMaster:

            // Make sure root and meta assigned before proceeding.
            assignRootAndMeta();
        
            // Is this fresh start with no regions assigned or are we a master joining
            // an already-running cluster?  If regionsCount == 0, then for sure a
            // fresh start.  TOOD: Be fancier.  If regionsCount == 2, perhaps the
            // 2 are .META. and -ROOT- and we should fall into the fresh startup
            // branch below.  For now, do processFailover.
            if (regionCount == 0) {
              LOG.info("Master startup proceeding: cluster startup");
              this.assignmentManager.cleanoutUnassigned();
              this.assignmentManager.assignAllUserRegions();
            } else {
              LOG.info("Master startup proceeding: master failover");
              this.assignmentManager.processFailover();
            }
        

        Assigning root and meta is done first, and only then is processFailover() called, which is where we take the dead and online servers into account. So when the master sees META in RIT with the znode in the OPENED state, we are not able to bring META out of transition, not even through the timeout monitor.
        Correct me if my analysis is wrong.
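
The ordering problem can be illustrated with a small sketch. It is a hypothetical model, not HBase code: the method names only echo assignRootAndMeta() and processFailover() from the excerpt above, and the dead-server bookkeeping is reduced to a plain set.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the master startup ordering; not HBase code.
class StartupOrderSketch {
    final Set<String> deadServers = new HashSet<>();
    final List<String> events = new ArrayList<>();

    // Step 1: catalog regions are handled before any dead-server bookkeeping.
    // Returns whether the server hosting .META. is already known to be dead.
    boolean assignRootAndMeta(String metaHost) {
        boolean knownDead = deadServers.contains(metaHost);
        events.add("meta host known dead at catalog assignment: " + knownDead);
        return knownDead;
    }

    // Step 2: only failover processing learns which servers are dead.
    void processFailover(Set<String> serversFoundDead) {
        deadServers.addAll(serversFoundDead);
        events.add("dead servers after failover: " + deadServers);
    }

    public static void main(String[] args) {
        StartupOrderSketch master = new StartupOrderSketch();
        master.assignRootAndMeta("rs2");
        master.processFailover(Collections.singleton("rs2"));
        master.events.forEach(System.out::println);
    }
}
```

Because the dead-server set is empty during the first step, any check that depends on it cannot recognise the server that hosted .META. as dead at that point.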


          People

          • Assignee:
            ramkrishna.s.vasudevan
            Reporter:
            ramkrishna.s.vasudevan
          • Votes:
            0
            Watchers:
            3
