HBase / HBASE-4340

Hbase can't balance if ServerShutdownHandler encountered exception

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.90.4
    • Fix Version/s: 0.90.5
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      Version: 0.90.4
      Cluster: 40 boxes
      According to the logs below, the balancer could not run because the master was still processing a dead region server (RS).
      I dug deeper and found two issues:

      1. ServerShutdownHandler did not clear numProcessing when it hit certain exceptions. Whatever the exception, it seems we should either clear the flag or shut down the master.

      2. The message "dead regionserver(s): [158-1-130-12,20020,1314971097929]" is inaccurate. The dead server should be "158-1-130-10,20020,1315068597979".
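
      The first issue can be illustrated with a minimal, self-contained sketch. It is not the patch attached to this issue and it does not use HBase classes: the class name, `handleShutdown`, and `balancerMayRun` are hypothetical stand-ins, and only the counter name `numProcessing` is borrowed from the description. The sketch shows how a try/finally block keeps the counter from leaking when the handler throws.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ShutdownCounterSketch {
    // Hypothetical stand-in for the master's count of dead servers still being processed.
    static final AtomicInteger numProcessing = new AtomicInteger();

    // Hypothetical stand-in for the balancer's guard: it only runs when nothing is in flight.
    static boolean balancerMayRun() {
        return numProcessing.get() == 0;
    }

    // Runs the shutdown work and always releases the counter, even if the work throws.
    static void handleShutdown(Runnable work) {
        numProcessing.incrementAndGet();
        try {
            work.run();
        } finally {
            numProcessing.decrementAndGet();
        }
    }

    public static void main(String[] args) {
        try {
            // Simulate the handler failing part-way through, as in the reported NPE.
            handleShutdown(() -> {
                throw new NullPointerException("simulated failure");
            });
        } catch (NullPointerException e) {
            // The failure still propagates; only the bookkeeping is protected.
        }
        System.out.println("balancer may run: " + balancerMayRun());
    }
}
```

      Without the finally block, the simulated exception would leave the counter at 1 and the guard would keep refusing to run the balancer, which matches the symptom in the master logs.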

      //master logs:
      2011-09-05 00:28:00,487 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-130-12,20020,1314971097929]
      2011-09-05 00:33:00,489 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-130-12,20020,1314971097929]
      2011-09-05 00:38:00,493 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-130-12,20020,1314971097929]
      2011-09-05 00:43:00,495 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-130-12,20020,1314971097929]
      2011-09-05 00:48:00,499 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-130-12,20020,1314971097929]
      2011-09-05 00:53:00,501 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-130-12,20020,1314971097929]
      2011-09-05 00:58:00,501 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-130-12,20020,1314971097929]
      2011-09-05 01:03:00,502 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-130-12,20020,1314971097929]
      2011-09-05 01:08:00,506 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-130-12,20020,1314971097929]
      2011-09-05 01:13:00,508 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-130-12,20020,1314971097929]
      2011-09-05 01:18:00,512 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-130-12,20020,1314971097929]
      2011-09-05 01:23:00,514 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-130-12,20020,1314971097929]
      2011-09-05 01:28:00,518 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-130-12,20020,1314971097929]
      2011-09-05 01:33:00,520 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-130-12,20020,1314971097929]
      2011-09-05 01:38:00,524 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-130-12,20020,1314971097929]
      2011-09-05 01:43:00,526 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-130-12,20020,1314971097929]
      2011-09-05 01:48:00,530 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-130-12,20020,1314971097929]
      2011-09-05 01:53:00,532 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-130-12,20020,1314971097929]
      2011-09-05 01:58:00,536 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-130-12,20020,1314971097929]
      2011-09-05 02:03:00,537 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-130-12,20020,1314971097929]
      2011-09-05 02:08:00,538 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-130-12,20020,1314971097929]
      2011-09-05 02:13:00,539 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-130-12,20020,1314971097929]
      2011-09-05 02:18:00,543 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-130-12,20020,1314971097929]

      // the exception logs:
      2011-09-03 18:13:27,550 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=158-1-133-11,20020,1315069437236, region=0db4088d75c58dd22f93f389d90ba6cc
      2011-09-03 18:13:27,550 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN java.lang.NullPointerException
      at org.apache.hadoop.hbase.util.Bytes.toLong(Bytes.java:480)
      at org.apache.hadoop.hbase.util.Bytes.toLong(Bytes.java:454)
      at org.apache.hadoop.hbase.catalog.MetaReader.metaRowToRegionPairWithInfo(MetaReader.java:400)
      at org.apache.hadoop.hbase.catalog.MetaReader.getServerUserRegions(MetaReader.java:591)
      at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:176)
      at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
      at java.lang.Thread.run(Thread.java:662)
      2011-09-03 18:13:27,550 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for 158-1-134-15,20020,1315065238916
      2011-09-03 18:13:27,566 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for ufdr,001146,1314955304624.22f6d43e78c903196f206881fc149488. so generated a random one; hri=ufdr,001146,1314955304624.22f6d43e78c903196f206881fc149488., src=, dest=158-1-132-17,20020,1315069441916; 31 (online=31, exclude=null) available servers
      201

        Activity

        gaojinchao added a comment -

        Thanks for your work, Ted.
        I would like the patch to go through review first, and then I will make a trunk patch. Running all the test cases takes about two hours.

        Hudson added a comment -

        Integrated in HBase-TRUNK #2196 (See https://builds.apache.org/job/HBase-TRUNK/2196/)
        HBASE-4340 Hbase can't balance if ServerShutdownHandler encountered
        exception (Jinchao Gao)

        tedyu :
        Files :

        • /hbase/trunk/CHANGES.txt
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
        Ted Yu added a comment -

        Patch for TRUNK has been integrated.
        Original patch integrated to 0.90 branch.

        Thanks for the patch Jinchao.

        gaojinchao added a comment -

        Yes, all test cases have passed.

        Ted Yu added a comment -

        The NPE happened on this line in MetaReader.java:

              final long startCode = Bytes.toLong(data.getValue(HConstants.CATALOG_FAMILY,
                  HConstants.STARTCODE_QUALIFIER));
        

        The patch looks reasonable since there is no action taken if hris is null.

        Have you tested the patch on a cluster, Jinchao?
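
        For illustration only, the failure mode described in this comment can be reproduced without HBase. In the sketch below, the row is modelled as a plain map, and `toLong`, `readStartCode`, and the `"startcode"` key are hypothetical stand-ins for `Bytes.toLong`, the MetaReader logic, and `HConstants.STARTCODE_QUALIFIER`. It assumes, as the stack trace suggests, that the value lookup returned null for a row without a start code. It is not the patch that was committed.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

public class StartCodeGuardSketch {
    // Hypothetical stand-in for decoding an 8-byte big-endian long from a cell value.
    // Like the call in the stack trace, it throws NullPointerException on a null input.
    static long toLong(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong();
    }

    // Returns the start code if the row carries a well-formed value, otherwise empty.
    static OptionalLong readStartCode(Map<String, byte[]> row) {
        byte[] value = row.get("startcode"); // null when the column is missing
        if (value == null || value.length != Long.BYTES) {
            return OptionalLong.empty();
        }
        return OptionalLong.of(toLong(value));
    }

    public static void main(String[] args) {
        Map<String, byte[]> complete = new HashMap<>();
        complete.put("startcode", ByteBuffer.allocate(Long.BYTES).putLong(1315068597979L).array());
        Map<String, byte[]> missing = new HashMap<>();

        System.out.println("complete row has start code: " + readStartCode(complete).isPresent());
        System.out.println("missing row has start code: " + readStartCode(missing).isPresent());
    }
}
```

        Checking for null before decoding lets the caller skip or report the incomplete row instead of aborting the whole shutdown event.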

        gaojinchao added a comment -

        I have made a patch. Please review.


          People

          • Assignee: gaojinchao
          • Reporter: gaojinchao
          • Votes: 0
          • Watchers: 2
