HBASE-5615

the master never does balance because of balancing the parent region

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.90.7
    • Fix Version/s: 0.90.7
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      The master never runs the balancer. When the master runs rebuildUserRegions(), it adds the split parent region to AssignmentManager#servers.
      If the balancer then chooses the parent region to move, the parent stays in RIT (regions in transition) forever, so the balancer is never executed again.

      1. 5615-trunk.txt
        0.7 kB
        Ted Yu
      2. HBASE-5615.patch
        0.7 kB
        xufeng
      3. HBASE-5615-90.patch
        0.7 kB
        xufeng
      4. NoPatched-surefire-report-5615-90.html
        201 kB
        xufeng
      5. Patched_surefire-report-5615-90.html
        201 kB
        xufeng

        Activity

        xufeng added a comment -

        In my cluster I found this issue.

        1. The balancer is never executed because:

        [2012-03-21 14:11:47,226] [DEBUG] [158-1-131-48:20000-BalancerChore] [org.apache.hadoop.hbase.master.HMaster 824] Not running balancer because 4 region(s) in transition: {3139250177b9c55fbce6856e2595b272=hbaseTable3,06640#000149,1332230348477.3139250177b9c55fbce6856e2595b272. state=PENDING_CLOSE, ts=1332339058374, 3d7698062c1ffaa288ffa4b0630205dd=hbaseTable,12284#000051,1332214163915.3d7698062c1ffaa288ffa4b0630205dd. st...
        

        2. Take 3139250177b9c55fbce6856e2595b272 as a sample to track.
        I found that it had been split:

        [2012-03-20 23:40:36,496] [INFO ] [regionserver20020.compactor] [org.apache.hadoop.hbase.regionserver.HRegion 563] Closed hbaseTable3,06640#000149,1332230348477.3139250177b9c55fbce6856e2595b272.
        [2012-03-20 23:40:38,469] [INFO ] [regionserver20020.compactor] [org.apache.hadoop.hbase.catalog.MetaEditor 85] Offlined parent region hbaseTable3,06640#000149,1332230348477.3139250177b9c55fbce6856e2595b272. in META
        [2012-03-20 23:40:39,755] [INFO ] [regionserver20020.compactor] [org.apache.hadoop.hbase.regionserver.CompactSplitThread 181] Region split, META updated, and report to master. Parent=hbaseTable3,06640#000149,1332230348477.3139250177b9c55fbce6856e2595b272., new regions: hbaseTable3,06640#000149,1332286834610.bf8baeae598db2a1e87dbd0a234d1539., hbaseTable3,06723#000707,1332286834610.64ccaffa46be50a5dbc41540006afcb6.. Split took 5sec
        

        3. Then the backup master became the active one. In the finishInitialization() logs I found this entry:
        [2012-03-21 11:41:46,692] [DEBUG] [master-158-1-131-48:20000] [org.apache.hadoop.hbase.master.handler.ServerShutdownHandler 348] Daughter hbaseTable3,06640#000149,1332286834610.bf8baeae598db2a1e87dbd0a234d1539. present

        4. So I am sure that the parent region (3139250177b9c55fbce6856e2595b272) is also still in the META table.

        5. If 3139250177b9c55fbce6856e2595b272 is in META, it is added to AssignmentManager#regions and AssignmentManager#servers when the master rebuilds the user regions.

        6. The balancer reads AssignmentManager#servers and picks 3139250177b9c55fbce6856e2595b272 to move:

        [2012-03-21 11:46:47,699] [INFO ] [158-1-131-48:20000-BalancerChore] [org.apache.hadoop.hbase.master.HMaster 849] balance hri=hbaseTable3,06640#000149,1332230348477.3139250177b9c55fbce6856e2595b272., src=158-1-131-48,20020,1331918756600, dest=158-1-130-11,20020,1331918756573
        

        7. The parent stays in RIT forever in the PENDING_CLOSE state, so the balancer is never executed:

        [2012-03-21 13:13:57,201] [WARN ] [PRI IPC Server handler 3 on 20020] [org.apache.hadoop.hbase.regionserver.HRegionServer 2211] Received close for region we are not serving; 3139250177b9c55fbce6856e2595b272
        
        [2012-03-21 11:55:55,638] [INFO ] [158-1-131-48:20000.timeoutMonitor] [org.apache.hadoop.hbase.master.AssignmentManager 2327] Regions in transition timed out:  hbaseTable3,06640#000149,1332230348477.3139250177b9c55fbce6856e2595b272. state=PENDING_CLOSE, ts=1332330775586
        [2012-03-21 11:55:55,639] [INFO ] [158-1-131-48:20000.timeoutMonitor] [org.apache.hadoop.hbase.master.AssignmentManager 2363] Region has been PENDING_CLOSE for too long, running forced unassign again on region=hbaseTable3,06640#000149,1332230348477.3139250177b9c55fbce6856e2595b272.
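
A minimal sketch of steps 5 to 7, written as a toy model rather than HBase code. The class `StuckParentSketch` and all of its members are hypothetical names invented for illustration, and the close protocol is reduced to a single method call (the real flow goes through RPC and more state). It only shows why a close request for a region that no region server is serving never clears the RIT entry, and why the balancer gate then stays shut.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Toy model of a region stuck in PENDING_CLOSE; hypothetical, not HBase code. */
public class StuckParentSketch {

  /** Regions the (hypothetical) region server is actually serving. */
  final Set<String> servedByRegionServer = new HashSet<>();

  /** Master-side regions in transition (RIT): encoded name -> state. */
  final Map<String, String> regionsInTransition = new LinkedHashMap<>();

  /** The balancer picks a region to move: mark it PENDING_CLOSE, then ask the RS to close it. */
  void balanceMove(String region) {
    regionsInTransition.put(region, "PENDING_CLOSE");
    sendClose(region);
  }

  /** The RS closes only regions it serves; only a real close clears the RIT entry. */
  void sendClose(String region) {
    if (servedByRegionServer.remove(region)) {
      regionsInTransition.remove(region);
    }
    // Otherwise: "Received close for region we are not serving", and the entry stays.
  }

  /** Timeout monitor: re-sends the close for every region still in transition. */
  void timeoutMonitor() {
    for (String region : new ArrayList<>(regionsInTransition.keySet())) {
      sendClose(region);
    }
  }

  /** Balancer gate: the balancer runs only when nothing is in transition. */
  boolean canRunBalancer() {
    return regionsInTransition.isEmpty();
  }

  public static void main(String[] args) {
    StuckParentSketch cluster = new StuckParentSketch();
    cluster.servedByRegionServer.add("daughterA");
    cluster.servedByRegionServer.add("daughterB");

    cluster.balanceMove("daughterA");   // a served region closes and leaves RIT
    System.out.println("after moving daughterA: " + cluster.canRunBalancer());

    cluster.balanceMove("parent");      // the split parent is served by nobody
    cluster.timeoutMonitor();           // retrying the close does not help
    System.out.println("after moving parent:    " + cluster.canRunBalancer());
  }
}
```

In this model the forced unassign retried by the timeout monitor can never succeed for the parent, which matches the repeated "PENDING_CLOSE for too long" entries in the log.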
        
        xufeng added a comment -

        I am using 0.90.
        BTW: I cannot compile the 0.90 branch locally with Maven. Is this a problem?

        The error log is:

        [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.0.2:compile (default-compile) on project hbase: Compilation failure
        [ERROR] /opt/xufeng/module/hbase/host_java/src/HBASE_ONLINE/src/main/java/org/apache/hadoop/hbase/master/HMaster.java:[1121,22] cannot find symbol
        [ERROR] symbol  : class ServerName
        [ERROR] location: class org.apache.hadoop.hbase.master.HMaster
        
        xufeng added a comment -

        This is my patch. I will reproduce the problem and validate the patch.

        Can anyone review it and give me some suggestions?

        Ted Yu added a comment -

        @Xufeng:
        Can you clarify which region server the log in step 2 was collected from?
        Is it 158-1-131-48?

        xufeng added a comment -

        The log in step 2 is from 158-1-131-48,20020,1331918756600.

        ramkrishna.s.vasudevan added a comment -

        +1 on patch for 0.90. Is this present in other versions also?

        Ted Yu added a comment -

        +1 as well.

        Please fix indentation:

               String tableName = regionInfo.getTableDesc().getNameAsString();
        +	  if (regionInfo.isOffline() && regionInfo.isSplit()) continue;
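
For readers without the patch at hand, here is a minimal, self-contained sketch of what a guard like this does when a server-to-regions map is rebuilt from META rows. `SplitParentFilterSketch`, `Region` and `rebuildServers` are hypothetical stand-ins, not the HBase classes; only the condition (offline and split) is taken from the line quoted above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical model of rebuilding the servers map from META; not HBase code. */
public class SplitParentFilterSketch {

  /** Stand-in for a META row: a region, the server META lists for it, and its flags. */
  static final class Region {
    final String encodedName;
    final String server;
    final boolean offline;
    final boolean split;

    Region(String encodedName, String server, boolean offline, boolean split) {
      this.encodedName = encodedName;
      this.server = server;
      this.offline = offline;
      this.split = split;
    }
  }

  /** Builds server -> region names, optionally skipping offlined split parents. */
  static Map<String, Set<String>> rebuildServers(List<Region> metaRows, boolean skipSplitParents) {
    Map<String, Set<String>> servers = new TreeMap<>();
    for (Region region : metaRows) {
      // Same condition as the patch line: an offlined, split parent is not a live region.
      if (skipSplitParents && region.offline && region.split) continue;
      servers.computeIfAbsent(region.server, k -> new TreeSet<>()).add(region.encodedName);
    }
    return servers;
  }

  public static void main(String[] args) {
    List<Region> meta = Arrays.asList(
        new Region("parent", "rs1", true, true),
        new Region("daughterA", "rs1", false, false),
        new Region("daughterB", "rs1", false, false));
    System.out.println("without guard: " + rebuildServers(meta, false));
    System.out.println("with guard:    " + rebuildServers(meta, true));
  }
}
```

With the guard enabled, the parent never reaches the map the balancer reads, so it cannot be chosen as a region to move.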
        
        xufeng added a comment -

        I reproduced this issue on 0.90.
        To hit this issue, META must hold the parent region info for a long time. So before testing, I commented out this code in the region server class:

          public void postOpenDeployTasks(final HRegion r, final CatalogTracker ct,
              final boolean daughter)
          throws KeeperException, IOException {
            // Do checks to see if we need to compact (references or too many files)
            /*if (r.hasReferences() || r.hasTooManyStoreFiles()) {
              getCompactionRequester().requestCompaction(r,
                r.hasReferences()? "Region has references on open" :
                  "Region has too many store files");
            }*/
        

        step1: Start a cluster that has two masters and one regionserver process.
        step2: Create a table and put some data into it.
        step3: Split the table from the shell.
        step4: Kill the active master.
        step5: After the backup master becomes the active one, start another regionserver process.
        result: The issue occurs.

        I also tested my patch many times and it works.

        xufeng added a comment -

        @Zhihong Yu
        I updated the patch.
        This patch is for the 0.90 version.

        @Rama
        I will check the TRUNK and 0.92 version.

        xufeng added a comment -

        Submitting the unit test results for the 0.90 patch.

        There are some errors in the results after patching.

        But those errors also appear without the patch.

        gaojinchao added a comment -

        +1

        Ted Yu added a comment -

        Integrated to 0.90 branch.

        Thanks for the patch Xufeng.

        Thanks for the review Ramkrishna and Jinchao.

        Patch for TRUNK to follow

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12519805/5615-trunk.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.io.hfile.TestForceCacheImportantBlocks
        org.apache.hadoop.hbase.mapreduce.TestImportTsv
        org.apache.hadoop.hbase.mapred.TestTableMapReduce
        org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1298//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1298//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1298//console

        This message is automatically generated.

        Ted Yu added a comment -

        Integrated to 0.92, 0.94 and TRUNK as well.

        xufeng added a comment -

        Thanks for the help, Ramkrishna, Jinchao and Ted.

        Ted Yu added a comment -

        @Xufeng:
        You're welcome.

        In the future, please grant license to Apache when you attach patches.

        Hudson added a comment -

        Integrated in HBase-TRUNK #2695 (See https://builds.apache.org/job/HBase-TRUNK/2695/)
        HBASE-5615 the master never does balance because of balancing the parent region (Xufeng) (Revision 1305171)

        Result = FAILURE
        tedyu :
        Files :

        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        Hudson added a comment -

        Integrated in HBase-0.94 #55 (See https://builds.apache.org/job/HBase-0.94/55/)
        HBASE-5615 the master never does balance because of balancing the parent region (Xufeng) (Revision 1305172)

        Result = SUCCESS
        tedyu :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        Hudson added a comment -

        Integrated in HBase-0.92 #339 (See https://builds.apache.org/job/HBase-0.92/339/)
        HBASE-5615 the master never does balance because of balancing the parent region (Xufeng) (Revision 1305173)

        Result = FAILURE
        tedyu :
        Files :

        • /hbase/branches/0.92/CHANGES.txt
        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        Hudson added a comment -

        Integrated in HBase-TRUNK-security #150 (See https://builds.apache.org/job/HBase-TRUNK-security/150/)
        HBASE-5615 the master never does balance because of balancing the parent region (Xufeng) (Revision 1305171)

        Result = FAILURE
        tedyu :
        Files :

        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        Hudson added a comment -

        Integrated in HBase-0.94-security #3 (See https://builds.apache.org/job/HBase-0.94-security/3/)
        HBASE-5615 the master never does balance because of balancing the parent region (Xufeng) (Revision 1305172)

        Result = ABORTED
        tedyu :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        Hudson added a comment -

        Integrated in HBase-0.92-security #104 (See https://builds.apache.org/job/HBase-0.92-security/104/)
        HBASE-5615 the master never does balance because of balancing the parent region (Xufeng) (Revision 1305173)

        Result = FAILURE
        tedyu :
        Files :

        • /hbase/branches/0.92/CHANGES.txt
        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        ramkrishna.s.vasudevan added a comment -

        The fix for this issue is not valid for the 0.92 and above branches, as they use ZK for splitting.

        ramkrishna.s.vasudevan added a comment - - edited

        @Xufeng

        You wrote to @Rama: "I will check the TRUNK and 0.92 version."

        Did you check trunk and 0.92?

        Lars Hofhansl added a comment -

        @Ram: Are you saying we should revert from 0.92, 0.94, and 0.96?

        Lars Hofhansl added a comment -

        The patch will definitely do no harm in 0.92+ (although it might be ineffective)

        ramkrishna.s.vasudevan added a comment -

        @Lars
        I suspect the following problem.
        The region split is done using ZK. Suppose the master goes down just after the RS has updated the region's split node to RS_SPLIT and before the master could respond.
        When the master comes back up, it expects the RIT to be populated from the servers map.
        But with this patch the servers map is not populated with the parent, so the RIT is not populated either. The master cannot process the SPLIT node, and the RS keeps saying the master has yet to process it. I feel this has to be reverted in all the 0.92+ branches.
        Correct me if I am wrong. 0.90 is fine.

        Ted Yu added a comment -

        RS_SPLIT above refers to RS_ZK_REGION_SPLIT, right ?

        regionsInTransition isn't filled by rebuildUserRegions(). If you can provide a test case for the scenario described above, that would help me understand.

        Lars Hofhansl added a comment -

        @Ram: I believe you, but I don't follow.
        Since you have an opinion and I don't know much about this, let's revert from 0.92+ until we know more.

        ramkrishna.s.vasudevan added a comment -

        RS_SPLIT above refers to RS_ZK_REGION_SPLIT, right ?

        Yes.
        I will explain the problem once again, this time with the code.
        The patch makes the following change to rebuildUserRegions(), which runs on master startup:

              if (regionInfo.isOffline() && regionInfo.isSplit()) continue;
        

        Take the case where the RS was splitting a region. In SplitTransaction:

        try {
                this.znodeVersion = transitionNodeSplit(server.getZooKeeper(),
                  parent.getRegionInfo(), a.getRegionInfo(), b.getRegionInfo(),
                  server.getServerName(), this.znodeVersion);
        
                int spins = 0;
        

        After the above step, the RS waits for the master to respond to this change in the znode:

        if (spins % 10 == 0) {
                    LOG.debug("Still waiting on the master to process the split for " +
                        this.parent.getRegionInfo().getEncodedName());
                  }
                  Thread.sleep(100);
                  // When this returns -1 it means the znode doesn't exist
                  this.znodeVersion = tickleNodeSplit(server.getZooKeeper(),
                    parent.getRegionInfo(), a.getRegionInfo(), b.getRegionInfo(),
                    server.getServerName(), this.znodeVersion);
        

        But the master has gone down, so the RS keeps waiting here.
        Now on the master side: when the master comes up, it tries to rebuild all the existing regions and their corresponding servers in AM.rebuildUserRegions():

                if (false == checkIfRegionBelongsToDisabled(regionInfo)
                    && false == checkIfRegionsBelongsToEnabling(regionInfo)) {
                  regions.put(regionInfo, regionLocation);
                  addToServers(regionLocation, regionInfo);
                }
        

        As the region has already been split, the current fix in this patch does not let execution reach the step above where addToServers is called.
        Meanwhile the RS keeps tickling the node in the SPLIT state, which the master handles in AM.handleRegion:

        case RS_ZK_REGION_SPLIT:
                  // RegionState must be null, or SPLITTING or PENDING_CLOSE.
                  if (!isInStateForSplitting(regionState)) break;
                  // If null, add SPLITTING state before going to SPLIT
                  if (regionState == null) {
                    regionState = addSplittingToRIT(sn, encodedName);
        

        We see that the current regionState is null, as there is no entry in RIT for the split region. As the regionState is null, we first try to get the RIT populated:

         private HRegionInfo findHRegionInfo(final ServerName sn,
              final String encodedName) {
            if (!this.serverManager.isServerOnline(sn)) return null;
            Set<HRegionInfo> hris = this.servers.get(sn);
            HRegionInfo foundHri = null;
            for (HRegionInfo hri: hris) {
              if (hri.getEncodedName().equals(encodedName)) {
                foundHri = hri;
                break;
              }
            }
            return foundHri;
          }
        

        But the servers map does not have this region, so the lookup always returns null and the master will not process the SPLIT.
        I reverted the patch and was able to overcome the problem. We need to make the fix for the 0.92+ branches with these scenarios in mind.
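
The scenario described in this comment can be condensed into a small, hypothetical model. `SplitNodeLookupSketch`, `findRegion` and `handleSplitEvent` are illustrative names only; `findRegion` is a simplified stand-in for the `findHRegionInfo` lookup quoted above, and the handler assumes, as this comment describes, that the SPLITTING state can only be recorded once that lookup succeeds.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of the master handling a SPLIT znode after a restart; not HBase code. */
public class SplitNodeLookupSketch {

  /** Simplified lookup: search one server's region set for the encoded name. */
  static String findRegion(Map<String, Set<String>> servers, String server, String encodedName) {
    Set<String> hosted = servers.get(server);
    if (hosted == null) return null;
    return hosted.contains(encodedName) ? encodedName : null;
  }

  /**
   * Returns the RIT state the master could record for the parent, or null when the
   * parent cannot be found in the servers map and the split event goes unprocessed.
   */
  static String handleSplitEvent(Map<String, Set<String>> servers, String server, String parent) {
    return findRegion(servers, server, parent) == null ? null : "SPLITTING";
  }

  public static void main(String[] args) {
    // Servers map rebuilt without the guard: the parent is present.
    Map<String, Set<String>> withParent = new HashMap<>();
    withParent.computeIfAbsent("rs1", k -> new HashSet<>()).add("parent");

    // Servers map rebuilt with the guard: the offlined split parent was skipped.
    Map<String, Set<String>> parentSkipped = new HashMap<>();
    parentSkipped.put("rs1", new HashSet<>());

    System.out.println("parent in servers map: " + handleSplitEvent(withParent, "rs1", "parent"));
    System.out.println("parent skipped:        " + handleSplitEvent(parentSkipped, "rs1", "parent"));
  }
}
```

In the second case the handler has nothing to put into RIT, which corresponds to the RS tickling the znode indefinitely while the master never acknowledges the split.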

        Show
        ramkrishna.s.vasudevan added a comment - RS_SPLIT above refers to RS_ZK_REGION_SPLIT, right ? Yes. I will explain the problem first once again with the code The patch does the following change while rebuildUserRegions on master startup. if (regionInfo.isOffline() && regionInfo.isSplit()) continue ; Take the case where the RS was splitting a region. In SplitTransaction try { this .znodeVersion = transitionNodeSplit(server.getZooKeeper(), parent.getRegionInfo(), a.getRegionInfo(), b.getRegionInfo(), server.getServerName(), this .znodeVersion); int spins = 0; After doing the above step RS waits for the master to respond for this change done in znode. if (spins % 10 == 0) { LOG.debug( "Still waiting on the master to process the split for " + this .parent.getRegionInfo().getEncodedName()); } Thread .sleep(100); // When this returns -1 it means the znode doesn't exist this .znodeVersion = tickleNodeSplit(server.getZooKeeper(), parent.getRegionInfo(), a.getRegionInfo(), b.getRegionInfo(), server.getServerName(), this .znodeVersion); But the master had gone down. So RS will keep waiting here Now in master side when master comes up, the master tries to form all the existing regions and their corresponding servers in AM.rebuildUserRegions() if ( false == checkIfRegionBelongsToDisabled(regionInfo) && false == checkIfRegionsBelongsToEnabling(regionInfo)) { regions.put(regionInfo, regionLocation); addToServers(regionLocation, regionInfo); } As the region was already splitted the current fix in this patch will not allow me to continue to the above step where i addToServers. Now because the RS keeps on tickling the node to SPLIT state in AM.handleRegion case RS_ZK_REGION_SPLIT: // RegionState must be null , or SPLITTING or PENDING_CLOSE. 
if (!isInStateForSplitting(regionState)) break ; // If null , add SPLITTING state before going to SPLIT if (regionState == null ) { regionState = addSplittingToRIT(sn, encodedName); We see that the current regionState is null as no entry is present in RIT for that splitted region. As the regionState is null we fist try to get the RIT populated private HRegionInfo findHRegionInfo( final ServerName sn, final String encodedName) { if (! this .serverManager.isServerOnline(sn)) return null ; Set<HRegionInfo> hris = this .servers.get(sn); HRegionInfo foundHri = null ; for (HRegionInfo hri: hris) { if (hri.getEncodedName().equals(encodedName)) { foundHri = hri; break ; } } return foundHri; } But my servers map doesnot have this region. So it will always be null and master will not process the SPLIT. I reverted the patch and i was able to overcome the problem . We need to make the fix for 0.92+ branches considering these scenarios.
        ramkrishna.s.vasudevan added a comment -

        I will try to come up with a patch.

        Ted Yu added a comment -

        Thanks for the finding, Ram.

        Reverted from 0.92, 0.94 and trunk.

        Hudson added a comment -

        Integrated in HBase-TRUNK #2719 (See https://builds.apache.org/job/HBase-TRUNK/2719/)
        HBASE-5615 revert due to race condition in case master dies (Revision 1310324)

        Result = SUCCESS
        tedyu :
        Files :

        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        Hudson added a comment -

        Integrated in HBase-0.94 #92 (See https://builds.apache.org/job/HBase-0.94/92/)
        HBASE-5615 revert due to race condition in case master dies (Revision 1310322)

        Result = SUCCESS
        tedyu :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        Hudson added a comment -

        Integrated in HBase-0.92 #359 (See https://builds.apache.org/job/HBase-0.92/359/)
        HBASE-5615 revert due to race condition in case master dies (Revision 1310321)

        Result = SUCCESS
        tedyu :
        Files :

        • /hbase/branches/0.92/CHANGES.txt
        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        Lars Hofhansl added a comment -

        Thanks for the explanation Ram!
        Now, do you think this is even a problem in 0.92+, or can we close this issue?

        Hudson added a comment -

        Integrated in HBase-TRUNK-security #161 (See https://builds.apache.org/job/HBase-TRUNK-security/161/)
        HBASE-5615 revert due to race condition in case master dies (Revision 1310324)

        Result = FAILURE
        tedyu :
        Files :

        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        Hudson added a comment -

        Integrated in HBase-0.94-security #8 (See https://builds.apache.org/job/HBase-0.94-security/8/)
        HBASE-5615 revert due to race condition in case master dies (Revision 1310322)

        Result = SUCCESS
        tedyu :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        Lars Hofhansl added a comment -

        I'm a bit confused about whether this is an issue for 0.92+ or not.
        Ram's argument seems to imply that it is not.

        ramkrishna.s.vasudevan added a comment -

        @Lars
        Sorry for not being able to comment on this issue over the weekend.
        The issue is present in 0.92+ but the fix is not this. We may have to work on the correct fix for this.

        Lars Hofhansl added a comment -

        Thanks Ram... I'll hold up 0.94 for this (unless you think that unnecessary).

        ramkrishna.s.vasudevan added a comment -

        @Lars
        I feel we need not hold up 0.94 for this. I also need some more time to work on a proper fix. At least for now, reverting the fix will not affect the scenarios that are already handled, which looks clean.

        Lars Hofhansl added a comment -

        Moving to 0.94.1 at Ram's recommendation.

        Hudson added a comment -

        Integrated in HBase-0.92-security #105 (See https://builds.apache.org/job/HBase-0.92-security/105/)
        HBASE-5615 revert due to race condition in case master dies (Revision 1310321)

        Result = FAILURE
        tedyu :
        Files :

        • /hbase/branches/0.92/CHANGES.txt
        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        ramkrishna.s.vasudevan added a comment -

        Resolving this issue as it is fixed in 0.90.
        The fix for other versions has been reverted, and the same has been taken care of in HBASE-5806.


          People

          • Assignee:
            xufeng
            Reporter:
            xufeng
          • Votes:
            0
            Watchers:
            6
