HBase
  1. HBase
  2. HBASE-6389

Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 0.94.0, 0.95.2
    • Fix Version/s: 0.94.3, 0.95.0
    • Component/s: master
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      Hide
      Reverts the cluster startup behavior to pre 0.94.0.

      With this, Master will wait until "hbase.master.wait.on.regionservers.mintostart" number of Region Servers have registered with it before it starts region assignment. The default value of this setting is 1.

      In large clusters with thousands of regions you may want to increase this to a higher number which is sufficient to handle the task of opening those region in parallel.

      If left to the default, at times, the Master could assign all regions to a single Region Server which will result in slow startup and in worst case could OOM the Region Server (some time resulting in META inconsistency).

      Here is how it works now (from the javadoc):

      We wait until one of these condition is met:
       - the master is stopped
       - the 'hbase.master.wait.on.regionservers.maxtostart' number of region servers is reached
       - the 'hbase.master.wait.on.regionservers.mintostart' is reached AND there have been no new region server in for 'hbase.master.wait.on.regionservers.interval' time AND the 'hbase.master.wait.on.regionservers.timeout' is reached
      Show
      Reverts the cluster startup behavior to pre 0.94.0. With this, Master will wait until "hbase.master.wait.on.regionservers.mintostart" number of Region Servers have registered with it before it starts region assignment. The default value of this setting is 1. In large clusters with thousands of regions you may want to increase this to a higher number which is sufficient to handle the task of opening those region in parallel. If left to the default, at times, the Master could assign all regions to a single Region Server which will result in slow startup and in worst case could OOM the Region Server (some time resulting in META inconsistency). Here is how it works now (from the javadoc): We wait until one of these condition is met:  - the master is stopped  - the 'hbase.master.wait.on.regionservers.maxtostart' number of region servers is reached  - the 'hbase.master.wait.on.regionservers.mintostart' is reached AND there have been no new region server in for 'hbase.master.wait.on.regionservers.interval' time AND the 'hbase.master.wait.on.regionservers.timeout' is reached

      Description

      Continuing from HBASE-6375.

      It seems I was mistaken in my assumption that changing the value of "hbase.master.wait.on.regionservers.mintostart" to a sufficient number (from default of 1) can help prevent assignment of all regions to one (or a small number of) region server(s).

      While this was the case in 0.90.x and 0.92.x, the behavior has changed in 0.94.0 onwards to address HBASE-4993.

      From 0.94.0 onwards, Master will proceed immediately after the timeout has lapsed, even if "hbase.master.wait.on.regionservers.mintostart" has not reached.

      Reading the current conditions of waitForRegionServers() clarifies it

      ServerManager.java (trunk rev:1360470)
      ....
      581	  /**
      582	   * Wait for the region servers to report in.
      583	   * We will wait until one of this condition is met:
      584	   *  - the master is stopped
      585	   *  - the 'hbase.master.wait.on.regionservers.timeout' is reached
      586	   *  - the 'hbase.master.wait.on.regionservers.maxtostart' number of
      587	   *    region servers is reached
      588	   *  - the 'hbase.master.wait.on.regionservers.mintostart' is reached AND
      589	   *   there have been no new region server in for
      590	   *      'hbase.master.wait.on.regionservers.interval' time
      591	   *
      592	   * @throws InterruptedException
      593	   */
      594	  public void waitForRegionServers(MonitoredTask status)
      595	  throws InterruptedException {
      ....
      ....
      612	    while (
      613	      !this.master.isStopped() &&
      614	        slept < timeout &&
      615	        count < maxToStart &&
      616	        (lastCountChange+interval > now || count < minToStart)
      617	      ){
      ....
      

      So with the current conditions, the wait will end as soon as timeout is reached even lesser number of RS have checked-in with the Master and the master will proceed with the region assignment among these RSes alone.

      As mentioned in HBASE-4993, and I concur, this could have disastrous effect in large cluster especially now that MSLAB is turned on.

      To enforce the required quorum as specified by "hbase.master.wait.on.regionservers.mintostart" irrespective of timeout, these conditions need to be modified as following

      ServerManager.java
      ..
        /**
         * Wait for the region servers to report in.
         * We will wait until one of this condition is met:
         *  - the master is stopped
         *  - the 'hbase.master.wait.on.regionservers.maxtostart' number of
         *    region servers is reached
         *  - the 'hbase.master.wait.on.regionservers.mintostart' is reached AND
         *   there have been no new region server in for
         *      'hbase.master.wait.on.regionservers.interval' time AND
         *   the 'hbase.master.wait.on.regionservers.timeout' is reached
         *
         * @throws InterruptedException
         */
        public void waitForRegionServers(MonitoredTask status)
      ..
      ..
          int minToStart = this.master.getConfiguration().
          getInt("hbase.master.wait.on.regionservers.mintostart", 1);
          int maxToStart = this.master.getConfiguration().
          getInt("hbase.master.wait.on.regionservers.maxtostart", Integer.MAX_VALUE);
          if (maxToStart < minToStart) {
            maxToStart = minToStart;
          }
      ..
      ..
          while (
            !this.master.isStopped() &&
              count < maxToStart &&
              (lastCountChange+interval > now || timeout > slept || count < minToStart)
            ){
      ..
      
      1. HBASE-6389_trunk.patch
        3 kB
        Aditya Kishore
      2. HBASE-6389_trunk.patch
        5 kB
        Aditya Kishore
      3. HBASE-6389_trunk.patch
        5 kB
        Aditya Kishore
      4. org.apache.hadoop.hbase.TestZooKeeper-output.txt
        120 kB
        Ted Yu
      5. testReplication.jstack
        204 kB
        Ted Yu
      6. HBASE-6389_trunk_v2.patch
        11 kB
        Aditya Kishore
      7. HBASE-6389_trunk_v2.patch
        11 kB
        Aditya Kishore
      8. HBASE-6389_0.94.patch
        9 kB
        Aditya Kishore

        Activity

        Hide
        Himanshu Vashishtha added a comment -

        StackWhat is the reason for having the "incompatible change" flag?

        Show
        Himanshu Vashishtha added a comment - Stack What is the reason for having the "incompatible change" flag?
        Hide
        Hudson added a comment -

        Integrated in HBase-0.94-security #82 (See https://builds.apache.org/job/HBase-0.94-security/82/)
        HBASE-6389 Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments (Aditya Kishore) (Revision 1405440)

        Result = SUCCESS
        larsh :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenMasterInitializing.java
        Show
        Hudson added a comment - Integrated in HBase-0.94-security #82 (See https://builds.apache.org/job/HBase-0.94-security/82/ ) HBASE-6389 Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments (Aditya Kishore) (Revision 1405440) Result = SUCCESS larsh : Files : /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenMasterInitializing.java
        Hide
        Hudson added a comment -

        Integrated in HBase-0.94-security-on-Hadoop-23 #9 (See https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/9/)
        HBASE-6389 Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments (Aditya Kishore) (Revision 1405440)

        Result = FAILURE
        larsh :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenMasterInitializing.java
        Show
        Hudson added a comment - Integrated in HBase-0.94-security-on-Hadoop-23 #9 (See https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/9/ ) HBASE-6389 Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments (Aditya Kishore) (Revision 1405440) Result = FAILURE larsh : Files : /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenMasterInitializing.java
        Hide
        Hudson added a comment -

        Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #248 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/248/)
        HBASE-6389 Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments (Revision 1405111)

        Result = FAILURE
        stack :
        Files :

        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/TestZooKeeper.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterNoCluster.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenMasterInitializing.java
        Show
        Hudson added a comment - Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #248 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/248/ ) HBASE-6389 Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments (Revision 1405111) Result = FAILURE stack : Files : /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/TestZooKeeper.java /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterNoCluster.java /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenMasterInitializing.java
        Hide
        Hudson added a comment -

        Integrated in HBase-0.94 #571 (See https://builds.apache.org/job/HBase-0.94/571/)
        HBASE-6389 Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments (Aditya Kishore) (Revision 1405440)

        Result = SUCCESS
        larsh :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenMasterInitializing.java
        Show
        Hudson added a comment - Integrated in HBase-0.94 #571 (See https://builds.apache.org/job/HBase-0.94/571/ ) HBASE-6389 Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments (Aditya Kishore) (Revision 1405440) Result = SUCCESS larsh : Files : /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenMasterInitializing.java
        Hide
        Lars Hofhansl added a comment -

        Committed to 0.94 as well.

        Show
        Lars Hofhansl added a comment - Committed to 0.94 as well.
        Hide
        Lars Hofhansl added a comment -

        I'm going to commit the 0.94 patch.

        Show
        Lars Hofhansl added a comment - I'm going to commit the 0.94 patch.
        Hide
        stack added a comment -

        I added a note on this config. to the reference guide under required configurations. Also updated the release note to include Adhitya's clean javadoc excerpt on how this now works.

        Show
        stack added a comment - I added a note on this config. to the reference guide under required configurations. Also updated the release note to include Adhitya's clean javadoc excerpt on how this now works.
        Hide
        Aditya Kishore added a comment -

        Thanks Stack and Ted for reviewing the changes.

        Show
        Aditya Kishore added a comment - Thanks Stack and Ted for reviewing the changes.
        Hide
        stack added a comment -

        Committed to trunk. Thanks for the clarifications Aditya (and the fix). Lars, you want this for 0.94.3? I marked it as an 'incompatible change' but this is a bug fix, and the behavior now more rational.

        Show
        stack added a comment - Committed to trunk. Thanks for the clarifications Aditya (and the fix). Lars, you want this for 0.94.3? I marked it as an 'incompatible change' but this is a bug fix, and the behavior now more rational.
        Hide
        Aditya Kishore added a comment -

        I guess you meant: if the minimum number has not reached

        No, timeout applies in conjunction with the minimum number of RS. As the java-doc describes, the wait ends as soon as even one of the 3 conditions are met. The first 2 are straightforward. The last one is a conjunction of 3 conditions and all of them must become true for the overall condition to become true and for wait to end.

        Which means (timeout has elapsed) AND (quorum of RS has registered) AND (no new RS has has registered in last 1.5 seconds).

        Show
        Aditya Kishore added a comment - I guess you meant: if the minimum number has not reached No, timeout applies in conjunction with the minimum number of RS. As the java-doc describes, the wait ends as soon as even one of the 3 conditions are met. The first 2 are straightforward. The last one is a conjunction of 3 conditions and all of them must become true for the overall condition to become true and for wait to end. Which means (timeout has elapsed) AND (quorum of RS has registered) AND (no new RS has has registered in last 1.5 seconds).
        Hide
        Ted Yu added a comment -

        The wait timeout comes into effect only if the minimum number has reached within the timeout period.

        I guess you meant: if the minimum number has not reached

        Show
        Ted Yu added a comment - The wait timeout comes into effect only if the minimum number has reached within the timeout period. I guess you meant: if the minimum number has not reached
        Hide
        Aditya Kishore added a comment -

        Master waits infinitely until the minimum number of Region Servers have cheked-in.

        This protects against overloading a small number of RSes with the initial region assignment. With the default of 1, most of the cluster will start assignment after 4.5 seconds (again default) even if a single RS has registered. But this setting gives an option to delay the assignments in large clusters until a good number of RSes have registered.

        The wait timeout comes into effect only if the minimum number has reached within the timeout period.

        I have updated the conditions describing the behavior in the Java-doc as well.

          /**
           * Wait for the region servers to report in.
           * We will wait until one of these condition is met:
           *  - the master is stopped
           *  - the 'hbase.master.wait.on.regionservers.maxtostart' number of
           *    region servers is reached
           *  - the 'hbase.master.wait.on.regionservers.mintostart' is reached AND
           *   there have been no new region server in for
           *      'hbase.master.wait.on.regionservers.interval' time AND
           *   the 'hbase.master.wait.on.regionservers.timeout' is reached
           *
           * @throws InterruptedException
           */
        
        Show
        Aditya Kishore added a comment - Master waits infinitely until the minimum number of Region Servers have cheked-in. This protects against overloading a small number of RSes with the initial region assignment. With the default of 1, most of the cluster will start assignment after 4.5 seconds (again default) even if a single RS has registered. But this setting gives an option to delay the assignments in large clusters until a good number of RSes have registered. The wait timeout comes into effect only if the minimum number has reached within the timeout period. I have updated the conditions describing the behavior in the Java-doc as well. /** * Wait for the region servers to report in. * We will wait until one of these condition is met: * - the master is stopped * - the 'hbase.master.wait.on.regionservers.maxtostart' number of * region servers is reached * - the 'hbase.master.wait.on.regionservers.mintostart' is reached AND * there have been no new region server in for * 'hbase.master.wait.on.regionservers.interval' time AND * the 'hbase.master.wait.on.regionservers.timeout' is reached * * @ throws InterruptedException */
        Hide
        stack added a comment -

        Aditya Kishore So, what happens on startup now (I don't recall how it worked way back in 0.92!). We wait for how long for the minimum servers to check in? Once that period has elapsed, we start the assign to whoever has checked in? Thanks (I can take care of making a note in important configurations section, np).

        Show
        stack added a comment - Aditya Kishore So, what happens on startup now (I don't recall how it worked way back in 0.92!). We wait for how long for the minimum servers to check in? Once that period has elapsed, we start the assign to whoever has checked in? Thanks (I can take care of making a note in important configurations section, np).
        Hide
        Lars Hofhansl added a comment -

        I do not know this area well.
        ramkrishna.s.vasudevan Wanna have a look too?

        Show
        Lars Hofhansl added a comment - I do not know this area well. ramkrishna.s.vasudevan Wanna have a look too?
        Hide
        Ted Yu added a comment -

        Hadoop QA only checks patch for trunk.

        Show
        Ted Yu added a comment - Hadoop QA only checks patch for trunk.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12551509/HBASE-6389_0.94.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 9 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3199//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12551509/HBASE-6389_0.94.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 9 new or modified tests. -1 patch . The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3199//console This message is automatically generated.
        Hide
        Aditya Kishore added a comment -

        Not sure how to submit it for 0.94 builds. Ted, could you please help me out here.

        Show
        Aditya Kishore added a comment - Not sure how to submit it for 0.94 builds. Ted, could you please help me out here.
        Hide
        Aditya Kishore added a comment -

        Attaching patch for 0.94 branch. The patch passes full test suit on my test machine.

        Show
        Aditya Kishore added a comment - Attaching patch for 0.94 branch. The patch passes full test suit on my test machine.
        Hide
        Ted Yu added a comment -

        +1 on patch.

        Show
        Ted Yu added a comment - +1 on patch.
        Hide
        Aditya Kishore added a comment -

        And yes, I think this is important for 0.94 branch as well judging from the number of instances I have seen this impacting some of our large deployments which were running with the default value of 1 in both 0.90 and 0.92.

        The configuration too needs to be part of "Important configuration". Will file separate JIRA for doc update, if you agree.

        Show
        Aditya Kishore added a comment - And yes, I think this is important for 0.94 branch as well judging from the number of instances I have seen this impacting some of our large deployments which were running with the default value of 1 in both 0.90 and 0.92. The configuration too needs to be part of "Important configuration". Will file separate JIRA for doc update, if you agree.
        Hide
        Aditya Kishore added a comment -

        stack

        Have added the release note describing the behavior change, kindly review.

        I have tested it in following scenario while setting the min value to higher than 1 (mostly 2):

        0. Normal cluster startup.
        1. Cluster start with 1 master and 1 region server and delaying the start of other RSes by 10s of seconds.
        2. Same as 1 but with 2 masters and then failover the master once the cluster is up and running.
        3. Same as 2 but with stopping the RSes while failing over the Master and bringing only 1 RS back while other joining later.
        4. Some random Master/RS failover.

        Show
        Aditya Kishore added a comment - stack Have added the release note describing the behavior change, kindly review. I have tested it in following scenario while setting the min value to higher than 1 (mostly 2): 0. Normal cluster startup. 1. Cluster start with 1 master and 1 region server and delaying the start of other RSes by 10s of seconds. 2. Same as 1 but with 2 masters and then failover the master once the cluster is up and running. 3. Same as 2 but with stopping the RSes while failing over the Master and bringing only 1 RS back while other joining later. 4. Some random Master/RS failover.
        Hide
        stack added a comment -

        Aditya Kishore Patch looks good Aditya. Mind describing in release note how this patch changes startup? What kind of testing do you do sir? I can commit to trunk. Should it go into 0.94 too?

        Show
        stack added a comment - Aditya Kishore Patch looks good Aditya. Mind describing in release note how this patch changes startup? What kind of testing do you do sir? I can commit to trunk. Should it go into 0.94 too?
        Hide
        Aditya Kishore added a comment -
        Show
        Aditya Kishore added a comment - The test failure is unrelated and most likely is due to the same reason as observed here https://issues.apache.org/jira/browse/HBASE-6707?focusedCommentId=13469675&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13469665
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12549763/HBASE-6389_trunk_v2.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 15 new or modified tests.

        +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

        -1 javadoc. The javadoc tool appears to have generated 82 warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 4 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.backup.example.TestZooKeeperTableArchiveClient

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3083//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3083//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3083//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3083//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3083//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3083//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3083//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12549763/HBASE-6389_trunk_v2.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 15 new or modified tests. +1 hadoop2.0 . The patch compiles against the hadoop 2.0 profile. -1 javadoc . The javadoc tool appears to have generated 82 warning messages. +1 javac . The applied patch does not increase the total number of javac compiler warnings. -1 findbugs . The patch appears to introduce 4 new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. -1 core tests . The patch failed these unit tests: org.apache.hadoop.hbase.backup.example.TestZooKeeperTableArchiveClient Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3083//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3083//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3083//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3083//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3083//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3083//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3083//console This message is automatically generated.
        Hide
        Aditya Kishore added a comment -

        Resubmitting the patch with some minor changes.

        Show
        Aditya Kishore added a comment - Resubmitting the patch with some minor changes.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12549692/HBASE-6389_trunk_v2.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 15 new or modified tests.

        +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

        -1 javadoc. The javadoc tool appears to have generated 82 warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in .

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3076//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3076//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3076//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3076//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3076//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3076//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3076//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12549692/HBASE-6389_trunk_v2.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 15 new or modified tests. +1 hadoop2.0 . The patch compiles against the hadoop 2.0 profile. -1 javadoc . The javadoc tool appears to have generated 82 warning messages. +1 javac . The applied patch does not increase the total number of javac compiler warnings. -1 findbugs . The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3076//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3076//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3076//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3076//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3076//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3076//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3076//console This message is automatically generated.
        Hide
        Aditya Kishore added a comment -

        The new patch addresses test failures in couple of test cases (including TestReplication, TestZookeeper).

        Also, since the config params are used in quite a few places, I have moved them as a public constant.

        Please note that there is no change in the logic of the fix. This only fixes few bugs in test cases which surfaced due to the fix.

        I'll submit the review request later during the day.

        Show
        Aditya Kishore added a comment - The new patch addresses test failures in couple of test cases (including TestReplication, TestZookeeper). Also, since the config params are used in quite a few places, I have moved them as a public constant. Please note that there is no change in the logic of the fix. This only fixes few bugs in test cases which surfaced due to the fix. I'll submit the review request later during the day.
        Hide
        Lars Hofhansl added a comment -

        I am removing this from 0.94. We can still aim for 0.96.
        Pull back if you disagree.

        Show
        Lars Hofhansl added a comment - I am removing this from 0.94. We can still aim for 0.96. Pull back if you disagree.
        Hide
        Hudson added a comment -

        Integrated in HBase-0.94-security-on-Hadoop-23 #6 (See https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/6/)
        HBASE-6389 revert (Revision 1363193)
        HBASE-6389 Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments (Aditya Kishore) (Revision 1361458)

        Result = FAILURE
        larsh :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenMasterInitializing.java

        larsh :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenMasterInitializing.java
        Show
        Hudson added a comment - Integrated in HBase-0.94-security-on-Hadoop-23 #6 (See https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/6/ ) HBASE-6389 revert (Revision 1363193) HBASE-6389 Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments (Aditya Kishore) (Revision 1361458) Result = FAILURE larsh : Files : /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenMasterInitializing.java larsh : Files : /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenMasterInitializing.java
        Hide
        Hudson added a comment -

        Integrated in HBase-0.94-security #44 (See https://builds.apache.org/job/HBase-0.94-security/44/)
        HBASE-6389 revert (Revision 1363193)

        Result = FAILURE
        larsh :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenMasterInitializing.java
        Show
        Hudson added a comment - Integrated in HBase-0.94-security #44 (See https://builds.apache.org/job/HBase-0.94-security/44/ ) HBASE-6389 revert (Revision 1363193) Result = FAILURE larsh : Files : /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenMasterInitializing.java
        Hide
        stack added a comment -

        @Lars Its up to you (But since you asked, fine by me ... I like what you are doing though Aditya... thanks for the help).

        Show
        stack added a comment - @Lars Its up to you (But since you asked, fine by me ... I like what you are doing though Aditya... thanks for the help).
        Hide
        Aditya Kishore added a comment -

        BTW what do Ri, C and Fi represent in the formula above ?

        'n' is the number of tables in the cluster, Ri is the number of regions and CFi is the number of column families in table 'i' [1].

        1. MSLAB is ON by default

        Show
        Aditya Kishore added a comment - BTW what do Ri, C and Fi represent in the formula above ? 'n' is the number of tables in the cluster, R i is the number of regions and CF i is the number of column families in table 'i' [1] . 1. MSLAB is ON by default
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12537286/testReplication.jstack
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 12 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2416//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12537286/testReplication.jstack against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 12 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2416//console This message is automatically generated.
        Hide
        Ted Yu added a comment -

        jstack for the hanging TestReplication

        Show
        Ted Yu added a comment - jstack for the hanging TestReplication
        Hide
        Ted Yu added a comment - - edited

        I ran test suite with latest patch on trunk and got:

        Failed tests:   testRunThriftServer[12](org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine): expected:<1> but was:<0>
          testRunThriftServer[14](org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine): expected:<1> but was:<0>
          testRunThriftServer[15](org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine): expected:<1> but was:<0>
          testRunThriftServer[16](org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine): expected:<1> but was:<0>
          testRunThriftServer[17](org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine): expected:<1> but was:<0>
        
        Tests in error:
          testRegionCaching(org.apache.hadoop.hbase.client.TestHCM): org.apache.hadoop.hbase.UnknownRegionException: bd992463917ba68fe5389c5bf9e94a3a
          testCloseRegionThatFetchesTheHRIFromMeta(org.apache.hadoop.hbase.client.TestAdmin): -1
          testTableExists(org.apache.hadoop.hbase.catalog.TestMetaReaderEditor): org.apache.hadoop.hbase.TableNotEnabledException: testTableExists
          testRunThriftServer[11](org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine): test timed out after 60000 milliseconds
          testRunThriftServer[13](org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine): test timed out after 60000 milliseconds
        

        There was one hanging test:

        	at org.apache.hadoop.hbase.replication.TestReplication.setUp(TestReplication.java:183)
        

        BTW what do Ri, C and Fi represent in the formula above ?

        Show
        Ted Yu added a comment - - edited I ran test suite with latest patch on trunk and got: Failed tests: testRunThriftServer[12](org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine): expected:<1> but was:<0> testRunThriftServer[14](org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine): expected:<1> but was:<0> testRunThriftServer[15](org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine): expected:<1> but was:<0> testRunThriftServer[16](org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine): expected:<1> but was:<0> testRunThriftServer[17](org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine): expected:<1> but was:<0> Tests in error: testRegionCaching(org.apache.hadoop.hbase.client.TestHCM): org.apache.hadoop.hbase.UnknownRegionException: bd992463917ba68fe5389c5bf9e94a3a testCloseRegionThatFetchesTheHRIFromMeta(org.apache.hadoop.hbase.client.TestAdmin): -1 testTableExists(org.apache.hadoop.hbase.catalog.TestMetaReaderEditor): org.apache.hadoop.hbase.TableNotEnabledException: testTableExists testRunThriftServer[11](org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine): test timed out after 60000 milliseconds testRunThriftServer[13](org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine): test timed out after 60000 milliseconds There was one hanging test: at org.apache.hadoop.hbase.replication.TestReplication.setUp(TestReplication.java:183) BTW what do R i , C and F i represent in the formula above ?
        Hide
        Lars Hofhansl added a comment -

        didn't pick up on the "was"

        Show
        Lars Hofhansl added a comment - didn't pick up on the "was"
        Hide
        Aditya Kishore added a comment -

        @Lars

        Completely agree and definitely would not want to hold 0.94.1 for this. (That's why "My vote was... ).

        Documentation can take care of this in 0.94.1

        Show
        Aditya Kishore added a comment - @Lars Completely agree and definitely would not want to hold 0.94.1 for this. (That's why "My vote was ... ). Documentation can take care of this in 0.94.1
        Hide
        Lars Hofhansl added a comment - - edited

        @Aditya: I do agree. (see my comment about how I'm sure the logic of this change is correct).

        It now seems, though, that it is the default timeout that is too short (4.5s).
        Folks with 5k regions should know to increase the minToStart parameter and the timeout. We should document that better.
        I can also see to change the timeout to failure condition (as discussed above).

        I'm not opposed. It's just that 0.94.1 needs to go out because of HBASE-6311, I do not want to risk delaying this further. It also seems this can use further discussion.
        (Sometimes it is amazing how much discussion a two line change can cause )

        @Ted and @Stack: What do you guys think?

        Edit: Spelling.

        Show
        Lars Hofhansl added a comment - - edited @Aditya: I do agree. (see my comment about how I'm sure the logic of this change is correct). It now seems, though, that it is the default timeout that is too short (4.5s). Folks with 5k regions should know to increase the minToStart parameter and the timeout. We should document that better. I can also see to change the timeout to failure condition (as discussed above). I'm not opposed. It's just that 0.94.1 needs to go out because of HBASE-6311 , I do not want to risk delaying this further. It also seems this can use further discussion. (Sometimes it is amazing how much discussion a two line change can cause ) @Ted and @Stack: What do you guys think? Edit: Spelling.
        Hide
        Aditya Kishore added a comment -

        My vote was for its inclusion for 2 reasons.

        1. This was a behavior change in 0.94.0 and I am not sure we have completely understood its impact.
        2. In a large MSLAB enabled cluster, I have repeatedly seen all the regions (in excess of 5K with Σi=1..n(RiCFi) > 8K; with MSLAB on, RS needs > 16G just to open) being assigned to a single region server leading it to OOM crash and creating quite a few HBCK inconsistencies on subsequent recovery.

        Lastly, so far all the test failures seems to be due to errors in the test code unmasked by this change.

        Show
        Aditya Kishore added a comment - My vote was for its inclusion for 2 reasons. This was a behavior change in 0.94.0 and I am not sure we have completely understood its impact. In a large MSLAB enabled cluster, I have repeatedly seen all the regions (in excess of 5K with Σ i=1..n ( R i CF i ) > 8K; with MSLAB on, RS needs > 16G just to open) being assigned to a single region server leading it to OOM crash and creating quite a few HBCK inconsistencies on subsequent recovery. Lastly, so far all the test failures seems to be due to errors in the test code unmasked by this change.
        Hide
        Lars Hofhansl added a comment -

        I'd like to leave this with 0.94.2. Unless you think this must go into 0.94.1

        Show
        Lars Hofhansl added a comment - I'd like to leave this with 0.94.2. Unless you think this must go into 0.94.1
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12537258/org.apache.hadoop.hbase.TestZooKeeper-output.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 10 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2415//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12537258/org.apache.hadoop.hbase.TestZooKeeper-output.txt against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 10 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2415//console This message is automatically generated.
        Hide
        stack added a comment -

        @Aditya Makes sense. You got what you needed from Ted? Let us know. Thanks.

        Show
        stack added a comment - @Aditya Makes sense. You got what you needed from Ted? Let us know. Thanks.
        Hide
        Ted Yu added a comment -

        Looking at https://builds.apache.org/job/PreCommit-HBASE-Build/2406/console, there was still some hanging test although I wasn't able to find which test hung.

        Show
        Ted Yu added a comment - Looking at https://builds.apache.org/job/PreCommit-HBASE-Build/2406/console , there was still some hanging test although I wasn't able to find which test hung.
        Hide
        Aditya Kishore added a comment -

        @Stack

        No, the current patch does not modify the way a live RS is evaluated, but it ensures that the dying RS's thread is actually dead before moving forward.

        What is the below changing doing?

        conf.setInt("hbase.master.wait.on.regionservers.mintostart", numSlaves);
        conf.setInt("hbase.master.wait.on.regionservers.maxtostart", numSlaves);
        + String count = String.valueOf(numSlaves);
        + conf.setIfUnset("hbase.master.wait.on.regionservers.mintostart", count);
        + conf.setIfUnset("hbase.master.wait.on.regionservers.maxtostart", count);

        This change was to preserve the values of 'mintostart' and 'maxtostart' in the configuration if the caller of HBaseTestingUtility.startMiniHBaseCluster(int, int) has set them (which was the case with TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS failure).

        Show
        Aditya Kishore added a comment - @Stack No, the current patch does not modify the way a live RS is evaluated, but it ensures that the dying RS's thread is actually dead before moving forward. What is the below changing doing? conf.setInt("hbase.master.wait.on.regionservers.mintostart", numSlaves); conf.setInt("hbase.master.wait.on.regionservers.maxtostart", numSlaves); + String count = String.valueOf(numSlaves); + conf.setIfUnset("hbase.master.wait.on.regionservers.mintostart", count); + conf.setIfUnset("hbase.master.wait.on.regionservers.maxtostart", count); This change was to preserve the values of 'mintostart' and 'maxtostart' in the configuration if the caller of HBaseTestingUtility.startMiniHBaseCluster(int, int) has set them (which was the case with TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS failure).
        Hide
        Ted Yu added a comment -

        Here was the test output from yesterday.

        Show
        Ted Yu added a comment - Here was the test output from yesterday.
        Hide
        Aditya Kishore added a comment -

        Unfortunately, even after repeated attempts, I am not able to fail the test after applying the last patch. But I do have a theory.

        Could you please test the last patch once again with debug logging enabled and send me the log.

        Show
        Aditya Kishore added a comment - Unfortunately, even after repeated attempts, I am not able to fail the test after applying the last patch. But I do have a theory. Could you please test the last patch once again with debug logging enabled and send me the log.
        Hide
        stack added a comment -

        @Aditya Thanks for the debugging that went into your figuring the above.

        The precondition of each test evaluates and ensure than a minimum of two region servers are online (by testing if their threads are "alive" and not by testing their ZK node or connecting to it).

        Does this patch change how we evaluate "alive" regionservers? (If not should, given your debug above, it seems like a good change for HTU).

        What is the below changing doing?

        
        
        • conf.setInt("hbase.master.wait.on.regionservers.mintostart", numSlaves);
        • conf.setInt("hbase.master.wait.on.regionservers.maxtostart", numSlaves);
          + String count = String.valueOf(numSlaves);
          + conf.setIfUnset("hbase.master.wait.on.regionservers.mintostart", count);
          + conf.setIfUnset("hbase.master.wait.on.regionservers.maxtostart", count);
          {code]

        Thanks.

        Show
        stack added a comment - @Aditya Thanks for the debugging that went into your figuring the above. The precondition of each test evaluates and ensure than a minimum of two region servers are online (by testing if their threads are "alive" and not by testing their ZK node or connecting to it). Does this patch change how we evaluate "alive" regionservers? (If not should, given your debug above, it seems like a good change for HTU). What is the below changing doing? conf.setInt("hbase.master.wait.on.regionservers.mintostart", numSlaves); conf.setInt("hbase.master.wait.on.regionservers.maxtostart", numSlaves); + String count = String.valueOf(numSlaves); + conf.setIfUnset("hbase.master.wait.on.regionservers.mintostart", count); + conf.setIfUnset("hbase.master.wait.on.regionservers.maxtostart", count); {code] Thanks.
        Hide
        Ted Yu added a comment -

        Thanks for your explanation.

        Have you seen the test failure that I described above @ 19/Jul/12 03:34 ?

        Show
        Ted Yu added a comment - Thanks for your explanation. Have you seen the test failure that I described above @ 19/Jul/12 03:34 ?
        Hide
        Aditya Kishore added a comment -

        @Ted

        The last patch address the exact issue listed at https://issues.apache.org/jira/browse/HBASE-6406?focusedCommentId=13417665&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13417665

        What had happened is that in most test runs, the test testRegionServerSessionExpired() get launched before testMasterSessionExpired() or testMasterZKSessionRecoveryFailure(). This test testRegionServerSessionExpired() brings down one of the two region servers but this RS is not dead by the time testMasterSessionExpired() or testMasterZKSessionRecoveryFailure() starts.

        The precondition of each test evaluates and ensure than a minimum of two region servers are online (by testing if their threads are "alive" and not by testing their ZK node or connecting to it).

        So while one of the RS is shutting itself down (and its thread is still alive), and either of testMasterSessionExpired() or testMasterZKSessionRecoveryFailure() could start because the test precondition is satisfied. However, both of these test cases result in Master recovery and reinitialization which actually attempts to check for the quorum of 2 Online region servers and since there is only one region server online at this point, the initialization fails with timeout and the master is killed.

        By this time the dying region server's thread is dead and the precondition of the next test sees that it needs to create one region server. But since no master is running at this point, the newly created region server's run thread gets blocked in HRegionServer.blockAndCheckIfStopped() and the RS does not come online. As a result the test thread which is waiting for the RS to come online keeps waiting which is why you see the test hung in setup().

        My last patch ensured that the dying RS is completely stopped before testRegionServerSessionExpired() completes so that the subsequent tests' precondition does not get fooled into thinking that the minimum server count is met and start the testcase.

        Show
        Aditya Kishore added a comment - @Ted The last patch address the exact issue listed at https://issues.apache.org/jira/browse/HBASE-6406?focusedCommentId=13417665&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13417665 What had happened is that in most test runs, the test testRegionServerSessionExpired() get launched before testMasterSessionExpired() or testMasterZKSessionRecoveryFailure(). This test testRegionServerSessionExpired() brings down one of the two region servers but this RS is not dead by the time testMasterSessionExpired() or testMasterZKSessionRecoveryFailure() starts. The precondition of each test evaluates and ensure than a minimum of two region servers are online (by testing if their threads are "alive" and not by testing their ZK node or connecting to it). So while one of the RS is shutting itself down (and its thread is still alive), and either of testMasterSessionExpired() or testMasterZKSessionRecoveryFailure() could start because the test precondition is satisfied. However, both of these test cases result in Master recovery and reinitialization which actually attempts to check for the quorum of 2 Online region servers and since there is only one region server online at this point, the initialization fails with timeout and the master is killed. By this time the dying region server's thread is dead and the precondition of the next test sees that it needs to create one region server. But since no master is running at this point, the newly created region server's run thread gets blocked in HRegionServer.blockAndCheckIfStopped() and the RS does not come online. As a result the test thread which is waiting for the RS to come online keeps waiting which is why you see the test hung in setup(). My last patch ensured that the dying RS is completely stopped before testRegionServerSessionExpired() completes so that the subsequent tests' precondition does not get fooled into thinking that the minimum server count is met and start the testcase.
        Hide
        Hudson added a comment -

        Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #99 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/99/)
        HBASE-6389 Revert, the patch breaks TestZooKeeper (Revision 1363179)

        Result = FAILURE
        tedyu :
        Files :

        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenMasterInitializing.java
        Show
        Hudson added a comment - Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #99 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/99/ ) HBASE-6389 Revert, the patch breaks TestZooKeeper (Revision 1363179) Result = FAILURE tedyu : Files : /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenMasterInitializing.java
        Hide
        Nicolas Liochon added a comment -

        What's the general thought here? Reapply the patch? The timeout after the patch is not really a timeout any longer, so at the very least the config option is misnamed now. Or in different words: What is the point of this timeout now?

        After some thinking about it:

        • if you want the exact same behavior as proposed by the patch you wan set hbase.master.wait.on.regionservers.timeout and hbase.master.wait.on.regionservers.interval to a high value
        • if you want to fix the problem mentioned in this patch we could solve it by throwing an exception when hbase.master.wait.on.regionservers.timeout is reached. This would require to increase the default as Aditya said it above.

        On a side note, I wonder what would be the impact when you're loosing a large part of your servers. Someone setting this value to its exact number of nodes will be disappointed.

        Show
        Nicolas Liochon added a comment - What's the general thought here? Reapply the patch? The timeout after the patch is not really a timeout any longer, so at the very least the config option is misnamed now. Or in different words: What is the point of this timeout now? After some thinking about it: if you want the exact same behavior as proposed by the patch you wan set hbase.master.wait.on.regionservers.timeout and hbase.master.wait.on.regionservers.interval to a high value if you want to fix the problem mentioned in this patch we could solve it by throwing an exception when hbase.master.wait.on.regionservers.timeout is reached. This would require to increase the default as Aditya said it above. On a side note, I wonder what would be the impact when you're loosing a large part of your servers. Someone setting this value to its exact number of nodes will be disappointed.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12537127/HBASE-6389_trunk.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 9 new or modified tests.

        +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        -1 javac. The applied patch generated 5 javac compiler warnings (more than the trunk's current 4 warnings).

        -1 findbugs. The patch appears to introduce 12 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2406//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2406//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2406//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2406//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2406//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2406//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2406//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12537127/HBASE-6389_trunk.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile. +1 javadoc. The javadoc tool did not generate any warning messages. -1 javac. The applied patch generated 5 javac compiler warnings (more than the trunk's current 4 warnings). -1 findbugs. The patch appears to introduce 12 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2406//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2406//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2406//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2406//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2406//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2406//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2406//console This message is automatically generated.
        Hide
        Ted Yu added a comment -

        It has been 12 minutes since I started running TestZooKeeper based on latest patch.
        Here is the tail of jstack:

        "main" prio=5 tid=102801000 nid=0x100601000 in Object.wait() [1005fe000]
           java.lang.Thread.State: WAITING (on object monitor)
        	at java.lang.Object.wait(Native Method)
        	- waiting on <78ebfee68> (a org.apache.hadoop.hbase.util.JVMClusterUtil$RegionServerThread)
        	at java.lang.Thread.join(Thread.java:1210)
        	- locked <78ebfee68> (a org.apache.hadoop.hbase.util.JVMClusterUtil$RegionServerThread)
        	at java.lang.Thread.join(Thread.java:1263)
        	at org.apache.hadoop.hbase.LocalHBaseCluster.waitOnRegionServer(LocalHBaseCluster.java:262)
        	at org.apache.hadoop.hbase.MiniHBaseCluster.waitOnRegionServer(MiniHBaseCluster.java:285)
        	at org.apache.hadoop.hbase.TestZooKeeper.testRegionServerSessionExpired(TestZooKeeper.java:201)
        

        Please take a look at:
        https://issues.apache.org/jira/browse/HBASE-6406?focusedCommentId=13417665&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13417665

        See if your finding can explain that symptom.

        Show
        Ted Yu added a comment - It has been 12 minutes since I started running TestZooKeeper based on latest patch. Here is the tail of jstack: "main" prio=5 tid=102801000 nid=0x100601000 in Object .wait() [1005fe000] java.lang. Thread .State: WAITING (on object monitor) at java.lang. Object .wait(Native Method) - waiting on <78ebfee68> (a org.apache.hadoop.hbase.util.JVMClusterUtil$RegionServerThread) at java.lang. Thread .join( Thread .java:1210) - locked <78ebfee68> (a org.apache.hadoop.hbase.util.JVMClusterUtil$RegionServerThread) at java.lang. Thread .join( Thread .java:1263) at org.apache.hadoop.hbase.LocalHBaseCluster.waitOnRegionServer(LocalHBaseCluster.java:262) at org.apache.hadoop.hbase.MiniHBaseCluster.waitOnRegionServer(MiniHBaseCluster.java:285) at org.apache.hadoop.hbase.TestZooKeeper.testRegionServerSessionExpired(TestZooKeeper.java:201) Please take a look at: https://issues.apache.org/jira/browse/HBASE-6406?focusedCommentId=13417665&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13417665 See if your finding can explain that symptom.
        Hide
        Lars Hofhansl added a comment -

        What's the general thought here? Reapply the patch?
        The timeout after the patch is not really a timeout any longer, so at the very least the config option is misnamed now.
        Or in different words: What is the point of this timeout now?

        Show
        Lars Hofhansl added a comment - What's the general thought here? Reapply the patch? The timeout after the patch is not really a timeout any longer, so at the very least the config option is misnamed now. Or in different words: What is the point of this timeout now?
        Hide
        Ted Yu added a comment - - edited

        @Aditya:
        Can you remove the sentence in Release Notes ?

        Thanks

        Show
        Ted Yu added a comment - - edited @Aditya: Can you remove the sentence in Release Notes ? Thanks
        Hide
        Aditya Kishore added a comment -

        As suspected, the failure was result of a partial shutdown of region server expired in the previous test.

        The region server thread of this expired server was still alive when the next test started which lead the test pre-condition to believe that enough number of RSes are available.

        The new patch take care of the problem.

        Show
        Aditya Kishore added a comment - As suspected, the failure was result of a partial shutdown of region server expired in the previous test. The region server thread of this expired server was still alive when the next test started which lead the test pre-condition to believe that enough number of RSes are available. The new patch take care of the problem.
        Hide
        Lars Hofhansl added a comment -

        I'm now not even sure we need to change this.
        One can already control this by setting the timeout to a large value (minutes) and then set minToStart to the desired value... That's what the timeout is for anyway.

        Show
        Lars Hofhansl added a comment - I'm now not even sure we need to change this. One can already control this by setting the timeout to a large value (minutes) and then set minToStart to the desired value... That's what the timeout is for anyway.
        Hide
        Lars Hofhansl added a comment -

        And... We actually got the first successful test run since July 13th!

        Show
        Lars Hofhansl added a comment - And... We actually got the first successful test run since July 13th!
        Hide
        Hudson added a comment -

        Integrated in HBase-0.94 #337 (See https://builds.apache.org/job/HBase-0.94/337/)
        HBASE-6389 revert (Revision 1363193)

        Result = SUCCESS
        larsh :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenMasterInitializing.java
        Show
        Hudson added a comment - Integrated in HBase-0.94 #337 (See https://builds.apache.org/job/HBase-0.94/337/ ) HBASE-6389 revert (Revision 1363193) Result = SUCCESS larsh : Files : /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenMasterInitializing.java
        Hide
        Lars Hofhansl added a comment -

        The change is logically no longer in 0.94.1. Moving to 0.94.2.

        Show
        Lars Hofhansl added a comment - The change is logically no longer in 0.94.1. Moving to 0.94.2.
        Hide
        Lars Hofhansl added a comment -

        Reverted from 0.94.
        Let's regroup.

        Show
        Lars Hofhansl added a comment - Reverted from 0.94. Let's regroup.
        Hide
        Lars Hofhansl added a comment -

        I think this patch indeed needs more discussion. If does not want the timeout one can in fact just increase it, and hence force the HMaster to wait for the minimum count.

        I'll revert in 0.94 and sink the RC.

        Show
        Lars Hofhansl added a comment - I think this patch indeed needs more discussion. If does not want the timeout one can in fact just increase it, and hence force the HMaster to wait for the minimum count. I'll revert in 0.94 and sink the RC.
        Hide
        Hudson added a comment -

        Integrated in HBase-TRUNK #3145 (See https://builds.apache.org/job/HBase-TRUNK/3145/)
        HBASE-6389 Revert, the patch breaks TestZooKeeper (Revision 1363179)

        Result = FAILURE
        tedyu :
        Files :

        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenMasterInitializing.java
        Show
        Hudson added a comment - Integrated in HBase-TRUNK #3145 (See https://builds.apache.org/job/HBase-TRUNK/3145/ ) HBASE-6389 Revert, the patch breaks TestZooKeeper (Revision 1363179) Result = FAILURE tedyu : Files : /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenMasterInitializing.java
        Hide
        Aditya Kishore added a comment -

        I am looking at the test failure and so far it looks like error in test condition than the patch code, will update as soon as I have something more.

        Show
        Aditya Kishore added a comment - I am looking at the test failure and so far it looks like error in test condition than the patch code, will update as soon as I have something more.
        Hide
        Lars Hofhansl added a comment -

        Trunk build 3126 was on July 14th. We also had no successful 0.94 since then.
        So while I do not think this patch is incorrect, I agree it should be reverted for now to work out why it causes these problems.

        Show
        Lars Hofhansl added a comment - Trunk build 3126 was on July 14th. We also had no successful 0.94 since then. So while I do not think this patch is incorrect, I agree it should be reverted for now to work out why it causes these problems.
        Hide
        Lars Hofhansl added a comment -

        The change is sound, though. I looked at the logic again and I can see no flaw in it.
        Maybe the tests expect a certain (incorrect) behavior?

        Show
        Lars Hofhansl added a comment - The change is sound, though. I looked at the logic again and I can see no flaw in it. Maybe the tests expect a certain (incorrect) behavior?
        Hide
        Ted Yu added a comment -

        Reverted trunk patch.

        Have not touched 0.94 branch yet.

        Show
        Ted Yu added a comment - Reverted trunk patch. Have not touched 0.94 branch yet.
        Hide
        Ted Yu added a comment -

        Patch for 0.94 wasn't attached here.

        @Lars:
        Can you revert the patches ?

        Thanks

        Show
        Ted Yu added a comment - Patch for 0.94 wasn't attached here. @Lars: Can you revert the patches ? Thanks
        Hide
        Lars Hofhansl added a comment -

        Thanks Ted. Revert the patch? Sink the RC?

        Show
        Lars Hofhansl added a comment - Thanks Ted. Revert the patch? Sink the RC?
        Hide
        Ted Yu added a comment -

        After reverting the patch, test passed smoothly:

        Running org.apache.hadoop.hbase.TestZooKeeper
        Tests run: 11, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 48.678 sec
        
        Results :
        
        Tests run: 11, Failures: 0, Errors: 0, Skipped: 0
        
        [INFO] 
        [INFO] --- maven-surefire-plugin:2.10:test (secondPartTestsExecution) @ hbase-server ---
        [INFO] Tests are skipped.
        [INFO] ------------------------------------------------------------------------
        [INFO] BUILD SUCCESS
        [INFO] ------------------------------------------------------------------------
        [INFO] Total time: 58.563s
        
        Show
        Ted Yu added a comment - After reverting the patch, test passed smoothly: Running org.apache.hadoop.hbase.TestZooKeeper Tests run: 11, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 48.678 sec Results : Tests run: 11, Failures: 0, Errors: 0, Skipped: 0 [INFO] [INFO] --- maven-surefire-plugin:2.10:test (secondPartTestsExecution) @ hbase-server --- [INFO] Tests are skipped. [INFO] ------------------------------------------------------------------------ [INFO] BUILD SUCCESS [INFO] ------------------------------------------------------------------------ [INFO] Total time: 58.563s
        Hide
        Ted Yu added a comment -

        I tried to see why TestZooKeeper hung strangely:

        2012-07-18 14:05:59,533 DEBUG [pool-57-thread-1] zookeeper.ZKUtil(1142): master:52861-0x1389be8bd6e0000-0x1389be8bd6e000a-0x1389be8bd6e000b Retrieved 39 byte(s) of data from znode /hbase/root-region-server and set watcher; X.ebay.com,44052,1342645522433
        
        2012-07-18 14:05:59,533 WARN  [pool-52-thread-1] zookeeper.RecoverableZooKeeper(218): Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/root-region-server
        
        2012-07-18 14:05:59,533 INFO  [pool-52-thread-1] util.RetryCounter(55): Sleeping 2000ms before retry #1...
        
        2012-07-18 14:05:59,536 INFO  [main] ipc.HBaseRpcMetrics(66): Initializing RPC Metrics with hostName=MiniHBaseCluster$MiniHBaseClusterRegionServer, port=44030
        
        2012-07-18 14:05:59,537 INFO  [Master:0;X.ebay.com,52861,1342645522110] master.HMaster(455): HMaster main thread exiting
        

        Basically the test hung in setup().

        I then traced where TestZooKeeper stopped showing up in test result and this was the first URL giving me 404 error:

        https://builds.apache.org/view/G-L/view/HBase/job/HBase-TRUNK/3126/testReport/org.apache.hadoop.hbase/TestZooKeeper/

        That was when this patch went in.

        Show
        Ted Yu added a comment - I tried to see why TestZooKeeper hung strangely: 2012-07-18 14:05:59,533 DEBUG [pool-57-thread-1] zookeeper.ZKUtil(1142): master:52861-0x1389be8bd6e0000-0x1389be8bd6e000a-0x1389be8bd6e000b Retrieved 39 byte (s) of data from znode /hbase/root-region-server and set watcher; X.ebay.com,44052,1342645522433 2012-07-18 14:05:59,533 WARN [pool-52-thread-1] zookeeper.RecoverableZooKeeper(218): Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/root-region-server 2012-07-18 14:05:59,533 INFO [pool-52-thread-1] util.RetryCounter(55): Sleeping 2000ms before retry #1... 2012-07-18 14:05:59,536 INFO [main] ipc.HBaseRpcMetrics(66): Initializing RPC Metrics with hostName=MiniHBaseCluster$MiniHBaseClusterRegionServer, port=44030 2012-07-18 14:05:59,537 INFO [Master:0;X.ebay.com,52861,1342645522110] master.HMaster(455): HMaster main thread exiting Basically the test hung in setup(). I then traced where TestZooKeeper stopped showing up in test result and this was the first URL giving me 404 error: https://builds.apache.org/view/G-L/view/HBase/job/HBase-TRUNK/3126/testReport/org.apache.hadoop.hbase/TestZooKeeper/ That was when this patch went in.
        Hide
        Hudson added a comment -

        Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #94 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/94/)
        HBASE-6389 Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments (Aditya Kishore) (Revision 1361456)

        Result = FAILURE
        larsh :
        Files :

        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenMasterInitializing.java
        Show
        Hudson added a comment - Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #94 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/94/ ) HBASE-6389 Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments (Aditya Kishore) (Revision 1361456) Result = FAILURE larsh : Files : /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenMasterInitializing.java
        Hide
        Hudson added a comment -

        Integrated in HBase-TRUNK #3126 (See https://builds.apache.org/job/HBase-TRUNK/3126/)
        HBASE-6389 Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments (Aditya Kishore) (Revision 1361456)

        Result = FAILURE
        larsh :
        Files :

        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenMasterInitializing.java
        Show
        Hudson added a comment - Integrated in HBase-TRUNK #3126 (See https://builds.apache.org/job/HBase-TRUNK/3126/ ) HBASE-6389 Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments (Aditya Kishore) (Revision 1361456) Result = FAILURE larsh : Files : /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenMasterInitializing.java
        Hide
        Hudson added a comment -

        Integrated in HBase-0.94 #316 (See https://builds.apache.org/job/HBase-0.94/316/)
        HBASE-6389 Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments (Aditya Kishore) (Revision 1361458)

        Result = FAILURE
        larsh :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenMasterInitializing.java
        Show
        Hudson added a comment - Integrated in HBase-0.94 #316 (See https://builds.apache.org/job/HBase-0.94/316/ ) HBASE-6389 Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments (Aditya Kishore) (Revision 1361458) Result = FAILURE larsh : Files : /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenMasterInitializing.java
        Hide
        Hudson added a comment -

        Integrated in HBase-0.94-security #41 (See https://builds.apache.org/job/HBase-0.94-security/41/)
        HBASE-6389 Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments (Aditya Kishore) (Revision 1361458)

        Result = FAILURE
        larsh :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenMasterInitializing.java
        Show
        Hudson added a comment - Integrated in HBase-0.94-security #41 (See https://builds.apache.org/job/HBase-0.94-security/41/ ) HBASE-6389 Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments (Aditya Kishore) (Revision 1361458) Result = FAILURE larsh : Files : /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenMasterInitializing.java
        Hide
        Lars Hofhansl added a comment -

        Committed to 0.94 and 0.96

        Show
        Lars Hofhansl added a comment - Committed to 0.94 and 0.96
        Hide
        Lars Hofhansl added a comment -

        Ran TestHLogRecordReader locally. Passes fine (I did not expect that to be related to this patch).

        Show
        Lars Hofhansl added a comment - Ran TestHLogRecordReader locally. Passes fine (I did not expect that to be related to this patch).
        Hide
        Lars Hofhansl added a comment -

        +1 on last patch.
        If there are no objections I'll commit this to 0.94 and 0.96.

        Let's discuss the failure after timeout idea in a different jira.

        Show
        Lars Hofhansl added a comment - +1 on last patch. If there are no objections I'll commit this to 0.94 and 0.96. Let's discuss the failure after timeout idea in a different jira.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12536453/HBASE-6389_trunk.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified tests.

        +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        -1 javac. The applied patch generated 5 javac compiler warnings (more than the trunk's current 4 warnings).

        -1 findbugs. The patch appears to introduce 8 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.mapreduce.TestHLogRecordReader

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2384//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2384//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2384//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2384//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12536453/HBASE-6389_trunk.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile. +1 javadoc. The javadoc tool did not generate any warning messages. -1 javac. The applied patch generated 5 javac compiler warnings (more than the trunk's current 4 warnings). -1 findbugs. The patch appears to introduce 8 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.mapreduce.TestHLogRecordReader Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2384//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2384//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2384//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2384//console This message is automatically generated.
        Hide
        Aditya Kishore added a comment -

        The test failure were result of masked error in test code which this change brought out.

        There were two such errors.

        1. The function org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster() was overriding the value of 'mintostart' and 'maxtostart' with a single value, even if the caller has set them explicitly.
        2. org.apache.hadoop.hbase.regionserver.TestRSKilledWhenMasterInitializing did not set these values even though it kills one RS during master initialization.

        The attached patch fixes these two.

        Show
        Aditya Kishore added a comment - The test failure were result of masked error in test code which this change brought out. There were two such errors. The function org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster() was overriding the value of 'mintostart' and 'maxtostart' with a single value, even if the caller has set them explicitly. org.apache.hadoop.hbase.regionserver.TestRSKilledWhenMasterInitializing did not set these values even though it kills one RS during master initialization. The attached patch fixes these two.
        Hide
        Aditya Kishore added a comment -

        I like the idea of treating timeout as error case and if we do decide on that, two things need to be taken care of.

        1. The current default timeout of 4.5 sec may not be appropriate and may require upward revision (to the tune of few minutes), and
        2. The master would need to do a cluster shutdown including other standby masters, otherwise each standby master may continue after the previous one has given up. In the worst case scenario of this case, if somehow 'minToStart' number of RSes join the last master, the cluster may be left with no standby master.

        For this JIRA, I would like to revert to the original behavior (until 0.92) of Master of waiting for 'minToStart' number of RSes.

        Show
        Aditya Kishore added a comment - I like the idea of treating timeout as error case and if we do decide on that, two things need to be taken care of. The current default timeout of 4.5 sec may not be appropriate and may require upward revision (to the tune of few minutes), and The master would need to do a cluster shutdown including other standby masters, otherwise each standby master may continue after the previous one has given up. In the worst case scenario of this case, if somehow 'minToStart' number of RSes join the last master, the cluster may be left with no standby master. For this JIRA, I would like to revert to the original behavior (until 0.92) of Master of waiting for 'minToStart' number of RSes.
        Hide
        Nicolas Liochon added a comment -

        We could remove the timeout? That would make things a little simpler.
        Or we could keep it as an error case, and throw an exception if the timeout is reached. The intend would be to stop the master.

        Show
        Nicolas Liochon added a comment - We could remove the timeout? That would make things a little simpler. Or we could keep it as an error case, and throw an exception if the timeout is reached. The intend would be to stop the master.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12536324/HBASE-6389_trunk.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        -1 javac. The applied patch generated 5 javac compiler warnings (more than the trunk's current 4 warnings).

        -1 findbugs. The patch appears to introduce 8 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.master.TestMasterFailover

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2379//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2379//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2379//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2379//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12536324/HBASE-6389_trunk.patch against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile. +1 javadoc. The javadoc tool did not generate any warning messages. -1 javac. The applied patch generated 5 javac compiler warnings (more than the trunk's current 4 warnings). -1 findbugs. The patch appears to introduce 8 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.master.TestMasterFailover Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2379//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2379//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2379//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2379//console This message is automatically generated.
        Hide
        Aditya Kishore added a comment -

        The test failure is related to this patch, I am looking into this.

        Show
        Aditya Kishore added a comment - The test failure is related to this patch, I am looking into this.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12536317/HBASE-6389_trunk.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        -1 javac. The applied patch generated 5 javac compiler warnings (more than the trunk's current 4 warnings).

        -1 findbugs. The patch appears to introduce 8 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.master.TestMasterFailover

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2377//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2377//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2377//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2377//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12536317/HBASE-6389_trunk.patch against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile. +1 javadoc. The javadoc tool did not generate any warning messages. -1 javac. The applied patch generated 5 javac compiler warnings (more than the trunk's current 4 warnings). -1 findbugs. The patch appears to introduce 8 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.master.TestMasterFailover Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2377//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2377//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2377//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2377//console This message is automatically generated.
        Hide
        Aditya Kishore added a comment -

        Modified patch.

        Show
        Aditya Kishore added a comment - Modified patch.
        Hide
        Ted Yu added a comment -

        Several variables are no longer final but I only see this extra assignment:

        +        maxToStart = minToStart;
        

        It would be nice to keep other variables final.

        Show
        Ted Yu added a comment - Several variables are no longer final but I only see this extra assignment: + maxToStart = minToStart; It would be nice to keep other variables final.
        Hide
        Lars Hofhansl added a comment -

        +1 on patch. I'd add a warning (or at least an info) if maxToStart < minToStart.
        Man, it's hard to wrap the head around this logic and ensure it's correct.

        Show
        Lars Hofhansl added a comment - +1 on patch. I'd add a warning (or at least an info) if maxToStart < minToStart. Man, it's hard to wrap the head around this logic and ensure it's correct.
        Hide
        Aditya Kishore added a comment -

        Also, please note that HBASE-4993 has been addressed by the other change alone by setting maxToStart = minToStart before launching the mini cluster. So this change does not undo HBASE-4993

        Show
        Aditya Kishore added a comment - Also, please note that HBASE-4993 has been addressed by the other change alone by setting maxToStart = minToStart before launching the mini cluster. So this change does not undo HBASE-4993
        Hide
        Aditya Kishore added a comment -

        Apart from the main issue, this patch ensures "maxToStart >= minToStart" in case of incorrect configuration.

        Show
        Aditya Kishore added a comment - Apart from the main issue, this patch ensures "maxToStart >= minToStart" in case of incorrect configuration.
        Hide
        Lars Hofhansl added a comment -

        I'll hold 0.94.1 for this.

        Show
        Lars Hofhansl added a comment - I'll hold 0.94.1 for this.
        Hide
        Aditya Kishore added a comment -

        Patch for trunk.

        Show
        Aditya Kishore added a comment - Patch for trunk.
        Hide
        Aditya Kishore added a comment -

        With the changed condition, Master will wait until one of the following conditions is met

        1. The master is stopped.
        2. The 'hbase.master.wait.on.regionservers.maxtostart' number of region servers is reached.
        3. The 'hbase.master.wait.on.regionservers.mintostart' is reached
          AND
          No new region server checks-in for 'hbase.master.wait.on.regionservers.interval' time.
          AND
          The 'hbase.master.wait.on.regionservers.timeout' is reached.
        Show
        Aditya Kishore added a comment - With the changed condition, Master will wait until one of the following conditions is met The master is stopped. The 'hbase.master.wait.on.regionservers.maxtostart' number of region servers is reached. The 'hbase.master.wait.on.regionservers.mintostart' is reached AND No new region server checks-in for 'hbase.master.wait.on.regionservers.interval' time. AND The 'hbase.master.wait.on.regionservers.timeout' is reached.

          People

          • Assignee:
            Aditya Kishore
            Reporter:
            Aditya Kishore
          • Votes:
            0 Vote for this issue
            Watchers:
            12 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development