HBase / HBASE-6122

Backup master does not become Active master after ZK exception

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.94.0
    • Fix Version/s: 0.92.2, 0.94.1
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      -> The active master gets a ZK session expiry exception.
      -> The backup master becomes active.
      -> The previous active master retries and becomes the backup master.
      Now, when the new active master goes down and the current backup master tries to take over, it goes down again with the ZK expiry exception it got in the first step, in the following abort code:

      if (abortNow(msg, t)) {
        if (t != null) LOG.fatal(msg, t);
        else LOG.fatal(msg);
        this.abort = true;
        stop("Aborting");
      }
      

      In ActiveMasterManager.blockUntilBecomingActiveMaster we wait until the backup master can become active.

          synchronized (this.clusterHasActiveMaster) {
            while (this.clusterHasActiveMaster.get() && !this.master.isStopped()) {
              try {
                this.clusterHasActiveMaster.wait();
              } catch (InterruptedException e) {
                // We expect to be interrupted when a master dies, will fall out if so
                LOG.debug("Interrupted waiting for master to die", e);
              }
            }
            if (!clusterStatusTracker.isClusterUp()) {
              this.master.stop("Cluster went down before this master became active");
            }
            if (this.master.isStopped()) {
              return cleanSetOfActiveMaster;
            }
            // Try to become active master again now that there is no active master
            blockUntilBecomingActiveMaster(startupStatus,clusterStatusTracker);
          }
          return cleanSetOfActiveMaster;
      

      When the backup master (it is in backup mode because it got the ZK exception) once again tries to become active, we do not use the return value of the recursive call:

      // Try to become active master again now that there is no active master
      blockUntilBecomingActiveMaster(startupStatus,clusterStatusTracker);
      

      Instead, we return the caller's own 'cleanSetOfActiveMaster', which was previously false.
      Because of this, instead of becoming active again, the backup master goes down in the abort() code. Thanks to my colleague Gopi for reporting this issue.
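
The dropped return value can be reproduced in isolation. The following is a minimal sketch, not HBase code: the class `ActiveMasterRetrySketch`, the `Election` helper, and both method names are hypothetical stand-ins, and the "fixed" variant is only one plausible way to propagate the result. The attached patches remain the authoritative change.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Simplified model of the retry in blockUntilBecomingActiveMaster (hypothetical, not HBase code). */
public class ActiveMasterRetrySketch {

    /** Hypothetical stand-in for the ZooKeeper race: succeeds from the given attempt onward. */
    static final class Election {
        private final int winningAttempt;
        private final AtomicInteger attempts = new AtomicInteger();

        Election(int winningAttempt) {
            this.winningAttempt = winningAttempt;
        }

        boolean tryToBecomeActive() {
            return attempts.incrementAndGet() >= winningAttempt;
        }
    }

    /** Buggy shape: the recursive retry succeeds, but its result is discarded. */
    static boolean becomeActiveBuggy(Election election) {
        boolean cleanSetOfActiveMaster = false;
        if (election.tryToBecomeActive()) {
            cleanSetOfActiveMaster = true;
            return cleanSetOfActiveMaster;
        }
        // Retry, ignoring what the retry returns.
        becomeActiveBuggy(election);
        // This frame's flag is still false, so the caller sees a failure.
        return cleanSetOfActiveMaster;
    }

    /** One possible fix (an assumption, not the committed patch): propagate the retry's result. */
    static boolean becomeActiveFixed(Election election) {
        if (election.tryToBecomeActive()) {
            return true;
        }
        return becomeActiveFixed(election);
    }

    public static void main(String[] args) {
        // The first attempt fails and the second succeeds in both runs.
        System.out.println("buggy: " + becomeActiveBuggy(new Election(2)));
        System.out.println("fixed: " + becomeActiveFixed(new Election(2)));
    }
}
```

In this model the buggy variant reports false even though the retry won the election, which corresponds to the master treating a successful takeover as a failure and aborting.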

      1. HBASE-6122_0.94.patch
        0.7 kB
        ramkrishna.s.vasudevan
      2. HBASE-6122_0.92.patch
        0.7 kB
        ramkrishna.s.vasudevan
      3. HBASE-6122.patch
        2 kB
        ramkrishna.s.vasudevan
      4. HBASE-6122_0.94.patch
        2 kB
        ramkrishna.s.vasudevan

        Activity

        Nicolas Liochon added a comment -

        Thanks, I will give it a try to be sure.

        ramkrishna.s.vasudevan added a comment -

        @N
        The trunk code is different. Currently there is a while(true) loop and, as far as I can see, it should be OK in trunk.
        I did not try to reproduce in trunk.

        Nicolas Liochon added a comment -

        @ram

        "I found some changes in the trunk code. So not sure if it is applicable in trunk. Attached patches for 0.94 and 0.92."

        Do you mean that the problem is not reproducible on trunk?

        Hudson added a comment -

        Integrated in HBase-0.92-security #109 (See https://builds.apache.org/job/HBase-0.92-security/109/)
        HBASE-6122 Backup master does not become Active master after ZK exception (Ram) (Revision 1344799)
        HBASE-6122 Backup master does not become Active master after ZK exception: REVERT (Revision 1344466)
        HBASE-6122 Backup master does not become Active master after ZK exception (Ram) (Revision 1344350)

        Result = SUCCESS
        ramkrishna :
        Files :

        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java
        • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java

        stack :
        Files :

        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java

        ramkrishna :
        Files :

        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java
        Hudson added a comment -

        Integrated in HBase-0.94-security #33 (See https://builds.apache.org/job/HBase-0.94-security/33/)
        HBASE-6122 Backup master does not become Active master after ZK exception (Ram) (Revision 1344798)
        HBASE-6122 Backup master does not become Active master after ZK exception: REVERT (Revision 1344467)
        HBASE-6122 Backup master does not become Active master after ZK exception (Ram) (Revision 1344348)

        Result = FAILURE
        ramkrishna :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java

        stack :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java

        ramkrishna :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java
        Hudson added a comment -

        Integrated in HBase-0.92 #439 (See https://builds.apache.org/job/HBase-0.92/439/)
        HBASE-6122 Backup master does not become Active master after ZK exception (Ram) (Revision 1344799)

        Result = FAILURE
        ramkrishna :
        Files :

        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java
        • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java
        Hudson added a comment -

        Integrated in HBase-0.94 #240 (See https://builds.apache.org/job/HBase-0.94/240/)
        HBASE-6122 Backup master does not become Active master after ZK exception (Ram) (Revision 1344798)

        Result = SUCCESS
        ramkrishna :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java
        ramkrishna.s.vasudevan added a comment -

        Committed again with the latest patch to 0.92 and 0.94. Hope things are ok this time. Thanks Stack for your review.

        stack added a comment -

        Looks good Ram. +1

        ramkrishna.s.vasudevan added a comment - - edited

        I have attached the patch, Stack. It changes the assert in the test case TestMasterZKSessionRecovery.testMasterZKSessionRecoveryFailure.

        stack added a comment -

        @Ram Which assert should be changed? Do you want to include the assert change in your patch? Or are you suggesting a previous test case is broken? If so, which? Thanks.

        ramkrishna.s.vasudevan added a comment -

        Please take a look at the patch. I have not corrected the test case for the test case to pass. Ideally the previous test was covering up the bug. Correct me if I am wrong.

        ramkrishna.s.vasudevan added a comment - - edited

        I checked the test case.
        Ideally, the flow makes the master become active, but the problem described in this JIRA still makes the master go down.

        I added a log statement in ActiveMasterManager.blockUntilBecomingActiveMaster:

                LOG.info("Master is now available "+this.sn);
                this.clusterHasActiveMaster.set(true);
                LOG.info("Master=" + this.sn);
                return cleanSetOfActiveMaster;
        

        See the following lines in the logs:

        2012-05-31 10:52:29,050 INFO  [pool-29-thread-1] master.ActiveMasterManager(149): Master is now available Htipl-01388.china.huawei.com,3569,1338441734226
        2012-05-31 10:52:29,050 INFO  [pool-29-thread-1] master.ActiveMasterManager(151): Master=Htipl-01388.china.huawei.com,3569,1338441734226
        

        This means the master should ideally come up if there is no problem in becoming active again. Along with the patch, this test case should be modified to change the assertTrue to assertFalse.

        Please correct me if I am wrong. The fix still remains valid.

        ramkrishna.s.vasudevan added a comment -

        Oh... Let me check out the reason for the failure. Sorry for the mess.

        Hudson added a comment -

        Integrated in HBase-0.92 #435 (See https://builds.apache.org/job/HBase-0.92/435/)
        HBASE-6122 Backup master does not become Active master after ZK exception: REVERT (Revision 1344466)

        Result = SUCCESS
        stack :
        Files :

        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java
        Hudson added a comment -

        Integrated in HBase-0.94 #236 (See https://builds.apache.org/job/HBase-0.94/236/)
        HBASE-6122 Backup master does not become Active master after ZK exception: REVERT (Revision 1344467)

        Result = SUCCESS
        stack :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java
        stack added a comment -

        I reverted from the 0.92 and 0.94 branches until we figure out the failures.

        stack added a comment -

        Reopening. Backing out these patches. They seem responsible for these failures:
        https://builds.apache.org/job/HBase-0.92/433/

        Hudson added a comment -

        Integrated in HBase-0.92 #433 (See https://builds.apache.org/job/HBase-0.92/433/)
        HBASE-6122 Backup master does not become Active master after ZK exception (Ram) (Revision 1344350)

        Result = FAILURE
        ramkrishna :
        Files :

        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java
        Hudson added a comment -

        Integrated in HBase-0.94 #233 (See https://builds.apache.org/job/HBase-0.94/233/)
        HBASE-6122 Backup master does not become Active master after ZK exception (Ram) (Revision 1344348)

        Result = FAILURE
        ramkrishna :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java
        ramkrishna.s.vasudevan added a comment -

        Committed to 0.92 and 0.94.
        Thanks for the review Lars.

        Lars Hofhansl added a comment -

        +1 patch looks good to me.

        ramkrishna.s.vasudevan added a comment -

        I found some changes in the trunk code. So not sure if it is applicable in trunk. Attached patches for 0.94 and 0.92.


          People

          • Assignee:
            ramkrishna.s.vasudevan
            Reporter:
            ramkrishna.s.vasudevan
          • Votes:
            0
            Watchers:
            9
