  1. HBase
  2. HBASE-6122

Backup master does not become Active master after ZK exception

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.94.0
    • Fix Version/s: 0.92.2, 0.94.1
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      -> The active master gets a ZK session expiry exception.
      -> The backup master becomes active.
      -> The previous active master retries and becomes the backup master.
      Now when the new active master goes down and the current backup master tries to take over, it aborts again because of the ZK expiry exception it got in the first step:

      if (abortNow(msg, t)) {
        if (t != null) LOG.fatal(msg, t);
        else LOG.fatal(msg);
        this.abort = true;
        stop("Aborting");
      }
      

      In ActiveMasterManager.blockUntilBecomingActiveMaster we wait until the backup master can become active.

          synchronized (this.clusterHasActiveMaster) {
            while (this.clusterHasActiveMaster.get() && !this.master.isStopped()) {
              try {
                this.clusterHasActiveMaster.wait();
              } catch (InterruptedException e) {
                // We expect to be interrupted when a master dies, will fall out if so
                LOG.debug("Interrupted waiting for master to die", e);
              }
            }
            if (!clusterStatusTracker.isClusterUp()) {
              this.master.stop("Cluster went down before this master became active");
            }
            if (this.master.isStopped()) {
              return cleanSetOfActiveMaster;
            }
            // Try to become active master again now that there is no active master
            blockUntilBecomingActiveMaster(startupStatus,clusterStatusTracker);
          }
          return cleanSetOfActiveMaster;
      

      When the backup master (which is in backup mode because it got the ZK exception) once again tries to become active, we discard the return value of the recursive call:

      // Try to become active master again now that there is no active master
      blockUntilBecomingActiveMaster(startupStatus,clusterStatusTracker);
      

      Instead we return the local 'cleanSetOfActiveMaster', which was previously false.
      Because of this, instead of becoming active again, the backup master goes down in the abort() code. Thanks to Gopi, my colleague, for reporting this issue.
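
      The effect of dropping the recursive call's result can be reproduced in a small standalone model. This is only a sketch: the class RetryReturnValueSketch, its methods, and the failure counter are hypothetical names invented for illustration. It is not the HBase source or the attached patch, and it assumes the first attempt fails and the retry succeeds.

```java
/** Hypothetical, simplified model of the retry logic; not the HBase source. */
public class RetryReturnValueSketch {

    /** Number of attempts that fail before one succeeds (illustrative only). */
    private int failuresLeft;

    public RetryReturnValueSketch(int failuresBeforeSuccess) {
        this.failuresLeft = failuresBeforeSuccess;
    }

    /** One attempt to take the active role; returns true on success. */
    private boolean tryBecomeActive() {
        if (failuresLeft > 0) {
            failuresLeft--;
            return false;
        }
        return true;
    }

    /** Buggy shape: the retry's result is dropped, so the stale value is returned. */
    public boolean becomeActiveDiscardingRetry() {
        boolean cleanSetOfActiveMaster = tryBecomeActive();
        if (cleanSetOfActiveMaster) {
            return true;
        }
        becomeActiveDiscardingRetry();  // result of the retry is ignored
        return cleanSetOfActiveMaster;  // still the value from the first attempt
    }

    /** Fixed shape: propagate whatever the retry actually returned. */
    public boolean becomeActivePropagatingRetry() {
        boolean cleanSetOfActiveMaster = tryBecomeActive();
        if (cleanSetOfActiveMaster) {
            return true;
        }
        return becomeActivePropagatingRetry();
    }

    public static void main(String[] args) {
        // One failed attempt, then a successful retry.
        System.out.println(new RetryReturnValueSketch(1).becomeActiveDiscardingRetry());
        System.out.println(new RetryReturnValueSketch(1).becomeActivePropagatingRetry());
    }
}
```

      In this model the first variant reports failure even though the retry succeeded, which matches the behaviour described above; returning the result of the recursive call is one way to avoid that.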

      Attachments

        1. HBASE-6122_0.94.patch
          2 kB
          ramkrishna.s.vasudevan
        2. HBASE-6122.patch
          2 kB
          ramkrishna.s.vasudevan
        3. HBASE-6122_0.92.patch
          0.7 kB
          ramkrishna.s.vasudevan
        4. HBASE-6122_0.94.patch
          0.7 kB
          ramkrishna.s.vasudevan

          People

            Assignee: ramkrishna.s.vasudevan (ram_krish)
            Reporter: ramkrishna.s.vasudevan (ram_krish)
            Votes: 0
            Watchers: 8