HBASE-5603

rolling-restart.sh script hangs when attempting to detect expiration of /hbase/master znode.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.92.0, 0.94.0, 0.95.2
    • Fix Version/s: 0.94.0, 0.95.0
    • Component/s: Zookeeper
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Due to bugfix ZOOKEEPER-1059 (ZK 3.4.0+), the rolling-restart.sh script will hang when attempting to make sure the /hbase/master znode is deleted.

      Here's the code

      # make sure the master znode has been deleted before continuing
          zparent=`$bin/hbase org.apache.hadoop.hbase.util.HBaseConfTool zookeeper.znode.parent`
          if [ "$zparent" == "null" ]; then zparent="/hbase"; fi
          zmaster=`$bin/hbase org.apache.hadoop.hbase.util.HBaseConfTool zookeeper.znode.master`
          if [ "$zmaster" == "null" ]; then zmaster="master"; fi
          zmaster=$zparent/$zmaster
          echo -n "Waiting for Master ZNode ${zmaster} to expire"
          while bin/hbase zkcli stat $zmaster >/dev/null 2>&1; do
            echo -n "."
            sleep 1
          done
          echo #force a newline
      

      Prior to ZOOKEEPER-1059, running stat on a nonexistent znode would throw an NPE and cause zkcli to exit with return code 1. With that fix, the null is caught, so zkcli exits with 0 both when the znode is present and when it does not exist. The while loop above therefore never sees a failing exit code and never terminates.
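The difference between checking the exit code and checking the output can be sketched as follows. This is only an illustration: `fake_zkcli_stat` is a hypothetical stand-in that mimics the behavior described above (exit code 0 in both cases), the `MASTER_PRESENT` variable is an assumption used to toggle the simulated state, and the stat output line is abbreviated. It does not call the real zkcli.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for `hbase zkcli stat <znode>` after ZOOKEEPER-1059:
# it always exits 0, and only the printed text says whether the znode exists.
fake_zkcli_stat() {
  if [ "$1" = "/hbase/master" ] && [ "${MASTER_PRESENT:-0}" = "1" ]; then
    echo "cZxid = 0x1"                # abbreviated, illustrative stat output
  else
    echo "Node does not exist: $1"
  fi
  return 0
}

# Exit-code check (what the script does): succeeds even when the znode is gone.
exists_by_exit_code() {
  fake_zkcli_stat "$1" >/dev/null 2>&1
}

# Output check: treats the znode as present unless the message is printed.
exists_by_output() {
  ! fake_zkcli_stat "$1" 2>&1 | grep -q "Node does not exist"
}
```

With `MASTER_PRESENT=0`, `exists_by_exit_code /hbase/master` still succeeds, which is why a loop driven by the exit code keeps waiting, while `exists_by_output /hbase/master` fails as expected.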

      1. HBASE-5603.patch
        0.5 kB
        Jonathan Hsieh

          Activity

          Hudson added a comment -

          Integrated in HBase-TRUNK #2689 (See https://builds.apache.org/job/HBase-TRUNK/2689/)
          HBASE-5603 rolling-restart.sh script hangs when attempting to detect expiration of /hbase/master znode (Revision 1303184)

          Result = FAILURE
          jmhsieh :
          Files :

          • /hbase/trunk/bin/rolling-restart.sh
          Hudson added a comment -

          Integrated in HBase-TRUNK-security #144 (See https://builds.apache.org/job/HBase-TRUNK-security/144/)
          HBASE-5603 rolling-restart.sh script hangs when attempting to detect expiration of /hbase/master znode (Revision 1303184)

          Result = FAILURE
          jmhsieh :
          Files :

          • /hbase/trunk/bin/rolling-restart.sh
          Hudson added a comment -

          Integrated in HBase-0.92 #331 (See https://builds.apache.org/job/HBase-0.92/331/)
          HBASE-5603 rolling-restart.sh script hangs when attempting to detect expiration of /hbase/master znode (Revision 1303187)

          Result = SUCCESS
          jmhsieh :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/bin/rolling-restart.sh
          Hudson added a comment -

          Integrated in HBase-0.94 #43 (See https://builds.apache.org/job/HBase-0.94/43/)
          HBASE-5603 rolling-restart.sh script hangs when attempting to detect expiration of /hbase/master znode (Revision 1303186)

          Result = SUCCESS
          jmhsieh :
          Files :

          • /hbase/branches/0.94/bin/rolling-restart.sh
          Jonathan Hsieh added a comment -

          Committed. Thanks Lars and Ted.

          Jonathan Hsieh added a comment -

          Tested on 5 node cluster on 0.92.2-SNAPSHOT. The portion in question works as desired.

          There are other problems (hmm.. my ROOT is stuck-in-transition) but that is a different bug and separate issue. Further digging required there.

          Lars Hofhansl added a comment -

          You mean back porting HBASE-5314 to 0.94? That seems like a reasonable plan.

          Ted Yu added a comment -

          What about HBASE-5314?
          I think that is a feature making bin/rolling-restart.sh more usable.

          Lars Hofhansl added a comment -

          Sounds like a good plan.

          Jonathan Hsieh added a comment -

          I'm testing now.

          How about this: if it works, we'll put it in all of 0.92/0.94/0.96/trunk, and open a separate jira to remove it from 0.96, plus follow-up issues if desired.

          Lars Hofhansl added a comment -

          @Ted: Let's fix the issue at hand in this jira.
          We can have another jira for the point you mention.

          Lars Hofhansl added a comment -

          Is there somebody who can run a quick test with this script and confirm that it works? (In our cluster we do not use the HBase shell scripts, so I can't test this.)

          Ted Yu added a comment -

          w.r.t. http://hbase.apache.org/book.html#rolling, I think we can add more details to it.
          e.g. see the following code snippet in bin/rolling-restart.sh:

              "$bin"/hbase-daemons.sh --config "${HBASE_CONF_DIR}" \
                --hosts "${HBASE_BACKUP_MASTERS}" stop master-backup
          

          bin/graceful_stop.sh is mentioned in http://hbase.apache.org/book.html#decommission so people may not intuitively associate it with rolling restart.

          Different companies have different practices w.r.t. rolling restart.
          Since rolling-restart.sh was recently enhanced to respect region placement:

          r1299983 | stack | 2012-03-12 23:30:15 -0700 (Mon, 12 Mar 2012) | 1 line
          
          HBASE-5314 Gracefully rolling restart region servers in rolling-restart.sh
          ------------------------------------------------------------------------
          

          I think we should put it in a usable form.

          Lars Hofhansl added a comment -

          I'd say let's get this into 0.92 and 0.94 and then decide what to do in 0.96. If it's really useless, then remove it.

          Jonathan Hsieh added a comment -

          There was recently a patch committed to the script which makes me think there is at least someone using this. Would it be best to do a quick "fix" for 0.92 and decide to remove for 0.94+ or from 0.96+?

          Jonathan Hsieh added a comment -

          I'm 0.5+ (not a huge fan) of doing grep in the long run since it can be brittle, but agree that it should solve the problem quickly.

          Ted Yu added a comment -

          Patrick Hunt made a similar suggestion. Allow me to quote him:

          You can look for "^Node does not exist" in the stat output instead of
          checking the exit code. This would get around the problem until a more
          permanent solution could be found.
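A minimal sketch of that suggestion, with a retry limit added so the wait cannot hang indefinitely. The helper name `wait_until_znode_gone`, the retry limit, and the `POLL_INTERVAL` variable are assumptions made for illustration; none of them are part of rolling-restart.sh. The anchored pattern also assumes the message starts at the beginning of an output line.

```shell
#!/usr/bin/env bash
# wait_until_znode_gone MAX_TRIES CMD [ARGS...]
# Hypothetical helper: runs CMD repeatedly until its output contains a line
# starting with "Node does not exist", or gives up after MAX_TRIES attempts.
wait_until_znode_gone() {
  local max_tries=$1
  shift
  local tries=0
  # Inspect the command's output rather than its exit code.
  while ! "$@" 2>&1 | grep -q "^Node does not exist"; do
    tries=$((tries + 1))
    if [ "$tries" -ge "$max_tries" ]; then
      return 1                      # message never seen within the limit
    fi
    sleep "${POLL_INTERVAL:-1}"     # seconds between attempts (assumed default)
  done
  return 0
}
```

In the script this might be called as `wait_until_znode_gone 60 bin/hbase zkcli stat "$zmaster"`. If zkcli ever prefixes the message with other text on the same line, the `^` anchor would need to be dropped.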

          Lars Hofhansl added a comment -

          Could change

          while bin/hbase zkcli stat $zmaster >/dev/null 2>&1; do
          

          to

          while ! bin/hbase zkcli stat $zmaster 2>&1 | grep "Node does not exist" > /dev/null; do
          
          Lars Hofhansl added a comment -

          Looks like this is a blocker for 0.92 and 0.94

          Jonathan Hsieh added a comment -

          HBASE-2418 upgraded from the ZK 3.3.x to ZK 3.4.x


            People

            • Assignee:
              Jonathan Hsieh
              Reporter:
              Jonathan Hsieh
            • Votes:
              0
              Watchers:
              2