HBASE-4951

Master process cannot be stopped while it is initializing

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 0.90.3
    • Fix Version/s: 0.90.7
    • Component/s: master
    • Labels:
      None

      Description

      This is easy to reproduce with the following steps:
      Step 1: Start the master process (do not start any regionserver process in the cluster).
      The master will wait for a regionserver to check in:
      org.apache.hadoop.hbase.master.ServerManager: Waiting on regionserver(s) to checkin

      Step 2: Stop the master with the shell command bin/hbase master stop

      Result: The master process never dies, because the catalogTracker.waitForRoot() method blocks until the root region is assigned.

      1. HBASE-4951_branch.patch
        6 kB
        ramkrishna.s.vasudevan
      2. HBASE-4951.patch
        2 kB
        ramkrishna.s.vasudevan

        Activity

        Jean-Daniel Cryans added a comment -

        Unmarking patch available, more than a year old.

        ramkrishna.s.vasudevan added a comment -

        Updated the affected version to 0.90.7.

        Ted Yu added a comment -

        Minor comments:

        +        if (LOG.isTraceEnabled()) {
        +          LOG.info(".META. still not available, sleeping and retrying." +
        

        Should LOG.isInfoEnabled() be called above?
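
The point about the guard can be illustrated with a minimal sketch in which the level checked matches the level logged. It uses java.util.logging so that it is self-contained, which is an assumption made only for this example; the patch itself uses the LOG object shown in the snippet above. The class name, method name, and message text are hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical illustration only; not code from the patch. */
class GuardedLogging {
  private static final Logger LOG = Logger.getLogger(GuardedLogging.class.getName());

  /** Logs a retry message at INFO and reports whether the guard let it through. */
  static boolean logRetry(int attempt) {
    // The guard checks INFO because the call below logs at INFO.
    if (LOG.isLoggable(Level.INFO)) {
      LOG.info("Resource still not available, retrying (attempt " + attempt + ")");
      return true;
    }
    return false;
  }
}
```

With a TRACE-level guard around an INFO-level call, the message would be dropped whenever TRACE is disabled, even if INFO is enabled.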

        The following two comments apply to ZooKeeperNodeTracker.java and CatalogTracker.java:

        +        // chk if the shutdown node is available. Because incase of cluster
        +        // shutdonw
        +        // this loop prevents the master from going down.
        

        The above should read: check if the shutdown node exists. In case of cluster shutdown, ...

        +              "Unexpected exception handling while checking if shutdown node exists.",
        

        Extra word above: handling

        ramkrishna.s.vasudevan added a comment -

        Patch for the 0.90 branch.

        ramkrishna.s.vasudevan added a comment -

        @Xufeng
        Yes, you are right. The patch addresses only the one problem, where the master alone is started.
        We can check for the shutdown node while waiting for ROOT, and if the shutdown node does not exist we can come out of the wait.
        Thanks for the finding.
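
The idea of leaving the wait once the shutdown node is gone could be sketched roughly as follows. This is not the patch: the class, the enum, and the two BooleanSupplier parameters are hypothetical stand-ins for "ROOT is available" and "the shutdown node still exists", and the polling interval is an assumption.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of a wait that gives up when the cluster is shutting down. */
class RootWaitSketch {
  enum Outcome { ROOT_AVAILABLE, CLUSTER_SHUTDOWN, TIMED_OUT, INTERRUPTED }

  static Outcome waitForRoot(BooleanSupplier rootAvailable,
                             BooleanSupplier shutdownNodeExists,
                             long timeoutMillis,
                             long pollMillis) {
    long deadline = System.currentTimeMillis() + timeoutMillis;
    while (true) {
      if (rootAvailable.getAsBoolean()) {
        return Outcome.ROOT_AVAILABLE;
      }
      // A missing shutdown node is taken to mean that cluster shutdown was requested.
      if (!shutdownNodeExists.getAsBoolean()) {
        return Outcome.CLUSTER_SHUTDOWN;
      }
      if (System.currentTimeMillis() >= deadline) {
        return Outcome.TIMED_OUT;
      }
      try {
        Thread.sleep(pollMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return Outcome.INTERRUPTED;
      }
    }
  }
}
```

A caller would treat CLUSTER_SHUTDOWN as the signal to abandon initialization instead of waiting for a root region that will never be assigned.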

        xufeng added a comment -

        I think this problem also exists in trunk with this patch.

        xufeng added a comment -

        I tested this patch on 0.90.
        It does not work in the following scenario:
        1. The master starts up and one regionserver starts up.
        2. waitForRegionServers completes successfully.
        3. Run bin/hbase master stop before the root region is assigned.

        bin/hbase master stop stops the cluster, and the regionserver is killed first.
        The root region then has no chance to be assigned, so the master blocks in catalogTracker.waitForRoot().

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12506976/HBASE-4951.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        -1 javadoc. The javadoc tool appears to have generated -160 warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 75 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.client.TestAdmin
        org.apache.hadoop.hbase.replication.TestReplication

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/488//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/488//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/488//console

        This message is automatically generated.

        stack added a comment -

        Missing @return javadoc, but I'd say it's kind of strange to have waitForRegionServers return whether the master is stopped or not. I would suggest changing '+ if (masterStopped) {' into 'if (this.master.isStopped()) {' instead. That makes the patch smaller too.
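
The suggested shape might look something like the sketch below, where the wait method returns nothing and the caller asks the master for its stop status directly. The StopFlag interface and the StartupSketch class are hypothetical stand-ins written for this example; only the names waitForRegionServers, finishInitialization, and isStopped() come from the discussion.

```java
/** Hypothetical stand-in for the master's stop status; not the HBase interface. */
interface StopFlag {
  boolean isStopped();
}

/** Hypothetical sketch of the caller-side check suggested in the review. */
class StartupSketch {
  private final StopFlag master;

  StartupSketch(StopFlag master) {
    this.master = master;
  }

  /** Waits for regionservers; returns nothing, so it carries no stop status. */
  void waitForRegionServers() {
    // A real wait would go here and would return early once the master is stopped.
  }

  /** Returns true if initialization should continue, false if the master was stopped. */
  boolean finishInitialization() {
    waitForRegionServers();
    // Ask the master directly instead of threading a boolean through the wait method.
    if (this.master.isStopped()) {
      return false;
    }
    return true;
  }
}
```

Keeping the stop status in one place avoids a second source of truth and leaves the wait method's signature unchanged.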

        Ted Yu added a comment -

        +1 on patch.

        ramkrishna.s.vasudevan added a comment -

        This is for trunk.

        stack added a comment -

        Ram, I made you an admin on JIRA. You should be able to edit comments. Resolve if you think appropriate.

        ramkrishna.s.vasudevan added a comment -

        The thread dump posted on 09/Dec/11 11:26 is not for this issue. Sorry. Please edit the comment.

        ramkrishna.s.vasudevan added a comment -

        The problem is that the stopped variable is checked only in HMaster.run():

              // We are either the active master or we were asked to shutdown
              if (!this.stopped) {
                finishInitialization(startupStatus);
                loop();
              }
        

        But in finishInitialization we wait for ROOT to be assigned, and that is a timed wait. So even though the master is stopped, we never get the chance to check the stopped variable in the master.

        The same happens with splitLogAfterStartup.

        Attaching the thread dumps:

        "master-linux76,60000,1323444760834" prio=10 tid=0x085cdc00 nid=0x593d in Object.wait() [0x6fa6f000..0x6fa6ff50]
           java.lang.Thread.State: TIMED_WAITING (on object monitor)
        	at java.lang.Object.wait(Native Method)
        	- waiting on <0x74112e28> (a org.apache.hadoop.hbase.zookeeper.RootRegionTracker)
        	at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.blockUntilAvailable(ZooKeeperNodeTracker.java:132)
        	- locked <0x74112e28> (a org.apache.hadoop.hbase.zookeeper.RootRegionTracker)
        	at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.blockUntilAvailable(ZooKeeperNodeTracker.java:104)
        	- locked <0x74112e28> (a org.apache.hadoop.hbase.zookeeper.RootRegionTracker)
        	at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForRoot(CatalogTracker.java:292)
        	at org.apache.hadoop.hbase.master.HMaster.assignRootAndMeta(HMaster.java:573)
        	at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:506)
        	at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:336)
        	at java.lang.Thread.run(Thread.java:619)
        
        "master-C3S31,20000,1323415196055" prio=10 tid=0x000000004036a000 nid=0x60dd waiting on condition [0x00007f708b6f5000]
           java.lang.Thread.State: TIMED_WAITING (sleeping)
        	at java.lang.Thread.sleep(Native Method)
        	at org.apache.hadoop.hbase.master.MasterFileSystem.splitLogAfterStartup(MasterFileSystem.java:226)
        	at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:474)
        	at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:314)
        	at java.lang.Thread.run(Thread.java:662)
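
One way to let a timed wait like the one in the first dump notice a stop request is to re-check a stop flag on every wake-up. The following is a minimal sketch under that assumption; StopAwareWaiter and its members are hypothetical and are not the ZooKeeperNodeTracker implementation.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for a tracker that blocks until a node is available. */
class StopAwareWaiter {
  private final Object lock = new Object();
  private boolean available = false;   // set once the awaited node appears
  private final AtomicBoolean stopped; // hypothetical stand-in for the master's stopped flag

  StopAwareWaiter(AtomicBoolean stopped) {
    this.stopped = stopped;
  }

  /** Marks the awaited node as available and wakes any waiter. */
  void markAvailable() {
    synchronized (lock) {
      available = true;
      lock.notifyAll();
    }
  }

  /** Returns true once available, or false if a stop or an interrupt arrives first. */
  boolean waitUntilAvailableOrStopped(long pollMillis) {
    synchronized (lock) {
      while (!available) {
        // Re-check the stop flag on every wake-up so a stop request ends the wait.
        if (stopped.get()) {
          return false;
        }
        try {
          lock.wait(pollMillis);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // preserve the interrupt for the caller
          return false;
        }
      }
      return true;
    }
  }
}
```

The same re-check would apply to a sleep-based loop such as the one in the second dump.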
        
        xufeng added a comment -

        @ramkrishna
        Thanks.
        Do you think we should fix it in 0.90?
        I will try to create a patch.

        ramkrishna.s.vasudevan added a comment -

        @Xufeng

        I used ./hbase-daemon.sh stop master and it stopped. My bad.

        If we use ./hbase master stop, it doesn't stop.

        ramkrishna.s.vasudevan added a comment -

        The master does get stopped in 0.92. I tried it.
        Correct me if I am wrong.

        If that is OK, can we close this issue?

        stack added a comment -

        I believe this is fixed in 0.92. Won't close until we prove it.


          People

          • Assignee:
            ramkrishna.s.vasudevan
            Reporter:
            xufeng
          • Votes:
            0
            Watchers:
            2
