  1. HBase
  2. HBASE-4397

-ROOT-, .META. tables stay offline for too long in recovery phase after all RSs are shutdown at the same time

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.92.0, 0.94.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      1. Shutdown all RSs.
      2. Bring all RS back online.

      The "-ROOT-" and ".META." tables stay offline until the timeout monitor forces assignment 30 minutes later. This happens because HMaster can't find an RS to assign the tables to during the assign operation.

      2011-09-13 13:25:52,743 WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of ROOT,,0.70236052 to sea-lab-4,60020,1315870341387, trying to assign elsewhere instead; retry=0
      java.net.ConnectException: Connection refused
      at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
      at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
      at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
      at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:373)
      at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:345)
      at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1002)
      at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:854)
      at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:148)
      at $Proxy9.openRegion(Unknown Source)
      at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:407)
      at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1408)
      at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1153)
      at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1128)
      at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1123)
      at org.apache.hadoop.hbase.master.AssignmentManager.assignRoot(AssignmentManager.java:1788)
      at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.verifyAndAssignRoot(ServerShutdownHandler.java:100)
      at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.verifyAndAssignRootWithRetries(ServerShutdownHandler.java:118)
      at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:181)
      at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:167)
      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
      at java.lang.Thread.run(Thread.java:662)
      2011-09-13 13:25:52,743 WARN org.apache.hadoop.hbase.master.AssignmentManager: Unable to find a viable location to assign region ROOT,,0.70236052

      Possible fixes:

      1. Have serverManager handle a "server online" event, similar to how RegionServerTracker.java calls servermanager.expireServer when a server goes down.
      2. Make timeoutMonitor handle the situation better. This is a special situation in the cluster, so the 30-minute timeout can be skipped.
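
The first possible fix can be pictured with a small toy model. This is only a sketch under stated assumptions: `OnlineEventSketch`, `serverOnline`, and the other names are hypothetical and are not HBase classes or methods. It shows regions being parked while no server is online and retried as soon as a "server online" event arrives, instead of waiting for a timeout.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy model only: none of these names are HBase classes or methods. */
class OnlineEventSketch {
    private final Set<String> onlineServers = new LinkedHashSet<>();
    private final List<String> unassignedRegions = new ArrayList<>();
    private final List<String> assignments = new ArrayList<>();

    /** Try to assign a region; park it if no server is online. */
    void assign(String region) {
        if (onlineServers.isEmpty()) {
            unassignedRegions.add(region); // no viable location yet
            return;
        }
        // Pick the first known server; a real balancer would choose more carefully.
        String target = onlineServers.iterator().next();
        assignments.add(region + " on " + target);
    }

    /** Hypothetical "server online" hook: retry parked regions immediately. */
    void serverOnline(String server) {
        onlineServers.add(server);
        List<String> retry = new ArrayList<>(unassignedRegions);
        unassignedRegions.clear();
        for (String region : retry) {
            assign(region);
        }
    }

    List<String> assignments() { return assignments; }

    List<String> unassigned() { return unassignedRegions; }
}
```

In this model the retry is driven by the event itself, so the parked regions never depend on a timer.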

          Activity

          yuzhihong@gmail.com Ted Yu added a comment -

          I assume the fix for HBASE-4203 wasn't included when this problem occurred.
          That patch went in with HBASE-4015.

          mingma Ming Ma added a comment -

          The cluster uses r1167378, so it has HBASE-4015. HBASE-4203's scenario is slightly different: in that case the master restarts. This bug requires that the master not be restarted while all RSs are restarted.

          jeason Jieshan Bean added a comment -

          So we should give special treatment to the exception that occurs while ".META." or "-ROOT-" is being opened. "hbase.master.assignment.timeoutmonitor.timeout" is just used for user regions. 1800000 ms as the default timeout value seems too long.
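
As a quick check of the arithmetic, the sketch below converts the quoted default to minutes, assuming the value is in milliseconds. It uses only `java.time.Duration` from the JDK; `TimeoutArithmetic` is a hypothetical helper name, not part of HBase.

```java
import java.time.Duration;

/** Hypothetical helper: converts a millisecond timeout to whole minutes. */
class TimeoutArithmetic {
    static long toMinutes(long millis) {
        return Duration.ofMillis(millis).toMinutes();
    }

    public static void main(String[] args) {
        // 1,800,000 ms / 60,000 ms per minute = 30 minutes
        System.out.println(toMinutes(1_800_000L) + " minutes");
    }
}
```

This matches the 30-minute delay reported in the description.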

          ram_krish ramkrishna.s.vasudevan added a comment -

          @Ming Ma/Bijieshan
          Yes. That's why I handled the scenario of the master going down in HBASE-4203.
          I think these are variants of the same problem. I feel that waiting for the timeout is not needed for .META. and -ROOT-, as it may block all operations.

          mingma Ming Ma added a comment -

          There are two ways to address the issue.

          1. One way is to have special handling for the "-ROOT-" and ".META." tables.
          2. Another way is to handle the "all RSs come back online while the master has been up the whole time" scenario for all regions.

          The patch uses the second approach.
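
A rough way to picture the second approach is a timeout monitor with a fast path for all regions. This is a toy model and not the committed patch. The setter name `setAllRegionServersOffline` is taken from a later comment in this thread, while the class `TimeoutMonitorSketch` and the method `shouldReassign` are hypothetical.

```java
/** Toy model only: not the code that was committed to AssignmentManager. */
class TimeoutMonitorSketch {
    private final long timeoutMillis;
    private boolean allRegionServersOffline = false;

    TimeoutMonitorSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Set when an assignment found no live server; cleared by the caller later. */
    void setAllRegionServersOffline(boolean offline) {
        this.allRegionServersOffline = offline;
    }

    /**
     * Decide whether a region that has been in transition for elapsedMillis
     * should be reassigned now (hypothetical method).
     */
    boolean shouldReassign(long elapsedMillis, int onlineServerCount) {
        boolean waitedLongEnough = elapsedMillis >= timeoutMillis;
        // Fast path: every server was down earlier and at least one is back.
        boolean serversCameBack = allRegionServersOffline && onlineServerCount > 0;
        return waitedLongEnough || serversCameBack;
    }
}
```

Because the fast path does not look at the region name, it applies to user regions as well as the catalog tables.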

          zhihyu@ebaysf.com Ted Yu added a comment -

          +1 on patch.

          lhofhansl Lars Hofhansl added a comment -

          Nice find and patch... +1

          (As a side note: do we have to rethink this entire ROOT and META "huh hah"? Not a week goes by without some new bug about races between splitting and assignment, the master being stuck assigning ROOT/META, or similar cases. There are too many players that need to be kept in sync: the FS, ROOT/META, and ZooKeeper.)

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12508919/HBASE-4397-0.92.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 javadoc. The javadoc tool appears to have generated -151 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 77 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.coprocessor.TestMasterObserver
          org.apache.hadoop.hbase.replication.TestReplication
          org.apache.hadoop.hbase.mapred.TestTableMapReduce
          org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/642//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/642//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/642//console

          This message is automatically generated.

          zhihyu@ebaysf.com Ted Yu added a comment -

          Integrated to 0.92 and TRUNK.

          Thanks for the patch Ming.

          Thanks for the review, Lars.

          hudson Hudson added a comment -

          Integrated in HBase-TRUNK #2599 (See https://builds.apache.org/job/HBase-TRUNK/2599/)
          HBASE-4397 ROOT, .META. tables stay offline for too long in recovery phase after all RSs
          are shutdown at the same time (Ming Ma)

          tedyu :
          Files :

          • /hbase/trunk/CHANGES.txt
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
          hudson Hudson added a comment -

          Integrated in HBase-TRUNK-security #57 (See https://builds.apache.org/job/HBase-TRUNK-security/57/)
          HBASE-4397 ROOT, .META. tables stay offline for too long in recovery phase after all RSs
          are shutdown at the same time (Ming Ma)

          tedyu :
          Files :

          • /hbase/trunk/CHANGES.txt
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
          hudson Hudson added a comment -

          Integrated in HBase-0.92 #223 (See https://builds.apache.org/job/HBase-0.92/223/)
          HBASE-4397 ROOT, .META. tables stay offline for too long in recovery phase after all RSs
          are shutdown at the same time (Ming Ma)

          tedyu :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
          hudson Hudson added a comment -

          Integrated in HBase-0.92-security #57 (See https://builds.apache.org/job/HBase-0.92-security/57/)
          HBASE-4397 ROOT, .META. tables stay offline for too long in recovery phase after all RSs
          are shutdown at the same time (Ming Ma)

          tedyu :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
          stack stack added a comment -

          Nice one Ming.

          ram_krish ramkrishna.s.vasudevan added a comment - - edited

          Resolving as committed to trunk and 0.92.

          ram_krish ramkrishna.s.vasudevan added a comment -
                RegionPlan plan = getRegionPlan(state, forceNewPlan);
                if (plan == null) {
                  debugLog(state.getRegion(),
                      "Unable to determine a plan to assign " + state);
                  return; // Should get reassigned later when RIT times out.
                }
          

          I think that in this scenario too, the following should be done:

          this.timeoutMonitor.setAllRegionServersOffline(true);

          ram_krish ramkrishna.s.vasudevan added a comment -

          The addendum is in HBASE-5237.

          hudson Hudson added a comment -

          Integrated in HBase-TRUNK #2643 (See https://builds.apache.org/job/HBase-TRUNK/2643/)
          HBASE-5237 Addendum for HBASE-5160 and HBASE-4397 (Ram)

          ramkrishna :
          Files :

          • /hbase/trunk/CHANGES.txt
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
          hudson Hudson added a comment -

          Integrated in HBase-TRUNK-security #84 (See https://builds.apache.org/job/HBase-TRUNK-security/84/)
          HBASE-5237 Addendum for HBASE-5160 and HBASE-4397 (Ram)

          ramkrishna :
          Files :

          • /hbase/trunk/CHANGES.txt
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
          hudson Hudson added a comment -

          Integrated in HBase-0.92 #257 (See https://builds.apache.org/job/HBase-0.92/257/)
          HBASE-5237 Addendum for HBASE-5160 and HBASE-4397(Ram)

          ramkrishna :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
          hudson Hudson added a comment -

          Integrated in HBase-0.92-security #88 (See https://builds.apache.org/job/HBase-0.92-security/88/)
          HBASE-5237 Addendum for HBASE-5160 and HBASE-4397(Ram)

          ramkrishna :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java

            People

            • Assignee:
              mingma Ming Ma
              Reporter:
              mingma Ming Ma