Description
UnbalanceKillAndRebalanceAction kills region servers, runs a balance, and then starts the region servers again. If the balance fails, an exception is thrown and the region servers are never started. For me, the balance kept failing with a socket timeout (default 1 min) because the master runs one iteration of balance for 5 mins (default config). Eventually all servers were killed and none were started back.
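The failure mode described above can be reproduced in miniature. The following is only a sketch and not the HBase code: `FakeCluster`, `killBalanceStart`, and the server names are hypothetical stand-ins, and the balance call is hard-wired to time out so that the skipped restart is visible.

```java
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UnbalanceFailureSketch {

    /** Hypothetical stand-in for a cluster; it only tracks which servers are running. */
    static final class FakeCluster {
        final List<String> running = new ArrayList<>();

        FakeCluster(List<String> servers) {
            running.addAll(servers);
        }

        void kill(String server) {
            running.remove(server);
        }

        void start(String server) {
            running.add(server);
        }

        /** Assumption for the sketch: the balance request always times out. */
        void balance() throws Exception {
            throw new SocketTimeoutException("balance call timed out");
        }
    }

    /** Kill, balance, start with no guard around the balance step. */
    static void killBalanceStart(FakeCluster cluster, List<String> victims) throws Exception {
        for (String server : victims) {
            cluster.kill(server);
        }
        cluster.balance(); // throws, so the loop below is never reached
        for (String server : victims) {
            cluster.start(server);
        }
    }

    public static void main(String[] args) {
        List<String> servers = Arrays.asList("rs1", "rs2", "rs3");
        FakeCluster cluster = new FakeCluster(servers);
        try {
            killBalanceStart(cluster, servers);
        } catch (Exception e) {
            System.out.println("action aborted: " + e.getMessage());
        }
        System.out.println("servers still running: " + cluster.running.size());
    }
}
```

Because the exception propagates out of the action before the start loop, every killed server stays down, which matches the behaviour reported here.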
Attachments
- HBASE-12450.patch
- 2 kB
- Virag Kothari
- HBASE-12450-0.98.patch
- 2 kB
- Virag Kothari
- HBASE-12450.patch
- 2 kB
- Virag Kothari
Activity
Thanks for the quick review Andrew.
Attached is the patch for 0.98. The patch for master applies cleanly to branch-1.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12680323/HBASE-12450-0.98.patch
against trunk revision .
ATTACHMENT ID: 12680323
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 6 new or modified tests.
-1 patch. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/11617//console
This message is automatically generated.
Admin.balancer() may throw exceptions other than ServiceException (see HBASE-12072), so we should just catch Exception there. Other than that, looks good.
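To illustrate the suggestion, here is a minimal sketch of a balance step guarded by a broad catch. It is not the attached patch: the `Balancer` interface and `balanceQuietly` helper are hypothetical names, and the real Admin call is deliberately left out so the example runs without HBase on the classpath.

```java
import java.util.logging.Logger;

public class GuardedBalanceSketch {

    private static final Logger LOG = Logger.getLogger(GuardedBalanceSketch.class.getName());

    /** Hypothetical stand-in for whatever call triggers the balance. */
    interface Balancer {
        void balance() throws Exception;
    }

    /**
     * Runs the balance and reports whether it succeeded. Any Exception is
     * caught and logged as a warning, so the caller can carry on with
     * restarting the servers it killed.
     */
    static boolean balanceQuietly(Balancer balancer) {
        try {
            balancer.balance();
            return true;
        } catch (Exception e) {
            LOG.warning("Balance failed, continuing with restart: " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        // An unchecked exception is caught too, which a narrower catch clause would miss.
        boolean ok = balanceQuietly(() -> {
            throw new IllegalStateException("simulated balance failure");
        });
        System.out.println("balance succeeded: " + ok);
    }
}
```

Catching Exception rather than a single checked type means an unexpected failure type from the balance call cannot skip the restart step.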
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12680338/HBASE-12450.patch
against trunk revision .
ATTACHMENT ID: 12680338
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 6 new or modified tests.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 checkstyle. The applied patch does not increase the total number of checkstyle errors
+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 lineLengths. The patch does not introduce lines longer than 100
+1 site. The mvn site goal succeeds with this patch.
-1 core tests. The patch failed these unit tests:
org.apache.hadoop.hbase.regionserver.wal.TestLogRollingNoCluster
-1 core zombie tests. There are 1 zombie test(s): at org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage.testUsageWithMultipleContainersAndRMRestart(TestContainerResourceUsage.java:159)
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/11619//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11619//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11619//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11619//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11619//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11619//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11619//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11619//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11619//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11619//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11619//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11619//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/11619//artifact/patchprocess/checkstyle-aggregate.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/11619//console
This message is automatically generated.
The test failure seems unrelated to this change, and the Hadoop unit test zombie definitely is.
FAILURE: Integrated in HBase-0.98-on-Hadoop-1.1 #633 (See https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/633/)
HBASE-12450 Unbalance chaos monkey might kill all region servers without starting them back (Virag Kothari) (apurtell: rev 2a12bac8934f3faabc2a25441883c9829b9e157d)
- hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/Action.java
- hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/RestartRsHoldingTableAction.java
SUCCESS: Integrated in HBase-1.0 #447 (See https://builds.apache.org/job/HBase-1.0/447/)
HBASE-12450 Unbalance chaos monkey might kill all region servers without starting them back (Virag Kothari) (apurtell: rev 0145650cb0781cb0c1cc02c4e2354e22a395365a)
- hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/Action.java
- hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/RestartRsHoldingTableAction.java
SUCCESS: Integrated in HBase-TRUNK #5755 (See https://builds.apache.org/job/HBase-TRUNK/5755/)
HBASE-12450 Unbalance chaos monkey might kill all region servers without starting them back (Virag Kothari) (apurtell: rev 3b8c0769ccb63633d8baa0d402bea7cbfaf94e7f)
- hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/RestartRsHoldingTableAction.java
- hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/Action.java
SUCCESS: Integrated in HBase-0.98 #664 (See https://builds.apache.org/job/HBase-0.98/664/)
HBASE-12450 Unbalance chaos monkey might kill all region servers without starting them back (Virag Kothari) (apurtell: rev 2a12bac8934f3faabc2a25441883c9829b9e157d)
- hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/Action.java
- hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/RestartRsHoldingTableAction.java
Attached is a patch for master which just logs a warning if the balance fails.
It also includes one unrelated log statement change.