HBASE-12450: Unbalance chaos monkey might kill all region servers without starting them back

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.98.8, 0.99.2
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      UnbalanceKillAndRebalanceAction kills region servers, runs the balancer, and then starts the region servers again. But if the balance call fails, an exception is thrown and the region servers are never started. For me, the balance call kept failing with a socket timeout (default 1 min) because the master runs one iteration of balancing for up to 5 mins (default config). Eventually all servers are killed but never started back.
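
The failure mode described above can be sketched in a few lines. This is a minimal, self-contained illustration, not the actual HBase code: the `Cluster` interface, `FakeCluster`, and both method names are hypothetical stand-ins, and the fake balance call always fails with a socket timeout to mimic the report.

```java
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UnbalanceSketch {
    /** Hypothetical stand-in for the cluster operations the action relies on. */
    interface Cluster {
        void kill(String server);
        void start(String server);
        void balance() throws Exception;
    }

    /** In-memory cluster whose balance call always times out, as in the report. */
    static class FakeCluster implements Cluster {
        final List<String> running = new ArrayList<>(Arrays.asList("rs1", "rs2", "rs3"));
        public void kill(String server) { running.remove(server); }
        public void start(String server) { running.add(server); }
        public void balance() throws Exception {
            throw new SocketTimeoutException("balance call timed out");
        }
    }

    /** Problematic order: an exception from balance() skips the restart loop. */
    static void killBalanceStartUnsafe(Cluster c, List<String> victims) throws Exception {
        for (String s : victims) c.kill(s);
        c.balance();
        for (String s : victims) c.start(s); // never reached when balance() throws
    }

    /** Tolerant variant: a balance failure is only logged, so servers restart. */
    static void killBalanceStartSafe(Cluster c, List<String> victims) {
        for (String s : victims) c.kill(s);
        try {
            c.balance();
        } catch (Exception e) {
            System.err.println("WARN: balance failed, continuing: " + e.getMessage());
        }
        for (String s : victims) c.start(s);
    }

    public static void main(String[] args) {
        List<String> victims = Arrays.asList("rs1", "rs2");

        FakeCluster broken = new FakeCluster();
        try {
            killBalanceStartUnsafe(broken, victims);
        } catch (Exception e) {
            System.out.println("unsafe: still running " + broken.running);
        }

        FakeCluster fixed = new FakeCluster();
        killBalanceStartSafe(fixed, victims);
        System.out.println("safe: running " + fixed.running);
    }
}
```

If the unsafe variant runs repeatedly against different victims, each failed balance leaves more servers down, which matches the "eventually all servers are killed" observation.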

      Attachments

        1. HBASE-12450.patch (2 kB, Virag Kothari)
        2. HBASE-12450-0.98.patch (2 kB, Virag Kothari)
        3. HBASE-12450.patch (2 kB, Virag Kothari)

        Activity

          virag Virag Kothari added a comment -

          Attached is a patch for master that just logs a warning if the balance fails.
          It also includes one unrelated log statement change.

          virag Virag Kothari added a comment -

          Thanks for the quick review, Andrew.
          Attached is the patch for 0.98. The patch for master applies cleanly to branch-1.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12680323/HBASE-12450-0.98.patch
          against trunk revision .
          ATTACHMENT ID: 12680323

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/11617//console

          This message is automatically generated.

          virag Virag Kothari added a comment -

          Reattaching the master patch, as precommit ran against the 0.98 patch.

          enis Enis Soztutar added a comment -

          Admin.balancer() may throw exceptions other than ServiceException (see HBASE-12072), so we should just catch Exception there. Other than that, looks good.
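
The difference between catching a narrow exception type and catching Exception can be sketched as follows. This is an illustrative example only: the `ServiceException` class here is a local stand-in declared for the sketch, and `Balancer`, `balanceNarrow`, and `balanceBroad` are hypothetical names rather than HBase APIs.

```java
import java.io.IOException;

public class CatchBreadthSketch {
    /** Local stand-in for the RPC-layer exception mentioned in the review. */
    static class ServiceException extends Exception {
        ServiceException(String msg) { super(msg); }
    }

    /** Hypothetical wrapper around a balance call that may fail in several ways. */
    interface Balancer {
        void balance() throws Exception;
    }

    /** Handles only ServiceException; any other failure escapes to the caller. */
    static boolean balanceNarrow(Balancer b) throws Exception {
        try {
            b.balance();
            return true;
        } catch (ServiceException e) {
            System.err.println("WARN: balance failed: " + e.getMessage());
            return false;
        }
    }

    /** Handles every Exception, so the caller can always continue. */
    static boolean balanceBroad(Balancer b) {
        try {
            b.balance();
            return true;
        } catch (Exception e) {
            System.err.println("WARN: balance failed: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        Balancer ioFailure = () -> { throw new IOException("socket timeout"); };

        System.out.println("broad handled it: " + !balanceBroad(ioFailure));
        try {
            balanceNarrow(ioFailure);
        } catch (Exception e) {
            System.out.println("narrow let it escape: " + e);
        }
    }
}
```

With the narrow catch, a failure of an unexpected type would still abort the action before the region servers are started, which is the situation the broader catch avoids.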


          apurtell Andrew Kyle Purtell added a comment -

          Thanks enis I will make that amendment upon commit.

          apurtell Andrew Kyle Purtell added a comment -

          Pushing to 0.98+ shortly unless objection
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12680338/HBASE-12450.patch
          against trunk revision .
          ATTACHMENT ID: 12680338

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 checkstyle. The applied patch does not increase the total number of checkstyle errors

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 lineLengths. The patch does not introduce lines longer than 100

          +1 site. The mvn site goal succeeds with this patch.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.regionserver.wal.TestLogRollingNoCluster

          -1 core zombie tests. There are 1 zombie test(s): at org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage.testUsageWithMultipleContainersAndRMRestart(TestContainerResourceUsage.java:159)

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/11619//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11619//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11619//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11619//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11619//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11619//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11619//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11619//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11619//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11619//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11619//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11619//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
          Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/11619//artifact/patchprocess/checkstyle-aggregate.html

          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/11619//console

          This message is automatically generated.


          apurtell Andrew Kyle Purtell added a comment -

          Test failure seems unrelated to this change and Hadoop unit test zombie definitely is.

          apurtell Andrew Kyle Purtell added a comment -

          Pushed to 0.98+
          hudson Hudson added a comment -

          FAILURE: Integrated in HBase-0.98-on-Hadoop-1.1 #633 (See https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/633/)
          HBASE-12450 Unbalance chaos monkey might kill all region servers without starting them back (Virag Kothari) (apurtell: rev 2a12bac8934f3faabc2a25441883c9829b9e157d)

          • hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/Action.java
          • hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/RestartRsHoldingTableAction.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in HBase-1.0 #447 (See https://builds.apache.org/job/HBase-1.0/447/)
          HBASE-12450 Unbalance chaos monkey might kill all region servers without starting them back (Virag Kothari) (apurtell: rev 0145650cb0781cb0c1cc02c4e2354e22a395365a)

          • hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/Action.java
          • hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/RestartRsHoldingTableAction.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in HBase-TRUNK #5755 (See https://builds.apache.org/job/HBase-TRUNK/5755/)
          HBASE-12450 Unbalance chaos monkey might kill all region servers without starting them back (Virag Kothari) (apurtell: rev 3b8c0769ccb63633d8baa0d402bea7cbfaf94e7f)

          • hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/RestartRsHoldingTableAction.java
          • hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/Action.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in HBase-0.98 #664 (See https://builds.apache.org/job/HBase-0.98/664/)
          HBASE-12450 Unbalance chaos monkey might kill all region servers without starting them back (Virag Kothari) (apurtell: rev 2a12bac8934f3faabc2a25441883c9829b9e157d)

          • hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/Action.java
          • hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/RestartRsHoldingTableAction.java
          enis Enis Soztutar added a comment -

          Closing this issue after 0.99.2 release.


          People

            Assignee: Virag Kothari
            Reporter: Virag Kothari
            Votes: 0
            Watchers: 4
