HBase
  1. HBase
  2. HBASE-5099

ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.92.0, 0.94.0
    • Fix Version/s: 0.92.0, 0.94.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      A RS died. The ServerShutdownHandler kicked in and started the logspliting. SpliLogManager
      installed the tasks asynchronously, then started to wait for them to complete.

      The task znodes were not created actually. The requests were just queued.
      At this time, the zookeeper connection expired. HMaster tried to recover the expired ZK session.
      During the recovery, a new zookeeper connection was created. However, this master became the
      new master again. It tried to assign root and meta.

      Because the dead RS got the old root region, the master needs to wait for the log splitting to complete.
      This waiting holds the zookeeper event thread. So the async create split task is never retried since
      there is only one event thread, which is waiting for the root region assigned.

      1. ZK-event-thread-waiting-for-root.png
        35 kB
        Jimmy Xiang
      2. hbase-5099-v6.patch
        7 kB
        Jimmy Xiang
      3. hbase-5099-v5.patch
        7 kB
        Jimmy Xiang
      4. hbase-5099-v4.patch
        7 kB
        Jimmy Xiang
      5. hbase-5099-v3.patch
        7 kB
        Jimmy Xiang
      6. hbase-5099-v2.patch
        6 kB
        Jimmy Xiang
      7. hbase-5099.patch
        6 kB
        Jimmy Xiang
      8. distributed-log-splitting-hangs.png
        42 kB
        Jimmy Xiang
      9. 5099.92
        9 kB
        Ted Yu

        Issue Links

          Activity

          Hide
          Jimmy Xiang added a comment -

          The timeout monitor thread in split log manager keeps logging this:

          2011-12-27 00:00:06,759 DEBUG master.SplitLogManager (SplitLogManager.java:chore(846)) - total tasks = 12 unassigned = 12

          It shows no task assigned since no task znode was created yet.

          Checked zookeeper, and the there was nothing under /hbase/splitlog

          Show
          Jimmy Xiang added a comment - The timeout monitor thread in split log manager keeps logging this: 2011-12-27 00:00:06,759 DEBUG master.SplitLogManager (SplitLogManager.java:chore(846)) - total tasks = 12 unassigned = 12 It shows no task assigned since no task znode was created yet. Checked zookeeper, and the there was nothing under /hbase/splitlog
          Hide
          Jimmy Xiang added a comment -

          After we restarted the master, distributed log splitting can go through and the system is back to normal.

          This can happen if zk session is expired, during tryRecoveringExpiredZKSession(), if it cannot connect to the region server
          where root or meta is on, this region will be marked dead. Then distributed log splitting is kicked in. But it won't go
          through since the eventThread is waiting for the root/meta region assigned.

          So this problem will happen whenever master tryRecoveringExpiredZKSession and the region server with root/meta is not available.

          One fix is to using sync call in accessing zk in distributed log splitting. Another one is to abort the master anyway
          (don't try to recover since it may not be recoverable) if the region server with root/meta region is not reachable.

          Show
          Jimmy Xiang added a comment - After we restarted the master, distributed log splitting can go through and the system is back to normal. This can happen if zk session is expired, during tryRecoveringExpiredZKSession(), if it cannot connect to the region server where root or meta is on, this region will be marked dead. Then distributed log splitting is kicked in. But it won't go through since the eventThread is waiting for the root/meta region assigned. So this problem will happen whenever master tryRecoveringExpiredZKSession and the region server with root/meta is not available. One fix is to using sync call in accessing zk in distributed log splitting. Another one is to abort the master anyway (don't try to recover since it may not be recoverable) if the region server with root/meta region is not reachable.
          Hide
          Ted Yu added a comment -

          Another one is to abort the master

          The above sounds better. How about introducing a timeout for:

                this.assignmentManager.waitForAssignment(HRegionInfo.ROOT_REGIONINFO);
          

          after which master aborts ?

          Show
          Ted Yu added a comment - Another one is to abort the master The above sounds better. How about introducing a timeout for: this .assignmentManager.waitForAssignment(HRegionInfo.ROOT_REGIONINFO); after which master aborts ?
          Hide
          Jimmy Xiang added a comment -

          This is good. If introducing a timeout, I prefer to do it for tryRecoveringExpiredZKSession().
          The reason for that is, other than waitForAssignment, there are several other places which have the waiting logic as well,
          such as bulkAssign(), waitForRoot(), this.activeMasterManager.blockUntilBecomingActiveMaster(startupStatus), etc.

          Show
          Jimmy Xiang added a comment - This is good. If introducing a timeout, I prefer to do it for tryRecoveringExpiredZKSession(). The reason for that is, other than waitForAssignment, there are several other places which have the waiting logic as well, such as bulkAssign(), waitForRoot(), this.activeMasterManager.blockUntilBecomingActiveMaster(startupStatus), etc.
          Hide
          Ted Yu added a comment -

          tryRecoveringExpiredZKSession() is private and only called by abortNow().
          Are you suggesting adding timeout for all the methods mentioned above ?

          Show
          Ted Yu added a comment - tryRecoveringExpiredZKSession() is private and only called by abortNow(). Are you suggesting adding timeout for all the methods mentioned above ?
          Hide
          Jimmy Xiang added a comment -

          tryRecoveringExpiredZKSession() is only called by abortNow(), which is called by abort(), which is called by the
          eventThread. I was thinking to put this whole method in another thread with executor service and time it out after a certain time, for example, 5 minutes, then fails the recovery and let it abort.

          This way, we don't have to adding timeout for all the methods. The regular master startup is not impacted which calls assignRootAndMeta() too.

          However, if we know most likely just waitForAssignment() takes a long time, we can add timeout to this method only. But I am not so sure.

          Show
          Jimmy Xiang added a comment - tryRecoveringExpiredZKSession() is only called by abortNow(), which is called by abort(), which is called by the eventThread. I was thinking to put this whole method in another thread with executor service and time it out after a certain time, for example, 5 minutes, then fails the recovery and let it abort. This way, we don't have to adding timeout for all the methods. The regular master startup is not impacted which calls assignRootAndMeta() too. However, if we know most likely just waitForAssignment() takes a long time, we can add timeout to this method only. But I am not so sure.
          Hide
          Ted Yu added a comment -

          Timing out tryRecoveringExpiredZKSession() is good.
          We can introduce a separate thread to carry the current logic of tryRecoveringExpiredZKSession() and monitor this thread in eventThread.

          Show
          Ted Yu added a comment - Timing out tryRecoveringExpiredZKSession() is good. We can introduce a separate thread to carry the current logic of tryRecoveringExpiredZKSession() and monitor this thread in eventThread.
          Hide
          Jimmy Xiang added a comment -

          Cool, let me submit a patch.

          Show
          Jimmy Xiang added a comment - Cool, let me submit a patch.
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3323/
          -----------------------------------------------------------

          Review request for hbase, Ted Yu and Michael Stack.

          Summary
          -------

          Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.

          I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested.

          This addresses bug HBASE-5099.
          https://issues.apache.org/jira/browse/HBASE-5099

          Diffs


          src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6
          src/test/java/org/apache/hadoop/hbase/master/TestMaster.java b7a8270

          Diff: https://reviews.apache.org/r/3323/diff

          Testing
          -------

          mvn -PlocalTests -Dtest=TestMaster* clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3323/ ----------------------------------------------------------- Review request for hbase, Ted Yu and Michael Stack. Summary ------- Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere. I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested. This addresses bug HBASE-5099 . https://issues.apache.org/jira/browse/HBASE-5099 Diffs src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 src/test/java/org/apache/hadoop/hbase/master/TestMaster.java b7a8270 Diff: https://reviews.apache.org/r/3323/diff Testing ------- mvn -PlocalTests -Dtest=TestMaster* clean test Thanks, Jimmy
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3323/#review4136
          -----------------------------------------------------------

          Nice effort.

          src/main/java/org/apache/hadoop/hbase/master/HMaster.java
          <https://reviews.apache.org/r/3323/#comment9339>

          We should check the return value here.
          False return value means timeout.

          src/main/java/org/apache/hadoop/hbase/master/HMaster.java
          <https://reviews.apache.org/r/3323/#comment9340>

          I think we should catch ExecutionException (ee) coming out of the get() call and rethrow ee.getCause().

          src/test/java/org/apache/hadoop/hbase/master/TestMaster.java
          <https://reviews.apache.org/r/3323/#comment9343>

          TestMaster doesn't have timeout parameter.
          We'd better add timeout parameter here.

          src/test/java/org/apache/hadoop/hbase/master/TestMaster.java
          <https://reviews.apache.org/r/3323/#comment9341>

          Type: recovery.

          src/test/java/org/apache/hadoop/hbase/master/TestMaster.java
          <https://reviews.apache.org/r/3323/#comment9342>

          I would expect some assertion here.

          • Ted

          On 2011-12-28 19:31:45, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3323/

          -----------------------------------------------------------

          (Updated 2011-12-28 19:31:45)

          Review request for hbase, Ted Yu and Michael Stack.

          Summary

          -------

          Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.

          I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested.

          This addresses bug HBASE-5099.

          https://issues.apache.org/jira/browse/HBASE-5099

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6

          src/test/java/org/apache/hadoop/hbase/master/TestMaster.java b7a8270

          Diff: https://reviews.apache.org/r/3323/diff

          Testing

          -------

          mvn -PlocalTests -Dtest=TestMaster* clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3323/#review4136 ----------------------------------------------------------- Nice effort. src/main/java/org/apache/hadoop/hbase/master/HMaster.java < https://reviews.apache.org/r/3323/#comment9339 > We should check the return value here. False return value means timeout. src/main/java/org/apache/hadoop/hbase/master/HMaster.java < https://reviews.apache.org/r/3323/#comment9340 > I think we should catch ExecutionException (ee) coming out of the get() call and rethrow ee.getCause(). src/test/java/org/apache/hadoop/hbase/master/TestMaster.java < https://reviews.apache.org/r/3323/#comment9343 > TestMaster doesn't have timeout parameter. We'd better add timeout parameter here. src/test/java/org/apache/hadoop/hbase/master/TestMaster.java < https://reviews.apache.org/r/3323/#comment9341 > Type: recovery. src/test/java/org/apache/hadoop/hbase/master/TestMaster.java < https://reviews.apache.org/r/3323/#comment9342 > I would expect some assertion here. Ted On 2011-12-28 19:31:45, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3323/ ----------------------------------------------------------- (Updated 2011-12-28 19:31:45) Review request for hbase, Ted Yu and Michael Stack. Summary ------- Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere. I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested. This addresses bug HBASE-5099 . https://issues.apache.org/jira/browse/HBASE-5099 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 src/test/java/org/apache/hadoop/hbase/master/TestMaster.java b7a8270 Diff: https://reviews.apache.org/r/3323/diff Testing ------- mvn -PlocalTests -Dtest=TestMaster* clean test Thanks, Jimmy
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12508783/hbase-5099.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated -151 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.mapred.TestTableMapReduce
          org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/613//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/613//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/613//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508783/hbase-5099.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 javadoc. The javadoc tool appears to have generated -151 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.mapred.TestTableMapReduce org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/613//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/613//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/613//console This message is automatically generated.
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3323/
          -----------------------------------------------------------

          (Updated 2011-12-28 22:38:05.839771)

          Review request for hbase, Ted Yu and Michael Stack.

          Changes
          -------

          Thanks for the comment. Here is a new patch.

          Summary
          -------

          Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.

          I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested.

          This addresses bug HBASE-5099.
          https://issues.apache.org/jira/browse/HBASE-5099

          Diffs (updated)


          src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6
          src/test/java/org/apache/hadoop/hbase/master/TestMaster.java b7a8270

          Diff: https://reviews.apache.org/r/3323/diff

          Testing
          -------

          mvn -PlocalTests -Dtest=TestMaster* clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3323/ ----------------------------------------------------------- (Updated 2011-12-28 22:38:05.839771) Review request for hbase, Ted Yu and Michael Stack. Changes ------- Thanks for the comment. Here is a new patch. Summary ------- Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere. I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested. This addresses bug HBASE-5099 . https://issues.apache.org/jira/browse/HBASE-5099 Diffs (updated) src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 src/test/java/org/apache/hadoop/hbase/master/TestMaster.java b7a8270 Diff: https://reviews.apache.org/r/3323/diff Testing ------- mvn -PlocalTests -Dtest=TestMaster* clean test Thanks, Jimmy
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3323/#review4138
          -----------------------------------------------------------

          src/main/java/org/apache/hadoop/hbase/master/HMaster.java
          <https://reviews.apache.org/r/3323/#comment9344>

          I think declaring ExecutionException is better than declaring Throwable.
          See comment below.

          src/main/java/org/apache/hadoop/hbase/master/HMaster.java
          <https://reviews.apache.org/r/3323/#comment9345>

          The call() at line 1429 would only throw 3 types of exceptions: IE, IOE and KE.

          So the cause should be one the three types.
          I suggest using instanceof to check against each of them and if a match is found, cast as that exception type and rethrow.

          If there is no match, I suggest using this ctor to throw a new IOException:
          IOException(String message, Throwable cause)

          • Ted

          On 2011-12-28 22:38:05, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3323/

          -----------------------------------------------------------

          (Updated 2011-12-28 22:38:05)

          Review request for hbase, Ted Yu and Michael Stack.

          Summary

          -------

          Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.

          I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested.

          This addresses bug HBASE-5099.

          https://issues.apache.org/jira/browse/HBASE-5099

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6

          src/test/java/org/apache/hadoop/hbase/master/TestMaster.java b7a8270

          Diff: https://reviews.apache.org/r/3323/diff

          Testing

          -------

          mvn -PlocalTests -Dtest=TestMaster* clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3323/#review4138 ----------------------------------------------------------- src/main/java/org/apache/hadoop/hbase/master/HMaster.java < https://reviews.apache.org/r/3323/#comment9344 > I think declaring ExecutionException is better than declaring Throwable. See comment below. src/main/java/org/apache/hadoop/hbase/master/HMaster.java < https://reviews.apache.org/r/3323/#comment9345 > The call() at line 1429 would only throw 3 types of exceptions: IE, IOE and KE. So the cause should be one the three types. I suggest using instanceof to check against each of them and if a match is found, cast as that exception type and rethrow. If there is no match, I suggest using this ctor to throw a new IOException: IOException(String message, Throwable cause) Ted On 2011-12-28 22:38:05, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3323/ ----------------------------------------------------------- (Updated 2011-12-28 22:38:05) Review request for hbase, Ted Yu and Michael Stack. Summary ------- Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere. I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested. This addresses bug HBASE-5099 . https://issues.apache.org/jira/browse/HBASE-5099 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 src/test/java/org/apache/hadoop/hbase/master/TestMaster.java b7a8270 Diff: https://reviews.apache.org/r/3323/diff Testing ------- mvn -PlocalTests -Dtest=TestMaster* clean test Thanks, Jimmy
          Hide
          Jimmy Xiang added a comment -

          Updated per review comments. The failed tests: TestHFileOutputFormat,TestTableMapReduce work fine on my box.

          Show
          Jimmy Xiang added a comment - Updated per review comments. The failed tests: TestHFileOutputFormat,TestTableMapReduce work fine on my box.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12508799/hbase-5099-v2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated -151 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat
          org.apache.hadoop.hbase.mapred.TestTableMapReduce

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/617//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/617//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/617//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508799/hbase-5099-v2.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 javadoc. The javadoc tool appears to have generated -151 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat org.apache.hadoop.hbase.mapred.TestTableMapReduce Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/617//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/617//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/617//console This message is automatically generated.
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3323/
          -----------------------------------------------------------

          (Updated 2011-12-29 18:38:54.473524)

          Review request for hbase, Ted Yu and Michael Stack.

          Changes
          -------

          Updated per Ted's comment. Also separated the test class and added a possible testcase.

          Summary
          -------

          Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.

          I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested.

          This addresses bug HBASE-5099.
          https://issues.apache.org/jira/browse/HBASE-5099

          Diffs (updated)


          src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6
          src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java PRE-CREATION

          Diff: https://reviews.apache.org/r/3323/diff

          Testing
          -------

          mvn -PlocalTests -Dtest=TestMaster* clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3323/ ----------------------------------------------------------- (Updated 2011-12-29 18:38:54.473524) Review request for hbase, Ted Yu and Michael Stack. Changes ------- Updated per Ted's comment. Also separated the test class and added a possible testcase. Summary ------- Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere. I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested. This addresses bug HBASE-5099 . https://issues.apache.org/jira/browse/HBASE-5099 Diffs (updated) src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java PRE-CREATION Diff: https://reviews.apache.org/r/3323/diff Testing ------- mvn -PlocalTests -Dtest=TestMaster* clean test Thanks, Jimmy
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3323/#review4148
          -----------------------------------------------------------

          The code is much closer to what was in my mind.
          The new test needs improvement.

          src/main/java/org/apache/hadoop/hbase/master/HMaster.java
          <https://reviews.apache.org/r/3323/#comment9361>

          Line is 88 chars long.
          If ExecutorService is imported, the line should be much shorter.

          src/main/java/org/apache/hadoop/hbase/master/HMaster.java
          <https://reviews.apache.org/r/3323/#comment9364>

          I suggest recording the amount of time taken by the recovery and log it after executor.awaitTermination() returns.

          src/main/java/org/apache/hadoop/hbase/master/HMaster.java
          <https://reviews.apache.org/r/3323/#comment9362>

          Timeout value should be included in the exception message.

          src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java
          <https://reviews.apache.org/r/3323/#comment9365>

          Year is not needed.

          src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java
          <https://reviews.apache.org/r/3323/#comment9363>

          This shouldn't be small test since mini cluster is involved.

          src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java
          <https://reviews.apache.org/r/3323/#comment9366>

          Please add short javadoc for this new class.

          src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java
          <https://reviews.apache.org/r/3323/#comment9368>

          We shouldn't pass 1 here since that means 1 master.

          src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java
          <https://reviews.apache.org/r/3323/#comment9367>

          Since the test doesn't involve standby master, I think we should use a different name.

          • Ted

          On 2011-12-29 18:38:54, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3323/

          -----------------------------------------------------------

          (Updated 2011-12-29 18:38:54)

          Review request for hbase, Ted Yu and Michael Stack.

          Summary

          -------

          Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.

          I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested.

          This addresses bug HBASE-5099.

          https://issues.apache.org/jira/browse/HBASE-5099

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6

          src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java PRE-CREATION

          Diff: https://reviews.apache.org/r/3323/diff

          Testing

          -------

          mvn -PlocalTests -Dtest=TestMaster* clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3323/#review4148 ----------------------------------------------------------- The code is much closer to what was in my mind. The new test needs improvement. src/main/java/org/apache/hadoop/hbase/master/HMaster.java < https://reviews.apache.org/r/3323/#comment9361 > Line is 88 chars long. If ExecutorService is imported, the line should be much shorter. src/main/java/org/apache/hadoop/hbase/master/HMaster.java < https://reviews.apache.org/r/3323/#comment9364 > I suggest recording the amount of time taken by the recovery and log it after executor.awaitTermination() returns. src/main/java/org/apache/hadoop/hbase/master/HMaster.java < https://reviews.apache.org/r/3323/#comment9362 > Timeout value should be included in the exception message. src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java < https://reviews.apache.org/r/3323/#comment9365 > Year is not needed. src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java < https://reviews.apache.org/r/3323/#comment9363 > This shouldn't be small test since mini cluster is involved. src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java < https://reviews.apache.org/r/3323/#comment9366 > Please add short javadoc for this new class. src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java < https://reviews.apache.org/r/3323/#comment9368 > We shouldn't pass 1 here since that means 1 master. src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java < https://reviews.apache.org/r/3323/#comment9367 > Since the test doesn't involve standby master, I think we should use a different name. Ted On 2011-12-29 18:38:54, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3323/ ----------------------------------------------------------- (Updated 2011-12-29 18:38:54) Review request for hbase, Ted Yu and Michael Stack. Summary ------- Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere. I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested. This addresses bug HBASE-5099 . https://issues.apache.org/jira/browse/HBASE-5099 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java PRE-CREATION Diff: https://reviews.apache.org/r/3323/diff Testing ------- mvn -PlocalTests -Dtest=TestMaster* clean test Thanks, Jimmy
          Hide
          Ted Yu added a comment -

          We shouldn't pass 1 here since that means 1 master.

          The new test is different from TestMasterFailover so I guess 1 could be fine here.

          How about naming the new test TestZKSessionRecoveryInMaster ?

          Show
          Ted Yu added a comment - We shouldn't pass 1 here since that means 1 master. The new test is different from TestMasterFailover so I guess 1 could be fine here. How about naming the new test TestZKSessionRecoveryInMaster ?
          Hide
          jiraposter@reviews.apache.org added a comment -

          On 2011-12-29 19:02:17, Ted Yu wrote:

          > src/main/java/org/apache/hadoop/hbase/master/HMaster.java, line 1475

          > <https://reviews.apache.org/r/3323/diff/3/?file=65929#file65929line1475>

          >

          > Timeout value should be included in the exception message.

          It may not because of timeout.

          On 2011-12-29 19:02:17, Ted Yu wrote:

          > src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java, line 2

          > <https://reviews.apache.org/r/3323/diff/3/?file=65930#file65930line2>

          >

          > Year is not needed.

          I copied it from another test case. I can remove it.

          On 2011-12-29 19:02:17, Ted Yu wrote:

          > src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java, line 34

          > <https://reviews.apache.org/r/3323/diff/3/?file=65930#file65930line34>

          >

          > This shouldn't be small test since mini cluster is involved.

          Ok.

          On 2011-12-29 19:02:17, Ted Yu wrote:

          > src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java, line 41

          > <https://reviews.apache.org/r/3323/diff/3/?file=65930#file65930line41>

          >

          > We shouldn't pass 1 here since that means 1 master.

          It means 1 region server. Actually, I do want 1 master so it can recover itself. Otherwise, the backup master will take over and the active master doesn't have a chance to recover in this scenario.

          On 2011-12-29 19:02:17, Ted Yu wrote:

          > src/main/java/org/apache/hadoop/hbase/master/HMaster.java, line 1452

          > <https://reviews.apache.org/r/3323/diff/3/?file=65929#file65929line1452>

          >

          > Line is 88 chars long.

          > If ExecutorService is imported, the line should be much shorter.

          We can't do this since there is already an ExecutorService (from hbase) imported. I can't use the hbase ExecutorService because it doesn't fit.

          On 2011-12-29 19:02:17, Ted Yu wrote:

          > src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java, line 50

          > <https://reviews.apache.org/r/3323/diff/3/?file=65930#file65930line50>

          >

          > Since the test doesn't involve standby master, I think we should use a different name.

          There is test case called TestMasterFailover which involves standby master. That's why I called it TestMasterRecovery. How about TestMasterZKSessionRecovery?

          • Jimmy

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3323/#review4148
          -----------------------------------------------------------

          On 2011-12-29 18:38:54, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3323/

          -----------------------------------------------------------

          (Updated 2011-12-29 18:38:54)

          Review request for hbase, Ted Yu and Michael Stack.

          Summary

          -------

          Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.

          I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested.

          This addresses bug HBASE-5099.

          https://issues.apache.org/jira/browse/HBASE-5099

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6

          src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java PRE-CREATION

          Diff: https://reviews.apache.org/r/3323/diff

          Testing

          -------

          mvn -PlocalTests -Dtest=TestMaster* clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - On 2011-12-29 19:02:17, Ted Yu wrote: > src/main/java/org/apache/hadoop/hbase/master/HMaster.java, line 1475 > < https://reviews.apache.org/r/3323/diff/3/?file=65929#file65929line1475 > > > Timeout value should be included in the exception message. It may not because of timeout. On 2011-12-29 19:02:17, Ted Yu wrote: > src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java, line 2 > < https://reviews.apache.org/r/3323/diff/3/?file=65930#file65930line2 > > > Year is not needed. I copied it from another test case. I can remove it. On 2011-12-29 19:02:17, Ted Yu wrote: > src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java, line 34 > < https://reviews.apache.org/r/3323/diff/3/?file=65930#file65930line34 > > > This shouldn't be small test since mini cluster is involved. Ok. On 2011-12-29 19:02:17, Ted Yu wrote: > src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java, line 41 > < https://reviews.apache.org/r/3323/diff/3/?file=65930#file65930line41 > > > We shouldn't pass 1 here since that means 1 master. It means 1 region server. Actually, I do want 1 master so it can recover itself. Otherwise, the backup master will take over and the active master doesn't have a chance to recover in this scenario. On 2011-12-29 19:02:17, Ted Yu wrote: > src/main/java/org/apache/hadoop/hbase/master/HMaster.java, line 1452 > < https://reviews.apache.org/r/3323/diff/3/?file=65929#file65929line1452 > > > Line is 88 chars long. > If ExecutorService is imported, the line should be much shorter. We can't do this since there is already an ExecutorService (from hbase) imported. I can't use the hbase ExecutorService because it doesn't fit. On 2011-12-29 19:02:17, Ted Yu wrote: > src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java, line 50 > < https://reviews.apache.org/r/3323/diff/3/?file=65930#file65930line50 > > > Since the test doesn't involve standby master, I think we should use a different name. There is test case called TestMasterFailover which involves standby master. That's why I called it TestMasterRecovery. How about TestMasterZKSessionRecovery? Jimmy ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3323/#review4148 ----------------------------------------------------------- On 2011-12-29 18:38:54, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3323/ ----------------------------------------------------------- (Updated 2011-12-29 18:38:54) Review request for hbase, Ted Yu and Michael Stack. Summary ------- Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere. I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested. This addresses bug HBASE-5099 . https://issues.apache.org/jira/browse/HBASE-5099 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java PRE-CREATION Diff: https://reviews.apache.org/r/3323/diff Testing ------- mvn -PlocalTests -Dtest=TestMaster* clean test Thanks, Jimmy
          Hide
          Ted Yu added a comment -

          TestMasterZKSessionRecovery is a good name for the new test.

          Show
          Ted Yu added a comment - TestMasterZKSessionRecovery is a good name for the new test.
          Hide
          jiraposter@reviews.apache.org added a comment -

          On 2011-12-28 22:52:56, Ted Yu wrote:

          > src/main/java/org/apache/hadoop/hbase/master/HMaster.java, line 1463

          > <https://reviews.apache.org/r/3323/diff/2/?file=65902#file65902line1463>

          >

          > The call() at line 1429 would only throw 3 types of exceptions: IE, IOE and KE.

          >

          > So the cause should be one the three types.

          > I suggest using instanceof to check against each of them and if a match is found, cast as that exception type and rethrow.

          >

          > If there is no match, I suggest using this ctor to throw a new IOException:

          > IOException(String message, Throwable cause)

          How about just throw ExecutionException? It will be much simpler and the caller doesn't care about which exception it is in this case.

          • Jimmy

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3323/#review4138
          -----------------------------------------------------------

          On 2011-12-29 18:38:54, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3323/

          -----------------------------------------------------------

          (Updated 2011-12-29 18:38:54)

          Review request for hbase, Ted Yu and Michael Stack.

          Summary

          -------

          Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.

          I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested.

          This addresses bug HBASE-5099.

          https://issues.apache.org/jira/browse/HBASE-5099

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6

          src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java PRE-CREATION

          Diff: https://reviews.apache.org/r/3323/diff

          Testing

          -------

          mvn -PlocalTests -Dtest=TestMaster* clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - On 2011-12-28 22:52:56, Ted Yu wrote: > src/main/java/org/apache/hadoop/hbase/master/HMaster.java, line 1463 > < https://reviews.apache.org/r/3323/diff/2/?file=65902#file65902line1463 > > > The call() at line 1429 would only throw 3 types of exceptions: IE, IOE and KE. > > So the cause should be one the three types. > I suggest using instanceof to check against each of them and if a match is found, cast as that exception type and rethrow. > > If there is no match, I suggest using this ctor to throw a new IOException: > IOException(String message, Throwable cause) How about just throw ExecutionException? It will be much simpler and the caller doesn't care about which exception it is in this case. Jimmy ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3323/#review4138 ----------------------------------------------------------- On 2011-12-29 18:38:54, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3323/ ----------------------------------------------------------- (Updated 2011-12-29 18:38:54) Review request for hbase, Ted Yu and Michael Stack. Summary ------- Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere. I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested. This addresses bug HBASE-5099 . https://issues.apache.org/jira/browse/HBASE-5099 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 src/test/java/org/apache/hadoop/hbase/master/TestMasterRecovery.java PRE-CREATION Diff: https://reviews.apache.org/r/3323/diff Testing ------- mvn -PlocalTests -Dtest=TestMaster* clean test Thanks, Jimmy
          Hide
          Ted Yu added a comment -

          How about just throw ExecutionException?

          That is fine. tryRecoveringExpiredZKSession() is private.

          Show
          Ted Yu added a comment - How about just throw ExecutionException? That is fine. tryRecoveringExpiredZKSession() is private.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12508868/hbase-5099-v3.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated -151 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in .

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/628//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/628//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/628//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508868/hbase-5099-v3.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 2 new or modified tests. -1 javadoc. The javadoc tool appears to have generated -151 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/628//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/628//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/628//console This message is automatically generated.
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3323/
          -----------------------------------------------------------

          (Updated 2011-12-29 20:41:26.086617)

          Review request for hbase, Ted Yu and Michael Stack.

          Changes
          -------

          Renamed the test class. tryRecoveringExpiredZKSession() throws ExecutionException now.

          Summary
          -------

          Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.

          I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested.

          This addresses bug HBASE-5099.
          https://issues.apache.org/jira/browse/HBASE-5099

          Diffs (updated)


          src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6
          src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java PRE-CREATION

          Diff: https://reviews.apache.org/r/3323/diff

          Testing
          -------

          mvn -PlocalTests -Dtest=TestMaster* clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3323/ ----------------------------------------------------------- (Updated 2011-12-29 20:41:26.086617) Review request for hbase, Ted Yu and Michael Stack. Changes ------- Renamed the test class. tryRecoveringExpiredZKSession() throws ExecutionException now. Summary ------- Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere. I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested. This addresses bug HBASE-5099 . https://issues.apache.org/jira/browse/HBASE-5099 Diffs (updated) src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java PRE-CREATION Diff: https://reviews.apache.org/r/3323/diff Testing ------- mvn -PlocalTests -Dtest=TestMaster* clean test Thanks, Jimmy
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3323/#review4152
          -----------------------------------------------------------

          I like this version.

          Minor comments below.

          src/main/java/org/apache/hadoop/hbase/master/HMaster.java
          <https://reviews.apache.org/r/3323/#comment9377>

          Line should wrap after = sign.

          src/main/java/org/apache/hadoop/hbase/master/HMaster.java
          <https://reviews.apache.org/r/3323/#comment9376>

          The catch block is redundant.

          src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java
          <https://reviews.apache.org/r/3323/#comment9378>

          Javadoc for test class.

          src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java
          <https://reviews.apache.org/r/3323/#comment9379>

          How about giving "hbase.master.zksession.recover.timeout" a much small value for the mini cluster so that the timeout value on line 73 can be lowered ?

          • Ted

          On 2011-12-29 20:41:26, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3323/

          -----------------------------------------------------------

          (Updated 2011-12-29 20:41:26)

          Review request for hbase, Ted Yu and Michael Stack.

          Summary

          -------

          Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.

          I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested.

          This addresses bug HBASE-5099.

          https://issues.apache.org/jira/browse/HBASE-5099

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6

          src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java PRE-CREATION

          Diff: https://reviews.apache.org/r/3323/diff

          Testing

          -------

          mvn -PlocalTests -Dtest=TestMaster* clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3323/#review4152 ----------------------------------------------------------- I like this version. Minor comments below. src/main/java/org/apache/hadoop/hbase/master/HMaster.java < https://reviews.apache.org/r/3323/#comment9377 > Line should wrap after = sign. src/main/java/org/apache/hadoop/hbase/master/HMaster.java < https://reviews.apache.org/r/3323/#comment9376 > The catch block is redundant. src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java < https://reviews.apache.org/r/3323/#comment9378 > Javadoc for test class. src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java < https://reviews.apache.org/r/3323/#comment9379 > How about giving "hbase.master.zksession.recover.timeout" a much small value for the mini cluster so that the timeout value on line 73 can be lowered ? Ted On 2011-12-29 20:41:26, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3323/ ----------------------------------------------------------- (Updated 2011-12-29 20:41:26) Review request for hbase, Ted Yu and Michael Stack. Summary ------- Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere. I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested. This addresses bug HBASE-5099 . https://issues.apache.org/jira/browse/HBASE-5099 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java PRE-CREATION Diff: https://reviews.apache.org/r/3323/diff Testing ------- mvn -PlocalTests -Dtest=TestMaster* clean test Thanks, Jimmy
          Hide
          jiraposter@reviews.apache.org added a comment -

          On 2011-12-29 20:55:31, Ted Yu wrote:

          > src/main/java/org/apache/hadoop/hbase/master/HMaster.java, line 1463

          > <https://reviews.apache.org/r/3323/diff/4/?file=65969#file65969line1463>

          >

          > The catch block is redundant.

          That's right. I forgot it

          On 2011-12-29 20:55:31, Ted Yu wrote:

          > src/main/java/org/apache/hadoop/hbase/master/HMaster.java, line 1453

          > <https://reviews.apache.org/r/3323/diff/4/?file=65969#file65969line1453>

          >

          > Line should wrap after = sign.

          too long? sure.

          On 2011-12-29 20:55:31, Ted Yu wrote:

          > src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java, line 41

          > <https://reviews.apache.org/r/3323/diff/4/?file=65970#file65970line41>

          >

          > How about giving "hbase.master.zksession.recover.timeout" a much small value for the mini cluster so that the timeout value on line 73 can be lowered ?

          sure

          • Jimmy

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3323/#review4152
          -----------------------------------------------------------

          On 2011-12-29 20:41:26, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3323/

          -----------------------------------------------------------

          (Updated 2011-12-29 20:41:26)

          Review request for hbase, Ted Yu and Michael Stack.

          Summary

          -------

          Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.

          I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested.

          This addresses bug HBASE-5099.

          https://issues.apache.org/jira/browse/HBASE-5099

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6

          src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java PRE-CREATION

          Diff: https://reviews.apache.org/r/3323/diff

          Testing

          -------

          mvn -PlocalTests -Dtest=TestMaster* clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - On 2011-12-29 20:55:31, Ted Yu wrote: > src/main/java/org/apache/hadoop/hbase/master/HMaster.java, line 1463 > < https://reviews.apache.org/r/3323/diff/4/?file=65969#file65969line1463 > > > The catch block is redundant. That's right. I forgot it On 2011-12-29 20:55:31, Ted Yu wrote: > src/main/java/org/apache/hadoop/hbase/master/HMaster.java, line 1453 > < https://reviews.apache.org/r/3323/diff/4/?file=65969#file65969line1453 > > > Line should wrap after = sign. too long? sure. On 2011-12-29 20:55:31, Ted Yu wrote: > src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java, line 41 > < https://reviews.apache.org/r/3323/diff/4/?file=65970#file65970line41 > > > How about giving "hbase.master.zksession.recover.timeout" a much small value for the mini cluster so that the timeout value on line 73 can be lowered ? sure Jimmy ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3323/#review4152 ----------------------------------------------------------- On 2011-12-29 20:41:26, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3323/ ----------------------------------------------------------- (Updated 2011-12-29 20:41:26) Review request for hbase, Ted Yu and Michael Stack. Summary ------- Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere. I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested. This addresses bug HBASE-5099 . https://issues.apache.org/jira/browse/HBASE-5099 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java PRE-CREATION Diff: https://reviews.apache.org/r/3323/diff Testing ------- mvn -PlocalTests -Dtest=TestMaster* clean test Thanks, Jimmy
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3323/
          -----------------------------------------------------------

          (Updated 2011-12-29 21:16:22.655431)

          Review request for hbase, Ted Yu and Michael Stack.

          Changes
          -------

          Minor changes per review comments.

          Summary
          -------

          Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.

          I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested.

          This addresses bug HBASE-5099.
          https://issues.apache.org/jira/browse/HBASE-5099

          Diffs (updated)


          src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6
          src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java PRE-CREATION

          Diff: https://reviews.apache.org/r/3323/diff

          Testing
          -------

          mvn -PlocalTests -Dtest=TestMaster* clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3323/ ----------------------------------------------------------- (Updated 2011-12-29 21:16:22.655431) Review request for hbase, Ted Yu and Michael Stack. Changes ------- Minor changes per review comments. Summary ------- Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere. I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested. This addresses bug HBASE-5099 . https://issues.apache.org/jira/browse/HBASE-5099 Diffs (updated) src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java PRE-CREATION Diff: https://reviews.apache.org/r/3323/diff Testing ------- mvn -PlocalTests -Dtest=TestMaster* clean test Thanks, Jimmy
          Hide
          Ted Yu added a comment -

          @Jimmy:
          Do you mind producing a patch for 0.92 ?
          I got the following when I tried to apply patch v5 to 0.92:

          1 out of 3 hunks FAILED -- saving rejects to file src/main/java/org/apache/hadoop/hbase/master/HMaster.java.rej
          

          Also there is no test category in 0.92:

          [ERROR] /Users/zhihyu/92hbase/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java:[27,30] cannot find symbol
          [ERROR] symbol  : class MediumTests
          [ERROR] location: package org.apache.hadoop.hbase
          

          This is a whitespace at line 41 in TestMasterZKSessionRecovery - I can remove it at time of integration.

          Nice work.

          Show
          Ted Yu added a comment - @Jimmy: Do you mind producing a patch for 0.92 ? I got the following when I tried to apply patch v5 to 0.92: 1 out of 3 hunks FAILED -- saving rejects to file src/main/java/org/apache/hadoop/hbase/master/HMaster.java.rej Also there is no test category in 0.92: [ERROR] /Users/zhihyu/92hbase/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java:[27,30] cannot find symbol [ERROR] symbol : class MediumTests [ERROR] location: package org.apache.hadoop.hbase This is a whitespace at line 41 in TestMasterZKSessionRecovery - I can remove it at time of integration. Nice work.
          Hide
          Ted Yu added a comment -

          Will integrate tomorrow morning if I don't get objections.

          Show
          Ted Yu added a comment - Will integrate tomorrow morning if I don't get objections.
          Hide
          Ted Yu added a comment -

          Patch for 0.92 branch where all TestMasterXX tests passed.

          Show
          Ted Yu added a comment - Patch for 0.92 branch where all TestMasterXX tests passed.
          Hide
          Jimmy Xiang added a comment -

          @Ted, thanks!

          Show
          Jimmy Xiang added a comment - @Ted, thanks!
          Hide
          Ted Yu added a comment -

          @Jimmy:
          Can you put the patch on a real cluster and see if the 300 second timeout is good for zk session recovery ?

          Thanks

          Show
          Ted Yu added a comment - @Jimmy: Can you put the patch on a real cluster and see if the 300 second timeout is good for zk session recovery ? Thanks
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3323/#review4154
          -----------------------------------------------------------

          src/main/java/org/apache/hadoop/hbase/master/HMaster.java
          <https://reviews.apache.org/r/3323/#comment9383>

          Should we issue executor.shutdownNow() here if the worker did not finish, in order to attempt to interrupt and long running worker?

          • Lars

          On 2011-12-29 21:16:22, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3323/

          -----------------------------------------------------------

          (Updated 2011-12-29 21:16:22)

          Review request for hbase, Ted Yu and Michael Stack.

          Summary

          -------

          Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.

          I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested.

          This addresses bug HBASE-5099.

          https://issues.apache.org/jira/browse/HBASE-5099

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6

          src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java PRE-CREATION

          Diff: https://reviews.apache.org/r/3323/diff

          Testing

          -------

          mvn -PlocalTests -Dtest=TestMaster* clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3323/#review4154 ----------------------------------------------------------- src/main/java/org/apache/hadoop/hbase/master/HMaster.java < https://reviews.apache.org/r/3323/#comment9383 > Should we issue executor.shutdownNow() here if the worker did not finish, in order to attempt to interrupt and long running worker? Lars On 2011-12-29 21:16:22, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3323/ ----------------------------------------------------------- (Updated 2011-12-29 21:16:22) Review request for hbase, Ted Yu and Michael Stack. Summary ------- Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere. I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested. This addresses bug HBASE-5099 . https://issues.apache.org/jira/browse/HBASE-5099 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java PRE-CREATION Diff: https://reviews.apache.org/r/3323/diff Testing ------- mvn -PlocalTests -Dtest=TestMaster* clean test Thanks, Jimmy
          Hide
          Ted Yu added a comment -

          I think executor.shutdown() is appropriate here since we want the original body of tryRecoveringExpiredZKSession() to complete.
          There is new timeout governing the wait for shutdown to complete.

          Show
          Ted Yu added a comment - I think executor.shutdown() is appropriate here since we want the original body of tryRecoveringExpiredZKSession() to complete. There is new timeout governing the wait for shutdown to complete.
          Hide
          jiraposter@reviews.apache.org added a comment -

          On 2011-12-29 22:29:34, Lars Hofhansl wrote:

          > src/main/java/org/apache/hadoop/hbase/master/HMaster.java, line 1464

          > <https://reviews.apache.org/r/3323/diff/5/?file=65971#file65971line1464>

          >

          > Should we issue executor.shutdownNow() here if the worker did not finish, in order to attempt to interrupt and long running worker?

          Yes, that will be safer.

          • Jimmy

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3323/#review4154
          -----------------------------------------------------------

          On 2011-12-29 21:16:22, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3323/

          -----------------------------------------------------------

          (Updated 2011-12-29 21:16:22)

          Review request for hbase, Ted Yu and Michael Stack.

          Summary

          -------

          Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.

          I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested.

          This addresses bug HBASE-5099.

          https://issues.apache.org/jira/browse/HBASE-5099

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6

          src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java PRE-CREATION

          Diff: https://reviews.apache.org/r/3323/diff

          Testing

          -------

          mvn -PlocalTests -Dtest=TestMaster* clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - On 2011-12-29 22:29:34, Lars Hofhansl wrote: > src/main/java/org/apache/hadoop/hbase/master/HMaster.java, line 1464 > < https://reviews.apache.org/r/3323/diff/5/?file=65971#file65971line1464 > > > Should we issue executor.shutdownNow() here if the worker did not finish, in order to attempt to interrupt and long running worker? Yes, that will be safer. Jimmy ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3323/#review4154 ----------------------------------------------------------- On 2011-12-29 21:16:22, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3323/ ----------------------------------------------------------- (Updated 2011-12-29 21:16:22) Review request for hbase, Ted Yu and Michael Stack. Summary ------- Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere. I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested. This addresses bug HBASE-5099 . https://issues.apache.org/jira/browse/HBASE-5099 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java PRE-CREATION Diff: https://reviews.apache.org/r/3323/diff Testing ------- mvn -PlocalTests -Dtest=TestMaster* clean test Thanks, Jimmy
          Hide
          Lars Hofhansl added a comment -

          Sorry I did not state this clearly. I meant if (and only if) the awaitTerminition times out, we should call shutdownNow() to interrupt the worker (that is presumably still working).

          Like this:

          if (executor.awaitTermination(timeout, TimeUnit.MILLISECONDS)
              && result.isDone()) {
            Boolean recovered = result.get();
            if (recovered != null) {
              return recovered.booleanValue();
            }
          } else {
            executor.shutdownNow();
          }
          return false;
          
          Show
          Lars Hofhansl added a comment - Sorry I did not state this clearly. I meant if (and only if) the awaitTerminition times out, we should call shutdownNow() to interrupt the worker (that is presumably still working). Like this: if (executor.awaitTermination(timeout, TimeUnit.MILLISECONDS) && result.isDone()) { Boolean recovered = result.get(); if (recovered != null ) { return recovered.booleanValue(); } } else { executor.shutdownNow(); } return false ;
          Hide
          Jimmy Xiang added a comment -

          Updated per Lars's comment. It will be safer.

          Show
          Jimmy Xiang added a comment - Updated per Lars's comment. It will be safer.
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3323/
          -----------------------------------------------------------

          (Updated 2011-12-29 22:48:36.725034)

          Review request for hbase, Ted Yu and Michael Stack.

          Summary
          -------

          Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.

          I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested.

          This addresses bug HBASE-5099.
          https://issues.apache.org/jira/browse/HBASE-5099

          Diffs (updated)


          src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6
          src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java PRE-CREATION

          Diff: https://reviews.apache.org/r/3323/diff

          Testing
          -------

          mvn -PlocalTests -Dtest=TestMaster* clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3323/ ----------------------------------------------------------- (Updated 2011-12-29 22:48:36.725034) Review request for hbase, Ted Yu and Michael Stack. Summary ------- Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere. I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested. This addresses bug HBASE-5099 . https://issues.apache.org/jira/browse/HBASE-5099 Diffs (updated) src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java PRE-CREATION Diff: https://reviews.apache.org/r/3323/diff Testing ------- mvn -PlocalTests -Dtest=TestMaster* clean test Thanks, Jimmy
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3323/#review4156
          -----------------------------------------------------------

          Ship it!

          • Lars

          On 2011-12-29 22:48:36, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3323/

          -----------------------------------------------------------

          (Updated 2011-12-29 22:48:36)

          Review request for hbase, Ted Yu and Michael Stack.

          Summary

          -------

          Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere.

          I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested.

          This addresses bug HBASE-5099.

          https://issues.apache.org/jira/browse/HBASE-5099

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6

          src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java PRE-CREATION

          Diff: https://reviews.apache.org/r/3323/diff

          Testing

          -------

          mvn -PlocalTests -Dtest=TestMaster* clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3323/#review4156 ----------------------------------------------------------- Ship it! Lars On 2011-12-29 22:48:36, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3323/ ----------------------------------------------------------- (Updated 2011-12-29 22:48:36) Review request for hbase, Ted Yu and Michael Stack. Summary ------- Per discussion with Ted (on issues), I put up a patch to run tryRecoveringExpiredZKSession() in a separate thread and time it out and fail the recovery if it is stuck somewhere. I added a test to test the abort method. However, for the mini cluster, becomeActiveMaster() doesn't succeed so the abort method ends up always aborted. So the actually success recovery is not tested. This addresses bug HBASE-5099 . https://issues.apache.org/jira/browse/HBASE-5099 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/HMaster.java a5935a6 src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java PRE-CREATION Diff: https://reviews.apache.org/r/3323/diff Testing ------- mvn -PlocalTests -Dtest=TestMaster* clean test Thanks, Jimmy
          Hide
          Jimmy Xiang added a comment -

          There is no harm to call shutdownNow() even if awaitTerminition is not timed out.

          So this should be fine:

          if (executor.awaitTermination(timeout, TimeUnit.MILLISECONDS)
          && result.isDone()) {
          Boolean recovered = result.get();
          if (recovered != null)

          { return recovered.booleanValue(); }

          }
          executor.shutdownNow();

          Show
          Jimmy Xiang added a comment - There is no harm to call shutdownNow() even if awaitTerminition is not timed out. So this should be fine: if (executor.awaitTermination(timeout, TimeUnit.MILLISECONDS) && result.isDone()) { Boolean recovered = result.get(); if (recovered != null) { return recovered.booleanValue(); } } executor.shutdownNow();
          Hide
          Lars Hofhansl added a comment -

          Yep... That's better.

          Show
          Lars Hofhansl added a comment - Yep... That's better.
          Hide
          Ted Yu added a comment -

          Add call to shutdownNow().

          Show
          Ted Yu added a comment - Add call to shutdownNow().
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12508893/hbase-5099-v6.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated -151 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in .

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/634//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/634//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/634//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508893/hbase-5099-v6.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 2 new or modified tests. -1 javadoc. The javadoc tool appears to have generated -151 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/634//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/634//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/634//console This message is automatically generated.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12508893/hbase-5099-v6.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated -151 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.mapred.TestTableMapReduce
          org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/632//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/632//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/632//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508893/hbase-5099-v6.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 2 new or modified tests. -1 javadoc. The javadoc tool appears to have generated -151 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.mapred.TestTableMapReduce org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/632//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/632//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/632//console This message is automatically generated.
          Hide
          Ted Yu added a comment -

          Integrated to 0.92 and TRUNK, respectively.

          Thanks for the patch Jimmy.

          Thanks for the suggestion, Lars.

          Show
          Ted Yu added a comment - Integrated to 0.92 and TRUNK, respectively. Thanks for the patch Jimmy. Thanks for the suggestion, Lars.
          Hide
          Hudson added a comment -

          Integrated in HBase-0.92 #215 (See https://builds.apache.org/job/HBase-0.92/215/)
          HBASE-5099 ZK event thread waiting for root region assignment may block server
          shutdown handler for the region sever the root region was on (Jimmy)

          tedyu :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
          • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java
          Show
          Hudson added a comment - Integrated in HBase-0.92 #215 (See https://builds.apache.org/job/HBase-0.92/215/ ) HBASE-5099 ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on (Jimmy) tedyu : Files : /hbase/branches/0.92/CHANGES.txt /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/HMaster.java /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java
          Hide
          Hudson added a comment -

          Integrated in HBase-TRUNK #2593 (See https://builds.apache.org/job/HBase-TRUNK/2593/)
          HBASE-5099 ZK event thread waiting for root region assignment may block server
          shutdown handler for the region sever the root region was on (Jimmy)

          tedyu :
          Files :

          • /hbase/trunk/CHANGES.txt
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java
          Show
          Hudson added a comment - Integrated in HBase-TRUNK #2593 (See https://builds.apache.org/job/HBase-TRUNK/2593/ ) HBASE-5099 ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on (Jimmy) tedyu : Files : /hbase/trunk/CHANGES.txt /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java
          Hide
          Ted Yu added a comment -

          0.92 Jenkins builds have failed 4 times in a roll.

          TestReplication#queueFailover failed in builds 217 and 218.
          It failed consistently on MacBook as well.

          Rolling back the patches.

          Show
          Ted Yu added a comment - 0.92 Jenkins builds have failed 4 times in a roll. TestReplication#queueFailover failed in builds 217 and 218. It failed consistently on MacBook as well. Rolling back the patches.
          Hide
          Ted Yu added a comment -

          I reverted 0.92 patch.

          Now TestReplication passes on Mac.

          Let's find out if the patch is related to replication test failure or not.

          Keeping TRUNK patch in TRUNK for now since trunk build 2594 passed.

          Show
          Ted Yu added a comment - I reverted 0.92 patch. Now TestReplication passes on Mac. Let's find out if the patch is related to replication test failure or not. Keeping TRUNK patch in TRUNK for now since trunk build 2594 passed.
          Hide
          Jimmy Xiang added a comment -

          TestReplication is flaky. But it works on my ubuntu box.
          Let me take a look.

          Show
          Jimmy Xiang added a comment - TestReplication is flaky. But it works on my ubuntu box. Let me take a look.
          Hide
          Ted Yu added a comment -

          Please read through the test output of 0.92 builds 217 and 218.
          With patch 5099.92, the test failure is reproducible on MacBook.

          Another validation is to deploy patch 5099.92 to real clusters and see if replication works.

          Show
          Ted Yu added a comment - Please read through the test output of 0.92 builds 217 and 218. With patch 5099.92, the test failure is reproducible on MacBook. Another validation is to deploy patch 5099.92 to real clusters and see if replication works.
          Hide
          Jimmy Xiang added a comment -

          I tried to debug this testcase but it doesn't stop at the changes I did.

          Show
          Jimmy Xiang added a comment - I tried to debug this testcase but it doesn't stop at the changes I did.
          Hide
          Ted Yu added a comment -

          Test scripts from HBASE-4480 would be useful in reproducing the test failure.
          You can run TestReplication#queueFailover in a loop (on different OSes).

          Show
          Ted Yu added a comment - Test scripts from HBASE-4480 would be useful in reproducing the test failure. You can run TestReplication#queueFailover in a loop (on different OSes).
          Hide
          Hudson added a comment -

          Integrated in HBase-0.92 #219 (See https://builds.apache.org/job/HBase-0.92/219/)
          HBASE-5099 revert due to continuous 0.92 build failures

          tedyu :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
          • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java
          Show
          Hudson added a comment - Integrated in HBase-0.92 #219 (See https://builds.apache.org/job/HBase-0.92/219/ ) HBASE-5099 revert due to continuous 0.92 build failures tedyu : Files : /hbase/branches/0.92/CHANGES.txt /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/HMaster.java /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java
          Hide
          Jimmy Xiang added a comment -

          TestReplication#queueFailover has a bug that's why it is flaky:

          https://issues.apache.org/jira/browse/HBASE-5112

          Show
          Jimmy Xiang added a comment - TestReplication#queueFailover has a bug that's why it is flaky: https://issues.apache.org/jira/browse/HBASE-5112
          Hide
          Hudson added a comment -

          Integrated in HBase-0.92-security #54 (See https://builds.apache.org/job/HBase-0.92-security/54/)
          HBASE-5099 revert due to continuous 0.92 build failures
          HBASE-5099 ZK event thread waiting for root region assignment may block server
          shutdown handler for the region sever the root region was on (Jimmy)

          tedyu :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
          • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java

          tedyu :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
          • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java
          Show
          Hudson added a comment - Integrated in HBase-0.92-security #54 (See https://builds.apache.org/job/HBase-0.92-security/54/ ) HBASE-5099 revert due to continuous 0.92 build failures HBASE-5099 ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on (Jimmy) tedyu : Files : /hbase/branches/0.92/CHANGES.txt /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/HMaster.java /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java tedyu : Files : /hbase/branches/0.92/CHANGES.txt /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/HMaster.java /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java
          Hide
          Ted Yu added a comment -

          Integrated 5099.92 to 0.92 branch again.

          Show
          Ted Yu added a comment - Integrated 5099.92 to 0.92 branch again.
          Hide
          Hudson added a comment -

          Integrated in HBase-0.92-security #55 (See https://builds.apache.org/job/HBase-0.92-security/55/)
          HBASE-5099 ZK event thread waiting for root region assignment may block server
          shutdown handler for the region sever the root region was on (Jimmy)

          tedyu :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
          • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java
          Show
          Hudson added a comment - Integrated in HBase-0.92-security #55 (See https://builds.apache.org/job/HBase-0.92-security/55/ ) HBASE-5099 ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on (Jimmy) tedyu : Files : /hbase/branches/0.92/CHANGES.txt /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/HMaster.java /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java
          Hide
          Hudson added a comment -

          Integrated in HBase-0.92 #221 (See https://builds.apache.org/job/HBase-0.92/221/)
          HBASE-5099 ZK event thread waiting for root region assignment may block server
          shutdown handler for the region sever the root region was on (Jimmy)

          tedyu :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
          • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java
          Show
          Hudson added a comment - Integrated in HBase-0.92 #221 (See https://builds.apache.org/job/HBase-0.92/221/ ) HBASE-5099 ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on (Jimmy) tedyu : Files : /hbase/branches/0.92/CHANGES.txt /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/HMaster.java /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java
          Hide
          Hudson added a comment -

          Integrated in HBase-TRUNK-security #56 (See https://builds.apache.org/job/HBase-TRUNK-security/56/)
          HBASE-5099 ZK event thread waiting for root region assignment may block server
          shutdown handler for the region sever the root region was on (Jimmy)

          tedyu :
          Files :

          • /hbase/trunk/CHANGES.txt
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java
          Show
          Hudson added a comment - Integrated in HBase-TRUNK-security #56 (See https://builds.apache.org/job/HBase-TRUNK-security/56/ ) HBASE-5099 ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on (Jimmy) tedyu : Files : /hbase/trunk/CHANGES.txt /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java

            People

            • Assignee:
              Jimmy Xiang
              Reporter:
              Jimmy Xiang
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development