HBase
  1. HBase
  2. HBASE-5081

Distributed log splitting deleteNode races against splitLog retry

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.92.0, 0.94.0
    • Fix Version/s: 0.92.0
    • Component/s: wal
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Recently, during 0.92 rc testing, we found distributed log splitting hangs there forever. Please see attached screen shot.
      I looked into it and here is what happened I think:

      1. One rs died, the servershutdownhandler found it out and started the distributed log splitting;
      2. All three tasks failed, so the three tasks were deleted, asynchronously;
      3. Servershutdownhandler retried the log splitting;
      4. During the retrial, it created these three tasks again, and put them in a hashmap (tasks);
      5. The asynchronously deletion in step 2 finally happened for one task, in the callback, it removed one
      task in the hashmap;
      6. One of the newly submitted tasks' zookeeper watcher found out that task is unassigned, and it is not
      in the hashmap, so it created a new orphan task.
      7. All three tasks failed, but that task created in step 6 is an orphan so the batch.err counter was one short,
      so the log splitting hangs there and keeps waiting for the last task to finish which is never going to happen.

      So I think the problem is step 2. The fix is to make deletion sync, instead of async, so that the retry will have
      a clean start.

      Async deleteNode will mess up with split log retrial. In extreme situation, if async deleteNode doesn't happen
      soon enough, some node created during the retrial could be deleted.

      deleteNode should be sync.

      1. distributed-log-splitting-screenshot.png
        58 kB
        Jimmy Xiang
      2. patch_for_92.txt
        1 kB
        Jimmy Xiang
      3. patch_for_92_v2.txt
        2 kB
        Jimmy Xiang
      4. patch_for_92_v3.txt
        2 kB
        Jimmy Xiang
      5. hbase-5081_patch_for_92_v4.txt
        2 kB
        Jimmy Xiang
      6. hbase-5081_patch_v5.txt
        2 kB
        Jimmy Xiang
      7. hbase-5081-patch-v6.txt
        2 kB
        Jimmy Xiang
      8. hbase-5081-patch-v7.txt
        4 kB
        Jimmy Xiang
      9. 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch
        29 kB
        Prakash Khemani
      10. 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch
        29 kB
        Prakash Khemani
      11. 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch
        29 kB
        Prakash Khemani
      12. 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch
        32 kB
        Prakash Khemani
      13. 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch
        33 kB
        Prakash Khemani
      14. 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch
        34 kB
        Prakash Khemani
      15. 5081-deleteNode-with-while-loop.txt
        31 kB
        Ted Yu
      16. 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch
        38 kB
        Prakash Khemani
      17. HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch
        37 kB
        Ted Yu
      18. distributed_log_splitting_screen_shot2.png
        125 kB
        Jimmy Xiang
      19. distributed_log_splitting_screenshot3.png
        286 kB
        Jimmy Xiang

        Issue Links

          Activity

          Hide
          Ted Yu added a comment -

          Nice analysis.

          Show
          Ted Yu added a comment - Nice analysis.
          Hide
          stack added a comment -

          Or we need a means of uniquely identifying the failed split in the hashmap so when the callback runs, it only removes the pertinent tasks if present; i.e. the filename alone is not enough?

          Otherwise, sounds good Jimmy. Good find.

          Show
          stack added a comment - Or we need a means of uniquely identifying the failed split in the hashmap so when the callback runs, it only removes the pertinent tasks if present; i.e. the filename alone is not enough? Otherwise, sounds good Jimmy. Good find.
          Hide
          Ted Yu added a comment -

          To expedite the release of 0.92, I think we can make deleteNode synchronous first.

          Once distributed log splitting works robustly, we can implement async deleteNode in a follow on JIRA.

          Show
          Ted Yu added a comment - To expedite the release of 0.92, I think we can make deleteNode synchronous first. Once distributed log splitting works robustly, we can implement async deleteNode in a follow on JIRA.
          Hide
          Lars Hofhansl added a comment -

          +1 on synchronous short term solution. Long term (trunk?) we can do what stack says.
          Wanna take a stab at a patch Jimmy?

          Show
          Lars Hofhansl added a comment - +1 on synchronous short term solution. Long term (trunk?) we can do what stack says. Wanna take a stab at a patch Jimmy?
          Hide
          Jimmy Xiang added a comment -

          I am working on a path now. I think synchronous deleteNode is clean. It will give retry a fresh start.
          But it may take a while if there are too many files. Yes, for long term, we can think about how to do what stack says.

          Show
          Jimmy Xiang added a comment - I am working on a path now. I think synchronous deleteNode is clean. It will give retry a fresh start. But it may take a while if there are too many files. Yes, for long term, we can think about how to do what stack says.
          Hide
          Jimmy Xiang added a comment -

          Can we deleteNode only if it is successfully done? If it is not completed, let the node stay there. In this case, when the retry happens, it should see the old node there, but it is ok. The new task in the hashmap won't be deleted either.

          Show
          Jimmy Xiang added a comment - Can we deleteNode only if it is successfully done? If it is not completed, let the node stay there. In this case, when the retry happens, it should see the old node there, but it is ok. The new task in the hashmap won't be deleted either.
          Hide
          Jimmy Xiang added a comment -

          This is a patch for 92.

          Show
          Jimmy Xiang added a comment - This is a patch for 92.
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3292/
          -----------------------------------------------------------

          Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl.

          Summary
          -------

          In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem.

          It used to delete the node always.

          This addresses bug HBASE-5081.
          https://issues.apache.org/jira/browse/HBASE-5081

          Diffs


          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 7b7316f

          Diff: https://reviews.apache.org/r/3292/diff

          Testing
          -------

          mvn -Dtest=TestDistributedLogSplitting clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ ----------------------------------------------------------- Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Summary ------- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081 . https://issues.apache.org/jira/browse/HBASE-5081 Diffs src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 7b7316f Diff: https://reviews.apache.org/r/3292/diff Testing ------- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy
          Hide
          Jimmy Xiang added a comment -

          Updated the comments.

          Show
          Jimmy Xiang added a comment - Updated the comments.
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3292/#review4046
          -----------------------------------------------------------

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          <https://reviews.apache.org/r/3292/#comment9151>

          White space on end and 'is' should be 'if'

          • Michael

          On 2011-12-21 17:01:22, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3292/

          -----------------------------------------------------------

          (Updated 2011-12-21 17:01:22)

          Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl.

          Summary

          -------

          In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem.

          It used to delete the node always.

          This addresses bug HBASE-5081.

          https://issues.apache.org/jira/browse/HBASE-5081

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 7b7316f

          Diff: https://reviews.apache.org/r/3292/diff

          Testing

          -------

          mvn -Dtest=TestDistributedLogSplitting clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4046 ----------------------------------------------------------- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java < https://reviews.apache.org/r/3292/#comment9151 > White space on end and 'is' should be 'if' Michael On 2011-12-21 17:01:22, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ ----------------------------------------------------------- (Updated 2011-12-21 17:01:22) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Summary ------- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081 . https://issues.apache.org/jira/browse/HBASE-5081 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 7b7316f Diff: https://reviews.apache.org/r/3292/diff Testing ------- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3292/#review4045
          -----------------------------------------------------------

          Ship it!

          Looks good to me. How I know it works? Would it be hard to write a test?

          • Michael

          On 2011-12-21 17:01:22, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3292/

          -----------------------------------------------------------

          (Updated 2011-12-21 17:01:22)

          Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl.

          Summary

          -------

          In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem.

          It used to delete the node always.

          This addresses bug HBASE-5081.

          https://issues.apache.org/jira/browse/HBASE-5081

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 7b7316f

          Diff: https://reviews.apache.org/r/3292/diff

          Testing

          -------

          mvn -Dtest=TestDistributedLogSplitting clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4045 ----------------------------------------------------------- Ship it! Looks good to me. How I know it works? Would it be hard to write a test? Michael On 2011-12-21 17:01:22, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ ----------------------------------------------------------- (Updated 2011-12-21 17:01:22) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Summary ------- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081 . https://issues.apache.org/jira/browse/HBASE-5081 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 7b7316f Diff: https://reviews.apache.org/r/3292/diff Testing ------- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3292/
          -----------------------------------------------------------

          (Updated 2011-12-21 17:12:36.185307)

          Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl.

          Changes
          -------

          Fixed the comments.

          Summary
          -------

          In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem.

          It used to delete the node always.

          This addresses bug HBASE-5081.
          https://issues.apache.org/jira/browse/HBASE-5081

          Diffs (updated)


          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 7b7316f

          Diff: https://reviews.apache.org/r/3292/diff

          Testing
          -------

          mvn -Dtest=TestDistributedLogSplitting clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ ----------------------------------------------------------- (Updated 2011-12-21 17:12:36.185307) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Changes ------- Fixed the comments. Summary ------- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081 . https://issues.apache.org/jira/browse/HBASE-5081 Diffs (updated) src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 7b7316f Diff: https://reviews.apache.org/r/3292/diff Testing ------- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy
          Hide
          jiraposter@reviews.apache.org added a comment -

          On 2011-12-21 17:05:47, Michael Stack wrote:

          > Looks good to me. How I know it works? Would it be hard to write a test?

          This is a race issue hard to reproduce. Let me think about how to come up a unit test.

          • Jimmy

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3292/#review4045
          -----------------------------------------------------------

          On 2011-12-21 17:12:36, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3292/

          -----------------------------------------------------------

          (Updated 2011-12-21 17:12:36)

          Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl.

          Summary

          -------

          In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem.

          It used to delete the node always.

          This addresses bug HBASE-5081.

          https://issues.apache.org/jira/browse/HBASE-5081

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 7b7316f

          Diff: https://reviews.apache.org/r/3292/diff

          Testing

          -------

          mvn -Dtest=TestDistributedLogSplitting clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - On 2011-12-21 17:05:47, Michael Stack wrote: > Looks good to me. How I know it works? Would it be hard to write a test? This is a race issue hard to reproduce. Let me think about how to come up a unit test. Jimmy ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4045 ----------------------------------------------------------- On 2011-12-21 17:12:36, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ ----------------------------------------------------------- (Updated 2011-12-21 17:12:36) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Summary ------- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081 . https://issues.apache.org/jira/browse/HBASE-5081 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 7b7316f Diff: https://reviews.apache.org/r/3292/diff Testing ------- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3292/#review4050
          -----------------------------------------------------------

          Interesting idea. Let us know result of testing in real cluster.
          Since the zk node may live across cluster restart, we should verify the change works in that scenario.

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          <https://reviews.apache.org/r/3292/#comment9158>

          I think the new comments should be lifted to line 353 for clarity.

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          <https://reviews.apache.org/r/3292/#comment9156>

          Should read 'if the task failed'

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          <https://reviews.apache.org/r/3292/#comment9157>

          Indentation should be 2 spaces.

          • Ted

          On 2011-12-21 17:12:36, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3292/

          -----------------------------------------------------------

          (Updated 2011-12-21 17:12:36)

          Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl.

          Summary

          -------

          In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem.

          It used to delete the node always.

          This addresses bug HBASE-5081.

          https://issues.apache.org/jira/browse/HBASE-5081

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 7b7316f

          Diff: https://reviews.apache.org/r/3292/diff

          Testing

          -------

          mvn -Dtest=TestDistributedLogSplitting clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4050 ----------------------------------------------------------- Interesting idea. Let us know result of testing in real cluster. Since the zk node may live across cluster restart, we should verify the change works in that scenario. src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java < https://reviews.apache.org/r/3292/#comment9158 > I think the new comments should be lifted to line 353 for clarity. src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java < https://reviews.apache.org/r/3292/#comment9156 > Should read 'if the task failed' src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java < https://reviews.apache.org/r/3292/#comment9157 > Indentation should be 2 spaces. Ted On 2011-12-21 17:12:36, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ ----------------------------------------------------------- (Updated 2011-12-21 17:12:36) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Summary ------- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081 . https://issues.apache.org/jira/browse/HBASE-5081 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 7b7316f Diff: https://reviews.apache.org/r/3292/diff Testing ------- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3292/
          -----------------------------------------------------------

          (Updated 2011-12-21 17:40:59.028814)

          Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl.

          Changes
          -------

          Fixed the indentation issue, added more comment.

          Summary
          -------

          In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem.

          It used to delete the node always.

          This addresses bug HBASE-5081.
          https://issues.apache.org/jira/browse/HBASE-5081

          Diffs (updated)


          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1

          Diff: https://reviews.apache.org/r/3292/diff

          Testing
          -------

          mvn -Dtest=TestDistributedLogSplitting clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ ----------------------------------------------------------- (Updated 2011-12-21 17:40:59.028814) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Changes ------- Fixed the indentation issue, added more comment. Summary ------- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081 . https://issues.apache.org/jira/browse/HBASE-5081 Diffs (updated) src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1 Diff: https://reviews.apache.org/r/3292/diff Testing ------- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy
          Hide
          Jimmy Xiang added a comment -

          The patch is for both 0.92 and 0.94 actually.

          Show
          Jimmy Xiang added a comment - The patch is for both 0.92 and 0.94 actually.
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3292/#review4052
          -----------------------------------------------------------

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          <https://reviews.apache.org/r/3292/#comment9159>

          Should read 'Asynchronously deleting'

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          <https://reviews.apache.org/r/3292/#comment9160>

          Should read 'if the task failed and was not an orphan'

          • Ted

          On 2011-12-21 17:40:59, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3292/

          -----------------------------------------------------------

          (Updated 2011-12-21 17:40:59)

          Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl.

          Summary

          -------

          In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem.

          It used to delete the node always.

          This addresses bug HBASE-5081.

          https://issues.apache.org/jira/browse/HBASE-5081

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1

          Diff: https://reviews.apache.org/r/3292/diff

          Testing

          -------

          mvn -Dtest=TestDistributedLogSplitting clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4052 ----------------------------------------------------------- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java < https://reviews.apache.org/r/3292/#comment9159 > Should read 'Asynchronously deleting' src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java < https://reviews.apache.org/r/3292/#comment9160 > Should read 'if the task failed and was not an orphan' Ted On 2011-12-21 17:40:59, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ ----------------------------------------------------------- (Updated 2011-12-21 17:40:59) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Summary ------- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081 . https://issues.apache.org/jira/browse/HBASE-5081 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1 Diff: https://reviews.apache.org/r/3292/diff Testing ------- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3292/
          -----------------------------------------------------------

          (Updated 2011-12-21 18:06:55.460515)

          Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl.

          Changes
          -------

          Changed some comments

          Summary
          -------

          In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem.

          It used to delete the node always.

          This addresses bug HBASE-5081.
          https://issues.apache.org/jira/browse/HBASE-5081

          Diffs (updated)


          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1

          Diff: https://reviews.apache.org/r/3292/diff

          Testing
          -------

          mvn -Dtest=TestDistributedLogSplitting clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ ----------------------------------------------------------- (Updated 2011-12-21 18:06:55.460515) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Changes ------- Changed some comments Summary ------- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081 . https://issues.apache.org/jira/browse/HBASE-5081 Diffs (updated) src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1 Diff: https://reviews.apache.org/r/3292/diff Testing ------- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy
          Hide
          jiraposter@reviews.apache.org added a comment -

          On 2011-12-21 17:44:53, Ted Yu wrote:

          > src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java, line 368

          > <https://reviews.apache.org/r/3292/diff/4/?file=65674#file65674line368>

          >

          > Should read 'if the task failed and was not an orphan'

          The original is fine.

          • Jimmy

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3292/#review4052
          -----------------------------------------------------------

          On 2011-12-21 18:06:55, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3292/

          -----------------------------------------------------------

          (Updated 2011-12-21 18:06:55)

          Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl.

          Summary

          -------

          In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem.

          It used to delete the node always.

          This addresses bug HBASE-5081.

          https://issues.apache.org/jira/browse/HBASE-5081

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1

          Diff: https://reviews.apache.org/r/3292/diff

          Testing

          -------

          mvn -Dtest=TestDistributedLogSplitting clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - On 2011-12-21 17:44:53, Ted Yu wrote: > src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java, line 368 > < https://reviews.apache.org/r/3292/diff/4/?file=65674#file65674line368 > > > Should read 'if the task failed and was not an orphan' The original is fine. Jimmy ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4052 ----------------------------------------------------------- On 2011-12-21 18:06:55, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ ----------------------------------------------------------- (Updated 2011-12-21 18:06:55) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Summary ------- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081 . https://issues.apache.org/jira/browse/HBASE-5081 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1 Diff: https://reviews.apache.org/r/3292/diff Testing ------- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy
          Hide
          Ted Yu added a comment -

          I can reproduce the following test failures on MacBook:

          Failed tests:   testTaskErr(org.apache.hadoop.hbase.master.TestSplitLogManager)
            testUnassignedTimeout(org.apache.hadoop.hbase.master.TestSplitLogManager)
          

          Please adjust TestSplitLogManager accordingly.

          Show
          Ted Yu added a comment - I can reproduce the following test failures on MacBook: Failed tests: testTaskErr(org.apache.hadoop.hbase.master.TestSplitLogManager) testUnassignedTimeout(org.apache.hadoop.hbase.master.TestSplitLogManager) Please adjust TestSplitLogManager accordingly.
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3292/#review4056
          -----------------------------------------------------------

          Ship it!

          Looks good to me. Nice simple fix.
          According to the jira testTaskErr is failing, probably needs to be adjusted.

          • Lars

          On 2011-12-21 18:06:55, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3292/

          -----------------------------------------------------------

          (Updated 2011-12-21 18:06:55)

          Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl.

          Summary

          -------

          In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem.

          It used to delete the node always.

          This addresses bug HBASE-5081.

          https://issues.apache.org/jira/browse/HBASE-5081

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1

          Diff: https://reviews.apache.org/r/3292/diff

          Testing

          -------

          mvn -Dtest=TestDistributedLogSplitting clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4056 ----------------------------------------------------------- Ship it! Looks good to me. Nice simple fix. According to the jira testTaskErr is failing, probably needs to be adjusted. Lars On 2011-12-21 18:06:55, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ ----------------------------------------------------------- (Updated 2011-12-21 18:06:55) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Summary ------- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081 . https://issues.apache.org/jira/browse/HBASE-5081 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1 Diff: https://reviews.apache.org/r/3292/diff Testing ------- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy
          Hide
          Jimmy Xiang added a comment -

          I am thinking about sync delete for the failure case. What do you think?

          I am adjusting the test failure now.

          Show
          Jimmy Xiang added a comment - I am thinking about sync delete for the failure case. What do you think? I am adjusting the test failure now.
          Hide
          Jimmy Xiang added a comment -

          Fixed the unit test.

          Show
          Jimmy Xiang added a comment - Fixed the unit test.
          Hide
          Ted Yu added a comment -

          TestSplitLogManager passed with patch v6.

          Please upload patch v6 to review board.

          +    assertTrue(ZKUtil.checkExists(zkw, tasknode) != -1);
          

          The above assertion can be satisfied when there is no race, right ?

          Show
          Ted Yu added a comment - TestSplitLogManager passed with patch v6. Please upload patch v6 to review board. + assertTrue(ZKUtil.checkExists(zkw, tasknode) != -1); The above assertion can be satisfied when there is no race, right ?
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3292/
          -----------------------------------------------------------

          (Updated 2011-12-21 21:30:56.293535)

          Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl.

          Changes
          -------

          This one adjusted the TestSplitLogManager since we don't delete node in case of task failure.

          Summary
          -------

          In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem.

          It used to delete the node always.

          This addresses bug HBASE-5081.
          https://issues.apache.org/jira/browse/HBASE-5081

          Diffs (updated)


          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 7b7316f
          src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 5c9d7dd

          Diff: https://reviews.apache.org/r/3292/diff

          Testing
          -------

          mvn -Dtest=TestDistributedLogSplitting clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ ----------------------------------------------------------- (Updated 2011-12-21 21:30:56.293535) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Changes ------- This one adjusted the TestSplitLogManager since we don't delete node in case of task failure. Summary ------- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081 . https://issues.apache.org/jira/browse/HBASE-5081 Diffs (updated) src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 7b7316f src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 5c9d7dd Diff: https://reviews.apache.org/r/3292/diff Testing ------- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy
          Hide
          Jimmy Xiang added a comment -

          Upload patch v6 to review board.

          + assertTrue(ZKUtil.checkExists(zkw, tasknode) != -1);

          The original is this:

          + assertTrue(ZKUtil.checkExists(zkw, tasknode) == -1);

          This assertion can be satisfied when there is no race without this patch.
          With this patch, we don't delete any failed task node. So the node should be
          there now.

          I am still working on a patch to delete the node synchronously in this scenario.

          Show
          Jimmy Xiang added a comment - Upload patch v6 to review board. + assertTrue(ZKUtil.checkExists(zkw, tasknode) != -1); The original is this: + assertTrue(ZKUtil.checkExists(zkw, tasknode) == -1); This assertion can be satisfied when there is no race without this patch. With this patch, we don't delete any failed task node. So the node should be there now. I am still working on a patch to delete the node synchronously in this scenario.
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3292/#review4059
          -----------------------------------------------------------

          Test change looks good to me.

          • Lars

          On 2011-12-21 21:30:56, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3292/

          -----------------------------------------------------------

          (Updated 2011-12-21 21:30:56)

          Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl.

          Summary

          -------

          In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem.

          It used to delete the node always.

          This addresses bug HBASE-5081.

          https://issues.apache.org/jira/browse/HBASE-5081

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 7b7316f

          src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 5c9d7dd

          Diff: https://reviews.apache.org/r/3292/diff

          Testing

          -------

          mvn -Dtest=TestDistributedLogSplitting clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4059 ----------------------------------------------------------- Test change looks good to me. Lars On 2011-12-21 21:30:56, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ ----------------------------------------------------------- (Updated 2011-12-21 21:30:56) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Summary ------- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081 . https://issues.apache.org/jira/browse/HBASE-5081 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 7b7316f src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 5c9d7dd Diff: https://reviews.apache.org/r/3292/diff Testing ------- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy
          Hide
          Ted Yu added a comment -

          Do we have a test case where zk nodes (including the failed task node) survive cluster restart and distributed log splitting successfully finishes after cluster restart ?

          If we use patch v6 instead of the synchronous node deletion patch (upcoming), we should check the value of resubmit_threshold. If the value is 0, we should still delete the failed task node because there would be no resubmission.

          Show
          Ted Yu added a comment - Do we have a test case where zk nodes (including the failed task node) survive cluster restart and distributed log splitting successfully finishes after cluster restart ? If we use patch v6 instead of the synchronous node deletion patch (upcoming), we should check the value of resubmit_threshold. If the value is 0, we should still delete the failed task node because there would be no resubmission.
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3292/
          -----------------------------------------------------------

          (Updated 2011-12-22 00:25:32.299870)

          Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl.

          Changes
          -------

          This patch deletes the failed task node synchronously. I also modified a unit test to test a scenario where a failed task is not deleted properly.

          Summary
          -------

          In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem.

          It used to delete the node always.

          This addresses bug HBASE-5081.
          https://issues.apache.org/jira/browse/HBASE-5081

          Diffs (updated)


          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1
          src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8

          Diff: https://reviews.apache.org/r/3292/diff

          Testing
          -------

          mvn -Dtest=TestDistributedLogSplitting clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ ----------------------------------------------------------- (Updated 2011-12-22 00:25:32.299870) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Changes ------- This patch deletes the failed task node synchronously. I also modified a unit test to test a scenario where a failed task is not deleted properly. Summary ------- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081 . https://issues.apache.org/jira/browse/HBASE-5081 Diffs (updated) src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1 src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8 Diff: https://reviews.apache.org/r/3292/diff Testing ------- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3292/
          -----------------------------------------------------------

          (Updated 2011-12-22 00:31:23.167537)

          Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl.

          Changes
          -------

          Removed whitespaces.

          Summary
          -------

          In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem.

          It used to delete the node always.

          This addresses bug HBASE-5081.
          https://issues.apache.org/jira/browse/HBASE-5081

          Diffs (updated)


          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1
          src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8

          Diff: https://reviews.apache.org/r/3292/diff

          Testing
          -------

          mvn -Dtest=TestDistributedLogSplitting clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ ----------------------------------------------------------- (Updated 2011-12-22 00:31:23.167537) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Changes ------- Removed whitespaces. Summary ------- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081 . https://issues.apache.org/jira/browse/HBASE-5081 Diffs (updated) src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1 src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8 Diff: https://reviews.apache.org/r/3292/diff Testing ------- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy
          Hide
          jiraposter@reviews.apache.org added a comment -

          On 2011-12-21 17:05:47, Michael Stack wrote:

          > Looks good to me. How I know it works? Would it be hard to write a test?

          Jimmy Xiang wrote:

          This is a race issue hard to reproduce. Let me think about how to come up a unit test.

          Jimmy. Please paste your last patch to JIRA. I'd like to include in our 0.92RC though it doesn't have a test.

          • Michael

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3292/#review4045
          -----------------------------------------------------------

          On 2011-12-22 00:31:23, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3292/

          -----------------------------------------------------------

          (Updated 2011-12-22 00:31:23)

          Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl.

          Summary

          -------

          In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem.

          It used to delete the node always.

          This addresses bug HBASE-5081.

          https://issues.apache.org/jira/browse/HBASE-5081

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1

          src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8

          Diff: https://reviews.apache.org/r/3292/diff

          Testing

          -------

          mvn -Dtest=TestDistributedLogSplitting clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - On 2011-12-21 17:05:47, Michael Stack wrote: > Looks good to me. How I know it works? Would it be hard to write a test? Jimmy Xiang wrote: This is a race issue hard to reproduce. Let me think about how to come up a unit test. Jimmy. Please paste your last patch to JIRA. I'd like to include in our 0.92RC though it doesn't have a test. Michael ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4045 ----------------------------------------------------------- On 2011-12-22 00:31:23, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ ----------------------------------------------------------- (Updated 2011-12-22 00:31:23) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Summary ------- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081 . https://issues.apache.org/jira/browse/HBASE-5081 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1 src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8 Diff: https://reviews.apache.org/r/3292/diff Testing ------- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy
          Hide
          jiraposter@reviews.apache.org added a comment -

          On 2011-12-21 17:05:47, Michael Stack wrote:

          > Looks good to me. How I know it works? Would it be hard to write a test?

          Jimmy Xiang wrote:

          This is a race issue hard to reproduce. Let me think about how to come up a unit test.

          Michael Stack wrote:

          Jimmy. Please paste your last patch to JIRA. I'd like to include in our 0.92RC though it doesn't have a test.

          Yes, it is in Jira: hbase-5081-patch-v7.txt
          Thanks.

          • Jimmy

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3292/#review4045
          -----------------------------------------------------------

          On 2011-12-22 00:31:23, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3292/

          -----------------------------------------------------------

          (Updated 2011-12-22 00:31:23)

          Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl.

          Summary

          -------

          In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem.

          It used to delete the node always.

          This addresses bug HBASE-5081.

          https://issues.apache.org/jira/browse/HBASE-5081

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1

          src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8

          Diff: https://reviews.apache.org/r/3292/diff

          Testing

          -------

          mvn -Dtest=TestDistributedLogSplitting clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - On 2011-12-21 17:05:47, Michael Stack wrote: > Looks good to me. How I know it works? Would it be hard to write a test? Jimmy Xiang wrote: This is a race issue hard to reproduce. Let me think about how to come up a unit test. Michael Stack wrote: Jimmy. Please paste your last patch to JIRA. I'd like to include in our 0.92RC though it doesn't have a test. Yes, it is in Jira: hbase-5081-patch-v7.txt Thanks. Jimmy ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4045 ----------------------------------------------------------- On 2011-12-22 00:31:23, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ ----------------------------------------------------------- (Updated 2011-12-22 00:31:23) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Summary ------- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081 . https://issues.apache.org/jira/browse/HBASE-5081 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1 src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8 Diff: https://reviews.apache.org/r/3292/diff Testing ------- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3292/#review4064
          -----------------------------------------------------------

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          <https://reviews.apache.org/r/3292/#comment9163>

          Did the first approach not work?

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          <https://reviews.apache.org/r/3292/#comment9164>

          It seems we can do this unconditionally now (no need for the safeToDeleteNodeAsync flag). The worst scenario is trying to remove a node that has already been removed

          • Lars

          On 2011-12-22 00:31:23, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3292/

          -----------------------------------------------------------

          (Updated 2011-12-22 00:31:23)

          Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl.

          Summary

          -------

          In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem.

          It used to delete the node always.

          This addresses bug HBASE-5081.

          https://issues.apache.org/jira/browse/HBASE-5081

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1

          src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8

          Diff: https://reviews.apache.org/r/3292/diff

          Testing

          -------

          mvn -Dtest=TestDistributedLogSplitting clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4064 ----------------------------------------------------------- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java < https://reviews.apache.org/r/3292/#comment9163 > Did the first approach not work? src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java < https://reviews.apache.org/r/3292/#comment9164 > It seems we can do this unconditionally now (no need for the safeToDeleteNodeAsync flag). The worst scenario is trying to remove a node that has already been removed Lars On 2011-12-22 00:31:23, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ ----------------------------------------------------------- (Updated 2011-12-22 00:31:23) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Summary ------- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081 . https://issues.apache.org/jira/browse/HBASE-5081 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1 src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8 Diff: https://reviews.apache.org/r/3292/diff Testing ------- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy
          Hide
          stack added a comment -

          Committed branch and trunk. Thanks for the patch Jimmy.

          Show
          stack added a comment - Committed branch and trunk. Thanks for the patch Jimmy.
          Hide
          jiraposter@reviews.apache.org added a comment -

          On 2011-12-22 00:58:56, Lars Hofhansl wrote:

          > src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java, line 373

          > <https://reviews.apache.org/r/3292/diff/8/?file=65682#file65682line373>

          >

          > It seems we can do this unconditionally now (no need for the safeToDeleteNodeAsync flag). The worst scenario is trying to remove a node that has already been removed

          If the failed node is not removed at the beginning, we could run into the same race issue again.

          On 2011-12-22 00:58:56, Lars Hofhansl wrote:

          > src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java, line 354

          > <https://reviews.apache.org/r/3292/diff/8/?file=65682#file65682line354>

          >

          > Did the first approach not work?

          The first approach will not hang the splitLog method, but the failed tasks won't be actually tried again since the state stays in TaskState.TASK_ERR. We do need to delete those nodes unless we put different data in zookeeper as Stack suggested.

          • Jimmy

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3292/#review4064
          -----------------------------------------------------------

          On 2011-12-22 00:31:23, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3292/

          -----------------------------------------------------------

          (Updated 2011-12-22 00:31:23)

          Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl.

          Summary

          -------

          In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem.

          It used to delete the node always.

          This addresses bug HBASE-5081.

          https://issues.apache.org/jira/browse/HBASE-5081

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1

          src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8

          Diff: https://reviews.apache.org/r/3292/diff

          Testing

          -------

          mvn -Dtest=TestDistributedLogSplitting clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - On 2011-12-22 00:58:56, Lars Hofhansl wrote: > src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java, line 373 > < https://reviews.apache.org/r/3292/diff/8/?file=65682#file65682line373 > > > It seems we can do this unconditionally now (no need for the safeToDeleteNodeAsync flag). The worst scenario is trying to remove a node that has already been removed If the failed node is not removed at the beginning, we could run into the same race issue again. On 2011-12-22 00:58:56, Lars Hofhansl wrote: > src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java, line 354 > < https://reviews.apache.org/r/3292/diff/8/?file=65682#file65682line354 > > > Did the first approach not work? The first approach will not hang the splitLog method, but the failed tasks won't be actually tried again since the state stays in TaskState.TASK_ERR. We do need to delete those nodes unless we put different data in zookeeper as Stack suggested. Jimmy ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4064 ----------------------------------------------------------- On 2011-12-22 00:31:23, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ ----------------------------------------------------------- (Updated 2011-12-22 00:31:23) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Summary ------- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081 . https://issues.apache.org/jira/browse/HBASE-5081 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1 src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8 Diff: https://reviews.apache.org/r/3292/diff Testing ------- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy
          Hide
          stack added a comment -

          Reopening.

          I committed though there was review still going on. Backed out change from trunk and 0.92.

          Show
          stack added a comment - Reopening. I committed though there was review still going on. Backed out change from trunk and 0.92.
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3292/#review4067
          -----------------------------------------------------------

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          <https://reviews.apache.org/r/3292/#comment9167>

          The deletion is immediate. Should this counter be incremented ?

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          <https://reviews.apache.org/r/3292/#comment9168>

          I think we should be more cautious because RecoverableZooKeeper has attempted retry.
          ke should be rethrown.

          • Ted

          On 2011-12-22 00:31:23, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3292/

          -----------------------------------------------------------

          (Updated 2011-12-22 00:31:23)

          Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl.

          Summary

          -------

          In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem.

          It used to delete the node always.

          This addresses bug HBASE-5081.

          https://issues.apache.org/jira/browse/HBASE-5081

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1

          src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8

          Diff: https://reviews.apache.org/r/3292/diff

          Testing

          -------

          mvn -Dtest=TestDistributedLogSplitting clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4067 ----------------------------------------------------------- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java < https://reviews.apache.org/r/3292/#comment9167 > The deletion is immediate. Should this counter be incremented ? src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java < https://reviews.apache.org/r/3292/#comment9168 > I think we should be more cautious because RecoverableZooKeeper has attempted retry. ke should be rethrown. Ted On 2011-12-22 00:31:23, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ ----------------------------------------------------------- (Updated 2011-12-22 00:31:23) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Summary ------- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081 . https://issues.apache.org/jira/browse/HBASE-5081 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1 src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8 Diff: https://reviews.apache.org/r/3292/diff Testing ------- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy
          Hide
          jiraposter@reviews.apache.org added a comment -

          On 2011-12-22 02:23:19, Ted Yu wrote:

          > src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java, line 380

          > <https://reviews.apache.org/r/3292/diff/8/?file=65682#file65682line380>

          >

          > The deletion is immediate. Should this counter be incremented ?

          Yes, so that we can track how many deletions succeed, how many fail.

          On 2011-12-22 02:23:19, Ted Yu wrote:

          > src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java, line 386

          > <https://reviews.apache.org/r/3292/diff/8/?file=65682#file65682line386>

          >

          > I think we should be more cautious because RecoverableZooKeeper has attempted retry.

          > ke should be rethrown.

          In this case, we should not re-throw it actually. In the corresponding asynchronous deleteNode method, it doesn't throw KeeperException either. It just logs the failure.

          • Jimmy

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3292/#review4067
          -----------------------------------------------------------

          On 2011-12-22 00:31:23, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3292/

          -----------------------------------------------------------

          (Updated 2011-12-22 00:31:23)

          Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl.

          Summary

          -------

          In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem.

          It used to delete the node always.

          This addresses bug HBASE-5081.

          https://issues.apache.org/jira/browse/HBASE-5081

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1

          src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8

          Diff: https://reviews.apache.org/r/3292/diff

          Testing

          -------

          mvn -Dtest=TestDistributedLogSplitting clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - On 2011-12-22 02:23:19, Ted Yu wrote: > src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java, line 380 > < https://reviews.apache.org/r/3292/diff/8/?file=65682#file65682line380 > > > The deletion is immediate. Should this counter be incremented ? Yes, so that we can track how many deletions succeed, how many fail. On 2011-12-22 02:23:19, Ted Yu wrote: > src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java, line 386 > < https://reviews.apache.org/r/3292/diff/8/?file=65682#file65682line386 > > > I think we should be more cautious because RecoverableZooKeeper has attempted retry. > ke should be rethrown. In this case, we should not re-throw it actually. In the corresponding asynchronous deleteNode method, it doesn't throw KeeperException either. It just logs the failure. Jimmy ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4067 ----------------------------------------------------------- On 2011-12-22 00:31:23, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ ----------------------------------------------------------- (Updated 2011-12-22 00:31:23) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Summary ------- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081 . https://issues.apache.org/jira/browse/HBASE-5081 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1 src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8 Diff: https://reviews.apache.org/r/3292/diff Testing ------- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3292/#review4069
          -----------------------------------------------------------

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          <https://reviews.apache.org/r/3292/#comment9171>

          Looking at current code:
          private void deleteNodeFailure(String path) {
          LOG.fatal("logic failure, failing to delete a node should never happen " +
          "because delete has infinite retries");

          since the above failure should never happen, the correct action would be to rethrown the exception - for both synchronous and async cases.

          • Ted

          On 2011-12-22 00:31:23, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3292/

          -----------------------------------------------------------

          (Updated 2011-12-22 00:31:23)

          Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl.

          Summary

          -------

          In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem.

          It used to delete the node always.

          This addresses bug HBASE-5081.

          https://issues.apache.org/jira/browse/HBASE-5081

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1

          src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8

          Diff: https://reviews.apache.org/r/3292/diff

          Testing

          -------

          mvn -Dtest=TestDistributedLogSplitting clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4069 ----------------------------------------------------------- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java < https://reviews.apache.org/r/3292/#comment9171 > Looking at current code: private void deleteNodeFailure(String path) { LOG.fatal("logic failure, failing to delete a node should never happen " + "because delete has infinite retries"); since the above failure should never happen, the correct action would be to rethrown the exception - for both synchronous and async cases. Ted On 2011-12-22 00:31:23, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ ----------------------------------------------------------- (Updated 2011-12-22 00:31:23) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Summary ------- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081 . https://issues.apache.org/jira/browse/HBASE-5081 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1 src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8 Diff: https://reviews.apache.org/r/3292/diff Testing ------- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy
          Hide
          stack added a comment -

          Ugh. Sorry to whoever is trying to follow along. I applied the patch a few hours ago and then backed it out because it looked like ongoing review still (Lars had comments). I then just reapplied it because Lars gave +1 but looks like Ted and Jimmy are still chatting above so I backed it out again for now.

          Show
          stack added a comment - Ugh. Sorry to whoever is trying to follow along. I applied the patch a few hours ago and then backed it out because it looked like ongoing review still (Lars had comments). I then just reapplied it because Lars gave +1 but looks like Ted and Jimmy are still chatting above so I backed it out again for now.
          Hide
          Ted Yu added a comment -

          I can understand the following code from HLogSplitter.java:

                  LOG.fatal(this.getName() + " Got while writing log entry to log", e);
                  throw e;
          

          But not this code from SplitLogWorker.java:

              } catch (KeeperException.NoNodeException e) {
                LOG.fatal("logic error - end task " + path + " " + ts +
                    " failed because task doesn't exist", e);
              } catch (KeeperException e) {
          

          I think we should add rethrowing in the above case for NoNodeException in this JIRA.

          Show
          Ted Yu added a comment - I can understand the following code from HLogSplitter.java: LOG.fatal( this .getName() + " Got while writing log entry to log" , e); throw e; But not this code from SplitLogWorker.java: } catch (KeeperException.NoNodeException e) { LOG.fatal( "logic error - end task " + path + " " + ts + " failed because task doesn't exist" , e); } catch (KeeperException e) { I think we should add rethrowing in the above case for NoNodeException in this JIRA.
          Hide
          Ted Yu added a comment -

          I found 4 other occasions of LOG.fatal() without (re)throwing in SplitLogManager.java, such as:

              public void setBatch(TaskBatch batch) {
                if (batch != null && this.batch != null) {
                  LOG.fatal("logic error - batch being overwritten");
                }
          

          I think these 4 places should throw proper exception.

          Show
          Ted Yu added a comment - I found 4 other occasions of LOG.fatal() without (re)throwing in SplitLogManager.java, such as: public void setBatch(TaskBatch batch) { if (batch != null && this .batch != null ) { LOG.fatal( "logic error - batch being overwritten" ); } I think these 4 places should throw proper exception.
          Hide
          stack added a comment -

          Above comments seem a little outside the scope of this issue Zhihong?

          Show
          stack added a comment - Above comments seem a little outside the scope of this issue Zhihong?
          Hide
          Hudson added a comment -

          Integrated in HBase-TRUNK-security #40 (See https://builds.apache.org/job/HBase-TRUNK-security/40/)
          HBASE-5081 Distributed log splitting deleteNode races againsth splitLog retry; REVERT – COMMITTED BEFORE REVIEW FINISHED – AGAINpatch -p0 -R < x.txt patch -p0 -R < x.txt
          HBASE-5081 Distributed log splitting deleteNode races againsth splitLog retry; REAPPLY
          HBASE-5081 Distributed log splitting deleteNode races againsth splitLog retry; REVERT – COMMITTED BEFORE REVIEW FINISHED
          HBASE-5081 Distributed log splitting deleteNode races againsth splitLog retry

          stack :
          Files :

          • /hbase/trunk/CHANGES.txt
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java

          stack :
          Files :

          • /hbase/trunk/CHANGES.txt
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java

          stack :
          Files :

          • /hbase/trunk/CHANGES.txt
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java

          stack :
          Files :

          • /hbase/trunk/CHANGES.txt
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java
          Show
          Hudson added a comment - Integrated in HBase-TRUNK-security #40 (See https://builds.apache.org/job/HBase-TRUNK-security/40/ ) HBASE-5081 Distributed log splitting deleteNode races againsth splitLog retry; REVERT – COMMITTED BEFORE REVIEW FINISHED – AGAINpatch -p0 -R < x.txt patch -p0 -R < x.txt HBASE-5081 Distributed log splitting deleteNode races againsth splitLog retry; REAPPLY HBASE-5081 Distributed log splitting deleteNode races againsth splitLog retry; REVERT – COMMITTED BEFORE REVIEW FINISHED HBASE-5081 Distributed log splitting deleteNode races againsth splitLog retry stack : Files : /hbase/trunk/CHANGES.txt /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java stack : Files : /hbase/trunk/CHANGES.txt /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java stack : Files : /hbase/trunk/CHANGES.txt /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java stack : Files : /hbase/trunk/CHANGES.txt /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java
          Hide
          Hudson added a comment -

          Integrated in HBase-0.92-security #48 (See https://builds.apache.org/job/HBase-0.92-security/48/)
          HBASE-5081 Distributed log splitting deleteNode races againsth splitLog retry; REVERT – COMMITTED BEFORE REVIEW FINISHED – AGAIN
          HBASE-5081 Distributed log splitting deleteNode races againsth splitLog retry; REAPPLY
          HBASE-5081 Distributed log splitting deleteNode races againsth splitLog retry; REVERT – COMMITTED BEFORE REVIEW FINISHED
          HBASE-5081 Distributed log splitting deleteNode races againsth splitLog retry

          stack :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java

          stack :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java

          stack :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java

          stack :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java
          Show
          Hudson added a comment - Integrated in HBase-0.92-security #48 (See https://builds.apache.org/job/HBase-0.92-security/48/ ) HBASE-5081 Distributed log splitting deleteNode races againsth splitLog retry; REVERT – COMMITTED BEFORE REVIEW FINISHED – AGAIN HBASE-5081 Distributed log splitting deleteNode races againsth splitLog retry; REAPPLY HBASE-5081 Distributed log splitting deleteNode races againsth splitLog retry; REVERT – COMMITTED BEFORE REVIEW FINISHED HBASE-5081 Distributed log splitting deleteNode races againsth splitLog retry stack : Files : /hbase/branches/0.92/CHANGES.txt /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java stack : Files : /hbase/branches/0.92/CHANGES.txt /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java stack : Files : /hbase/branches/0.92/CHANGES.txt /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java stack : Files : /hbase/branches/0.92/CHANGES.txt /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java
          Hide
          Hudson added a comment -

          Integrated in HBase-0.92 #209 (See https://builds.apache.org/job/HBase-0.92/209/)
          HBASE-5081 Distributed log splitting deleteNode races againsth splitLog retry; REVERT – COMMITTED BEFORE REVIEW FINISHED – AGAIN
          HBASE-5081 Distributed log splitting deleteNode races againsth splitLog retry; REAPPLY

          stack :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java

          stack :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java
          Show
          Hudson added a comment - Integrated in HBase-0.92 #209 (See https://builds.apache.org/job/HBase-0.92/209/ ) HBASE-5081 Distributed log splitting deleteNode races againsth splitLog retry; REVERT – COMMITTED BEFORE REVIEW FINISHED – AGAIN HBASE-5081 Distributed log splitting deleteNode races againsth splitLog retry; REAPPLY stack : Files : /hbase/branches/0.92/CHANGES.txt /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java stack : Files : /hbase/branches/0.92/CHANGES.txt /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java
          Hide
          Hudson added a comment -

          Integrated in HBase-TRUNK #2568 (See https://builds.apache.org/job/HBase-TRUNK/2568/)
          HBASE-5081 Distributed log splitting deleteNode races againsth splitLog retry; REVERT – COMMITTED BEFORE REVIEW FINISHED – AGAINpatch -p0 -R < x.txt patch -p0 -R < x.txt
          HBASE-5081 Distributed log splitting deleteNode races againsth splitLog retry; REAPPLY

          stack :
          Files :

          • /hbase/trunk/CHANGES.txt
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java

          stack :
          Files :

          • /hbase/trunk/CHANGES.txt
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java
          Show
          Hudson added a comment - Integrated in HBase-TRUNK #2568 (See https://builds.apache.org/job/HBase-TRUNK/2568/ ) HBASE-5081 Distributed log splitting deleteNode races againsth splitLog retry; REVERT – COMMITTED BEFORE REVIEW FINISHED – AGAINpatch -p0 -R < x.txt patch -p0 -R < x.txt HBASE-5081 Distributed log splitting deleteNode races againsth splitLog retry; REAPPLY stack : Files : /hbase/trunk/CHANGES.txt /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java stack : Files : /hbase/trunk/CHANGES.txt /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3292/#review4073
          -----------------------------------------------------------

          This patch seems to be raising too many questions. Should we try going other route of ensuring the async delete removes the 'right' task?

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          <https://reviews.apache.org/r/3292/#comment9172>

          Why check for success up here rather than down inside the synchronize on task.batch? Why not do this safeToDeleteNodeAsync in there in the else clause where we up the count of errors? Is it not safe to do the delete of zk node NOW under the synchronize block?

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          <https://reviews.apache.org/r/3292/#comment9174>

          Are we duplicating the code from deleteNode here? Should we have sync/async versions?

          • Michael

          On 2011-12-22 00:31:23, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3292/

          -----------------------------------------------------------

          (Updated 2011-12-22 00:31:23)

          Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl.

          Summary

          -------

          In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem.

          It used to delete the node always.

          This addresses bug HBASE-5081.

          https://issues.apache.org/jira/browse/HBASE-5081

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1

          src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8

          Diff: https://reviews.apache.org/r/3292/diff

          Testing

          -------

          mvn -Dtest=TestDistributedLogSplitting clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4073 ----------------------------------------------------------- This patch seems to be raising too many questions. Should we try going other route of ensuring the async delete removes the 'right' task? src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java < https://reviews.apache.org/r/3292/#comment9172 > Why check for success up here rather than down inside the synchronize on task.batch? Why not do this safeToDeleteNodeAsync in there in the else clause where we up the count of errors? Is it not safe to do the delete of zk node NOW under the synchronize block? src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java < https://reviews.apache.org/r/3292/#comment9174 > Are we duplicating the code from deleteNode here? Should we have sync/async versions? Michael On 2011-12-22 00:31:23, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ ----------------------------------------------------------- (Updated 2011-12-22 00:31:23) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Summary ------- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081 . https://issues.apache.org/jira/browse/HBASE-5081 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1 src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8 Diff: https://reviews.apache.org/r/3292/diff Testing ------- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3292/#review4076
          -----------------------------------------------------------

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          <https://reviews.apache.org/r/3292/#comment9178>

          Since the value of status won't change, I think it is better to call deleteNodeNow() here.
          If we call deleteNodeNow() at line 360, we hold the lock much longer.

          • Ted

          On 2011-12-22 00:31:23, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3292/

          -----------------------------------------------------------

          (Updated 2011-12-22 00:31:23)

          Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl.

          Summary

          -------

          In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem.

          It used to delete the node always.

          This addresses bug HBASE-5081.

          https://issues.apache.org/jira/browse/HBASE-5081

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1

          src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8

          Diff: https://reviews.apache.org/r/3292/diff

          Testing

          -------

          mvn -Dtest=TestDistributedLogSplitting clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4076 ----------------------------------------------------------- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java < https://reviews.apache.org/r/3292/#comment9178 > Since the value of status won't change, I think it is better to call deleteNodeNow() here. If we call deleteNodeNow() at line 360, we hold the lock much longer. Ted On 2011-12-22 00:31:23, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ ----------------------------------------------------------- (Updated 2011-12-22 00:31:23) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Summary ------- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081 . https://issues.apache.org/jira/browse/HBASE-5081 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1 src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8 Diff: https://reviews.apache.org/r/3292/diff Testing ------- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3292/#review4077
          -----------------------------------------------------------

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          <https://reviews.apache.org/r/3292/#comment9179>

          DeleteAsyncCallback is only used by deleteNode().
          I think we should simplify logic by removing deleteNode() and DeleteAsyncCallback - deleteNodeNow() uses RecoverableZooKeeper which has the retry logic.

          • Ted

          On 2011-12-22 00:31:23, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3292/

          -----------------------------------------------------------

          (Updated 2011-12-22 00:31:23)

          Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl.

          Summary

          -------

          In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem.

          It used to delete the node always.

          This addresses bug HBASE-5081.

          https://issues.apache.org/jira/browse/HBASE-5081

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1

          src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8

          Diff: https://reviews.apache.org/r/3292/diff

          Testing

          -------

          mvn -Dtest=TestDistributedLogSplitting clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4077 ----------------------------------------------------------- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java < https://reviews.apache.org/r/3292/#comment9179 > DeleteAsyncCallback is only used by deleteNode(). I think we should simplify logic by removing deleteNode() and DeleteAsyncCallback - deleteNodeNow() uses RecoverableZooKeeper which has the retry logic. Ted On 2011-12-22 00:31:23, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ ----------------------------------------------------------- (Updated 2011-12-22 00:31:23) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Summary ------- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081 . https://issues.apache.org/jira/browse/HBASE-5081 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1 src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8 Diff: https://reviews.apache.org/r/3292/diff Testing ------- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy
          Hide
          jiraposter@reviews.apache.org added a comment -

          On 2011-12-22 15:24:02, Ted Yu wrote:

          > src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java, line 349

          > <https://reviews.apache.org/r/3292/diff/8/?file=65682#file65682line349>

          >

          > Since the value of status won't change, I think it is better to call deleteNodeNow() here.

          > If we call deleteNodeNow() at line 360, we hold the lock much longer.

          Right.

          • Jimmy

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3292/#review4076
          -----------------------------------------------------------

          On 2011-12-22 00:31:23, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3292/

          -----------------------------------------------------------

          (Updated 2011-12-22 00:31:23)

          Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl.

          Summary

          -------

          In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem.

          It used to delete the node always.

          This addresses bug HBASE-5081.

          https://issues.apache.org/jira/browse/HBASE-5081

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1

          src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8

          Diff: https://reviews.apache.org/r/3292/diff

          Testing

          -------

          mvn -Dtest=TestDistributedLogSplitting clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - On 2011-12-22 15:24:02, Ted Yu wrote: > src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java, line 349 > < https://reviews.apache.org/r/3292/diff/8/?file=65682#file65682line349 > > > Since the value of status won't change, I think it is better to call deleteNodeNow() here. > If we call deleteNodeNow() at line 360, we hold the lock much longer. Right. Jimmy ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4076 ----------------------------------------------------------- On 2011-12-22 00:31:23, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ ----------------------------------------------------------- (Updated 2011-12-22 00:31:23) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Summary ------- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081 . https://issues.apache.org/jira/browse/HBASE-5081 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1 src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8 Diff: https://reviews.apache.org/r/3292/diff Testing ------- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy
          Hide
          jiraposter@reviews.apache.org added a comment -

          On 2011-12-22 15:34:13, Ted Yu wrote:

          > src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java, line 373

          > <https://reviews.apache.org/r/3292/diff/8/?file=65682#file65682line373>

          >

          > DeleteAsyncCallback is only used by deleteNode().

          > I think we should simplify logic by removing deleteNode() and DeleteAsyncCallback - deleteNodeNow() uses RecoverableZooKeeper which has the retry logic.

          The difference is that deleteNode has unlimited retries. RecoverableZooKeeper doesn't. It has only 3 retries by default.

          • Jimmy

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3292/#review4077
          -----------------------------------------------------------

          On 2011-12-22 00:31:23, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3292/

          -----------------------------------------------------------

          (Updated 2011-12-22 00:31:23)

          Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl.

          Summary

          -------

          In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem.

          It used to delete the node always.

          This addresses bug HBASE-5081.

          https://issues.apache.org/jira/browse/HBASE-5081

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1

          src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8

          Diff: https://reviews.apache.org/r/3292/diff

          Testing

          -------

          mvn -Dtest=TestDistributedLogSplitting clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - On 2011-12-22 15:34:13, Ted Yu wrote: > src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java, line 373 > < https://reviews.apache.org/r/3292/diff/8/?file=65682#file65682line373 > > > DeleteAsyncCallback is only used by deleteNode(). > I think we should simplify logic by removing deleteNode() and DeleteAsyncCallback - deleteNodeNow() uses RecoverableZooKeeper which has the retry logic. The difference is that deleteNode has unlimited retries. RecoverableZooKeeper doesn't. It has only 3 retries by default. Jimmy ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4077 ----------------------------------------------------------- On 2011-12-22 00:31:23, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ ----------------------------------------------------------- (Updated 2011-12-22 00:31:23) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Summary ------- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081 . https://issues.apache.org/jira/browse/HBASE-5081 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1 src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8 Diff: https://reviews.apache.org/r/3292/diff Testing ------- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy
          Hide
          jiraposter@reviews.apache.org added a comment -

          On 2011-12-22 15:08:38, Michael Stack wrote:

          > This patch seems to be raising too many questions. Should we try going other route of ensuring the async delete removes the 'right' task?

          Due to the race issue, we have to put more than the "filename" in the node and the hashmap, so as to removes the right task.
          That's much bigger change and will raise more questions.

          On 2011-12-22 15:08:38, Michael Stack wrote:

          > src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java, line 349

          > <https://reviews.apache.org/r/3292/diff/8/?file=65682#file65682line349>

          >

          > Why check for success up here rather than down inside the synchronize on task.batch? Why not do this safeToDeleteNodeAsync in there in the else clause where we up the count of errors? Is it not safe to do the delete of zk node NOW under the synchronize block?

          It is safe to do the delete under the synchronize block. The reason is that I don't want to hold the lock on task.batch while delete the node synchronously.

          On 2011-12-22 15:08:38, Michael Stack wrote:

          > src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java, line 378

          > <https://reviews.apache.org/r/3292/diff/8/?file=65682#file65682line378>

          >

          > Are we duplicating the code from deleteNode here? Should we have sync/async versions?

          deleteNode is the async version. deleteNodeNow is the sync version. The async version can have unlimited retries. The sync version can retry up to certain configured number (3 by default).
          So the sync version doesn't guarantee it will be deleted. The code wise, it's hard to reuse.

          • Jimmy

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3292/#review4073
          -----------------------------------------------------------

          On 2011-12-22 00:31:23, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3292/

          -----------------------------------------------------------

          (Updated 2011-12-22 00:31:23)

          Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl.

          Summary

          -------

          In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem.

          It used to delete the node always.

          This addresses bug HBASE-5081.

          https://issues.apache.org/jira/browse/HBASE-5081

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1

          src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8

          Diff: https://reviews.apache.org/r/3292/diff

          Testing

          -------

          mvn -Dtest=TestDistributedLogSplitting clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - On 2011-12-22 15:08:38, Michael Stack wrote: > This patch seems to be raising too many questions. Should we try going other route of ensuring the async delete removes the 'right' task? Due to the race issue, we have to put more than the "filename" in the node and the hashmap, so as to removes the right task. That's much bigger change and will raise more questions. On 2011-12-22 15:08:38, Michael Stack wrote: > src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java, line 349 > < https://reviews.apache.org/r/3292/diff/8/?file=65682#file65682line349 > > > Why check for success up here rather than down inside the synchronize on task.batch? Why not do this safeToDeleteNodeAsync in there in the else clause where we up the count of errors? Is it not safe to do the delete of zk node NOW under the synchronize block? It is safe to do the delete under the synchronize block. The reason is that I don't want to hold the lock on task.batch while delete the node synchronously. On 2011-12-22 15:08:38, Michael Stack wrote: > src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java, line 378 > < https://reviews.apache.org/r/3292/diff/8/?file=65682#file65682line378 > > > Are we duplicating the code from deleteNode here? Should we have sync/async versions? deleteNode is the async version. deleteNodeNow is the sync version. The async version can have unlimited retries. The sync version can retry up to certain configured number (3 by default). So the sync version doesn't guarantee it will be deleted. The code wise, it's hard to reuse. Jimmy ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4073 ----------------------------------------------------------- On 2011-12-22 00:31:23, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ ----------------------------------------------------------- (Updated 2011-12-22 00:31:23) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Summary ------- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081 . https://issues.apache.org/jira/browse/HBASE-5081 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1 src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8 Diff: https://reviews.apache.org/r/3292/diff Testing ------- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy
          Hide
          stack added a comment -

          @Jimmy Another tack would be ensuring splitLogDistributed has cleaned-up after itself before it returns including clean up on early-out because of exception. It seems like we will rerun the split if we early-out if OrphanHLogAfterSplitException is thrown ONLY (Is this what happened in your scenario? You say three log splits failed? Was it because a new log file showed up: i.e. OrphanHLogAfterSplitException? Or for some other reason? If for some other reason, the split should have failed?). I'd think that if a new file shows up while we were splitting, its fine to redo the split but I'd think that splitLogDistibuted would make sure it'd cleaned up after itself before it returned... that it had completed the batch it had been asked do.

          I was waiting on this issue to be done before cutting the RC but after looking at the pieces, I think that while this an important issue, my thinking is that it rare so I won't hold up the RC.

          Good stuff.

          Show
          stack added a comment - @Jimmy Another tack would be ensuring splitLogDistributed has cleaned-up after itself before it returns including clean up on early-out because of exception. It seems like we will rerun the split if we early-out if OrphanHLogAfterSplitException is thrown ONLY (Is this what happened in your scenario? You say three log splits failed? Was it because a new log file showed up: i.e. OrphanHLogAfterSplitException? Or for some other reason? If for some other reason, the split should have failed?). I'd think that if a new file shows up while we were splitting, its fine to redo the split but I'd think that splitLogDistibuted would make sure it'd cleaned up after itself before it returned... that it had completed the batch it had been asked do. I was waiting on this issue to be done before cutting the RC but after looking at the pieces, I think that while this an important issue, my thinking is that it rare so I won't hold up the RC. Good stuff.
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3292/#review4089
          -----------------------------------------------------------

          I feel that the proper fix should go in the method createTaskIfAbsent()

          before attempting to delete a task in zk, task.deleted is set to true. The task is not removed from tasks array until task is successfully removed from zk.

          In createTaskIfAbsent() when you find a deleted task we should do the following

          • If the task had completed successfully then return null. (It is as if the task is completed right away).
          • if the task had completed unsuccessfully then block (with timeouts) until the task is removed from the tasks array.

          Without fixing anything, the problem, I think is present only in the following scenario

          • at startup the master acquires orphan tasks listed in zookeeper. One of these orphan tasks fails. Before that orphan task could be deleted some master thread asks for that task to be completed. As things currently stand, the SplitLogManager will reply with SUCCESS immediately. (This is because of the logic in createTaskIfAbsent())

          The common case where this race happens should work ...

          • a master thread asks for a log dir to be split. That task fails but it has not been deleted from zk yet nor removed from tasks yet. The log-dir-split is retried and the retry finds the old, soon to be deleted task. But the retry will also see that task.batch is set and it will immediately throw an error saying 'someone else is waiting for this task'. And the next time log-dir-split is retried the tasks map might have been cleared and things will work.

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          <https://reviews.apache.org/r/3292/#comment9193>

          The task corresponding to this path has to be removed from the tasks map (as in deleteNodeSuccess())

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          <https://reviews.apache.org/r/3292/#comment9194>

          I guess this should be considered an error that the delete did not go through?

          • Prakash

          On 2011-12-22 00:31:23, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3292/

          -----------------------------------------------------------

          (Updated 2011-12-22 00:31:23)

          Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl.

          Summary

          -------

          In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem.

          It used to delete the node always.

          This addresses bug HBASE-5081.

          https://issues.apache.org/jira/browse/HBASE-5081

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1

          src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8

          Diff: https://reviews.apache.org/r/3292/diff

          Testing

          -------

          mvn -Dtest=TestDistributedLogSplitting clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4089 ----------------------------------------------------------- I feel that the proper fix should go in the method createTaskIfAbsent() before attempting to delete a task in zk, task.deleted is set to true. The task is not removed from tasks array until task is successfully removed from zk. In createTaskIfAbsent() when you find a deleted task we should do the following If the task had completed successfully then return null. (It is as if the task is completed right away). if the task had completed unsuccessfully then block (with timeouts) until the task is removed from the tasks array. Without fixing anything, the problem, I think is present only in the following scenario at startup the master acquires orphan tasks listed in zookeeper. One of these orphan tasks fails. Before that orphan task could be deleted some master thread asks for that task to be completed. As things currently stand, the SplitLogManager will reply with SUCCESS immediately. (This is because of the logic in createTaskIfAbsent()) The common case where this race happens should work ... a master thread asks for a log dir to be split. That task fails but it has not been deleted from zk yet nor removed from tasks yet. The log-dir-split is retried and the retry finds the old, soon to be deleted task. But the retry will also see that task.batch is set and it will immediately throw an error saying 'someone else is waiting for this task'. And the next time log-dir-split is retried the tasks map might have been cleared and things will work. src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java < https://reviews.apache.org/r/3292/#comment9193 > The task corresponding to this path has to be removed from the tasks map (as in deleteNodeSuccess()) src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java < https://reviews.apache.org/r/3292/#comment9194 > I guess this should be considered an error that the delete did not go through? Prakash On 2011-12-22 00:31:23, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ ----------------------------------------------------------- (Updated 2011-12-22 00:31:23) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Summary ------- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081 . https://issues.apache.org/jira/browse/HBASE-5081 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1 src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8 Diff: https://reviews.apache.org/r/3292/diff Testing ------- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy
          Hide
          Ted Yu added a comment -

          @Prakash:
          Thanks for weighing in.

          If possible, do you want to show us your proposal through a patch so that our discussion is more concrete ?

          @Jimmy:
          Prakash's analysis resonates with Stack's comments about OrphanHLogAfterSplitException.
          Is it possible for you to collect master and region server logs so that we can verify the analysis ?

          Thanks

          Show
          Ted Yu added a comment - @Prakash: Thanks for weighing in. If possible, do you want to show us your proposal through a patch so that our discussion is more concrete ? @Jimmy: Prakash's analysis resonates with Stack's comments about OrphanHLogAfterSplitException. Is it possible for you to collect master and region server logs so that we can verify the analysis ? Thanks
          Hide
          Lars Hofhansl added a comment -

          @Prakash: I like that idea. Keeps the logic where it belongs.

          Show
          Lars Hofhansl added a comment - @Prakash: I like that idea. Keeps the logic where it belongs.
          Hide
          jiraposter@reviews.apache.org added a comment -

          On 2011-12-22 22:55:55, Prakash Khemani wrote:

          > I feel that the proper fix should go in the method createTaskIfAbsent()

          >

          > before attempting to delete a task in zk, task.deleted is set to true. The task is not removed from tasks array until task is successfully removed from zk.

          >

          > In createTaskIfAbsent() when you find a deleted task we should do the following

          > * If the task had completed successfully then return null. (It is as if the task is completed right away).

          > * if the task had completed unsuccessfully then block (with timeouts) until the task is removed from the tasks array.

          >

          > Without fixing anything, the problem, I think is present only in the following scenario

          > - at startup the master acquires orphan tasks listed in zookeeper. One of these orphan tasks fails. Before that orphan task could be deleted some master thread asks for that task to be completed. As things currently stand, the SplitLogManager will reply with SUCCESS immediately. (This is because of the logic in createTaskIfAbsent())

          >

          > The common case where this race happens should work ...

          > - a master thread asks for a log dir to be split. That task fails but it has not been deleted from zk yet nor removed from tasks yet. The log-dir-split is retried and the retry finds the old, soon to be deleted task. But the retry will also see that task.batch is set and it will immediately throw an error saying 'someone else is waiting for this task'. And the next time log-dir-split is retried the tasks map might have been cleared and things will work.

          "The task is not removed from tasks array until task is successfully removed from zk."

          This seems not correct. stopTrackingTasks() will remove all tasks even if the task is not removed from zk.
          That's why createTaskIfAbsent() can put a new task in the set.

          If we remove stopTrackingTasks(), then the task should be still in tasks, then this alternative will work.
          Will removing stopTrackingTasks() cause other issues? For the second *, how long should we block? If
          the task is still not removed from the tasks array after the timeout, what should we do?

          Can you come up a patch? I am very open to any fix.

          On 2011-12-22 22:55:55, Prakash Khemani wrote:

          > src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java, line 382

          > <https://reviews.apache.org/r/3292/diff/8/?file=65682#file65682line382>

          >

          > The task corresponding to this path has to be removed from the tasks map (as in deleteNodeSuccess())

          It is removed in the stopTrackingTasks() methods, since this one is failed, so batch.installed != batch.done.

          • Jimmy

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3292/#review4089
          -----------------------------------------------------------

          On 2011-12-22 00:31:23, Jimmy Xiang wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/3292/

          -----------------------------------------------------------

          (Updated 2011-12-22 00:31:23)

          Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl.

          Summary

          -------

          In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem.

          It used to delete the node always.

          This addresses bug HBASE-5081.

          https://issues.apache.org/jira/browse/HBASE-5081

          Diffs

          -----

          src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1

          src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8

          Diff: https://reviews.apache.org/r/3292/diff

          Testing

          -------

          mvn -Dtest=TestDistributedLogSplitting clean test

          Thanks,

          Jimmy

          Show
          jiraposter@reviews.apache.org added a comment - On 2011-12-22 22:55:55, Prakash Khemani wrote: > I feel that the proper fix should go in the method createTaskIfAbsent() > > before attempting to delete a task in zk, task.deleted is set to true. The task is not removed from tasks array until task is successfully removed from zk. > > In createTaskIfAbsent() when you find a deleted task we should do the following > * If the task had completed successfully then return null. (It is as if the task is completed right away). > * if the task had completed unsuccessfully then block (with timeouts) until the task is removed from the tasks array. > > Without fixing anything, the problem, I think is present only in the following scenario > - at startup the master acquires orphan tasks listed in zookeeper. One of these orphan tasks fails. Before that orphan task could be deleted some master thread asks for that task to be completed. As things currently stand, the SplitLogManager will reply with SUCCESS immediately. (This is because of the logic in createTaskIfAbsent()) > > The common case where this race happens should work ... > - a master thread asks for a log dir to be split. That task fails but it has not been deleted from zk yet nor removed from tasks yet. The log-dir-split is retried and the retry finds the old, soon to be deleted task. But the retry will also see that task.batch is set and it will immediately throw an error saying 'someone else is waiting for this task'. And the next time log-dir-split is retried the tasks map might have been cleared and things will work. "The task is not removed from tasks array until task is successfully removed from zk." This seems not correct. stopTrackingTasks() will remove all tasks even if the task is not removed from zk. That's why createTaskIfAbsent() can put a new task in the set. If we remove stopTrackingTasks(), then the task should be still in tasks, then this alternative will work. Will removing stopTrackingTasks() cause other issues? For the second *, how long should we block? If the task is still not removed from the tasks array after the timeout, what should we do? Can you come up a patch? I am very open to any fix. On 2011-12-22 22:55:55, Prakash Khemani wrote: > src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java, line 382 > < https://reviews.apache.org/r/3292/diff/8/?file=65682#file65682line382 > > > The task corresponding to this path has to be removed from the tasks map (as in deleteNodeSuccess()) It is removed in the stopTrackingTasks() methods, since this one is failed, so batch.installed != batch.done. Jimmy ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/#review4089 ----------------------------------------------------------- On 2011-12-22 00:31:23, Jimmy Xiang wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3292/ ----------------------------------------------------------- (Updated 2011-12-22 00:31:23) Review request for hbase, Ted Yu, Michael Stack, and Lars Hofhansl. Summary ------- In this patch, after a task is done, we don't delete the node if the task is failed. So that when it's retried later on, there won't be race problem. It used to delete the node always. This addresses bug HBASE-5081 . https://issues.apache.org/jira/browse/HBASE-5081 Diffs ----- src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java 667a8b1 src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java 32ad7e8 Diff: https://reviews.apache.org/r/3292/diff Testing ------- mvn -Dtest=TestDistributedLogSplitting clean test Thanks, Jimmy
          Hide
          Jimmy Xiang added a comment -

          @Stack, it is not an orphan task. It happens in ServerShutdownHandler. It retries the log splitting if the previous one failed for any reason:

          line 178:
          this.services.getExecutorService().submit(this);

          It keep retrying. Should we have a limit here?

          Show
          Jimmy Xiang added a comment - @Stack, it is not an orphan task. It happens in ServerShutdownHandler. It retries the log splitting if the previous one failed for any reason: line 178: this.services.getExecutorService().submit(this); It keep retrying. Should we have a limit here?
          Hide
          Jimmy Xiang added a comment -

          @Prakash, this one didn't happen when the master starts up. It happened when one region server died.

          Show
          Jimmy Xiang added a comment - @Prakash, this one didn't happen when the master starts up. It happened when one region server died.
          Hide
          Ted Yu added a comment -

          Currently SplitLogManager.Task doesn't have flag indicating whether the deletion was caused by successful splitting.
          I think we should introduce such flag through boolean or enum so that stopTrackingTasks() can make better decision.

          Looking at the code from 0.89-fb, I can see some subtle differences between 0.89-fb and TRUNK.
          e.g. task.batch.notify() in 0.89-fb is guarded by a condition:

                if (!task.isOrphan()) {
                  synchronized (task.batch) {
                    if (status == SUCCESS) {
                      task.batch.done++;
                    } else {
                      task.batch.error++;
                    }
                    if ((task.batch.done + task.batch.error) == task.batch.installed) {
                      task.batch.notify();
                    }
                  }
                }
          

          I think we should unify the two codebases so that our observations have common ground.

          Show
          Ted Yu added a comment - Currently SplitLogManager.Task doesn't have flag indicating whether the deletion was caused by successful splitting. I think we should introduce such flag through boolean or enum so that stopTrackingTasks() can make better decision. Looking at the code from 0.89-fb, I can see some subtle differences between 0.89-fb and TRUNK. e.g. task.batch.notify() in 0.89-fb is guarded by a condition: if (!task.isOrphan()) { synchronized (task.batch) { if (status == SUCCESS) { task.batch.done++; } else { task.batch.error++; } if ((task.batch.done + task.batch.error) == task.batch.installed) { task.batch.notify(); } } } I think we should unify the two codebases so that our observations have common ground.
          Hide
          stack added a comment -

          @Jimmy I see. Its a recursion on ioe; Any ioe could get us stuck here retrying split logs of a downed server. I suppose there should be a bound on how many times we try split logs or a particular server.

          Show
          stack added a comment - @Jimmy I see. Its a recursion on ioe; Any ioe could get us stuck here retrying split logs of a downed server. I suppose there should be a bound on how many times we try split logs or a particular server.
          Hide
          Jimmy Xiang added a comment -

          Prakash is going to put up a patch to fix createIfAbsent, which is great alternative.

          Show
          Jimmy Xiang added a comment - Prakash is going to put up a patch to fix createIfAbsent, which is great alternative.
          Hide
          Ted Yu added a comment -

          From Mahadev:
          ZK does not re order async requests. They will be executed in the
          same order as you call them from the same zk handle.

          Show
          Ted Yu added a comment - From Mahadev: ZK does not re order async requests. They will be executed in the same order as you call them from the same zk handle.
          Hide
          Jean-Daniel Cryans added a comment -

          Fixing the typo in the title.

          Show
          Jean-Daniel Cryans added a comment - Fixing the typo in the title.
          Hide
          Prakash Khemani added a comment -

          I rebased and it rebased cleanly for me. Anyway, uploading the format-patch output again to see if it applies.

          Show
          Prakash Khemani added a comment - I rebased and it rebased cleanly for me. Anyway, uploading the format-patch output again to see if it applies.
          Hide
          Ted Yu added a comment -

          @Prakash:
          Please use '--no-prefix' to generate the patch so that Hadoop QA can apply it using -p0

          Show
          Ted Yu added a comment - @Prakash: Please use '--no-prefix' to generate the patch so that Hadoop QA can apply it using -p0
          Hide
          Prakash Khemani added a comment -

          patch with --no-prefix

          Show
          Prakash Khemani added a comment - patch with --no-prefix
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12509357/0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 8 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated -151 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.replication.TestReplication
          org.apache.hadoop.hbase.mapreduce.TestImportTsv
          org.apache.hadoop.hbase.mapred.TestTableMapReduce
          org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat
          org.apache.hadoop.hbase.master.TestSplitLogManager

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/658//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/658//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/658//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12509357/0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 8 new or modified tests. -1 javadoc. The javadoc tool appears to have generated -151 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.replication.TestReplication org.apache.hadoop.hbase.mapreduce.TestImportTsv org.apache.hadoop.hbase.mapred.TestTableMapReduce org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat org.apache.hadoop.hbase.master.TestSplitLogManager Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/658//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/658//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/658//console This message is automatically generated.
          Hide
          Prakash Khemani added a comment -

          fix an (unrelated) test failure in TestSplitLogManager.testRescanCleanup()

          Show
          Prakash Khemani added a comment - fix an (unrelated) test failure in TestSplitLogManager.testRescanCleanup()
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12509366/0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 8 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated -151 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.mapreduce.TestImportTsv
          org.apache.hadoop.hbase.mapred.TestTableMapReduce
          org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/659//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/659//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/659//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12509366/0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 8 new or modified tests. -1 javadoc. The javadoc tool appears to have generated -151 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.mapreduce.TestImportTsv org.apache.hadoop.hbase.mapred.TestTableMapReduce org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/659//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/659//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/659//console This message is automatically generated.
          Hide
          Jimmy Xiang added a comment -

          @Prakash, this patch seems not fixed the problem. Assume splitLog failed, and the deleteNode is queued up, then the splitLog is retried, then
          createTaskIfAbsent is called, the failed task is still in the tasks map, so it's batch info is updated, batch.installed++, and a createNode is queued up. Now, the deleteNode succeeds and on the deleteNodeSuccess callback, the task is removed from the tasks map. Now it will be an orphan task. batch.installed will never be equal to batch.done + batch.err, so the splitLog will hang.

          Can we delete the task from the tasks map in the deleteNode call before calls zookeeper's delete?

          Show
          Jimmy Xiang added a comment - @Prakash, this patch seems not fixed the problem. Assume splitLog failed, and the deleteNode is queued up, then the splitLog is retried, then createTaskIfAbsent is called, the failed task is still in the tasks map, so it's batch info is updated, batch.installed++, and a createNode is queued up. Now, the deleteNode succeeds and on the deleteNodeSuccess callback, the task is removed from the tasks map. Now it will be an orphan task. batch.installed will never be equal to batch.done + batch.err, so the splitLog will hang. Can we delete the task from the tasks map in the deleteNode call before calls zookeeper's delete?
          Hide
          Prakash Khemani added a comment -

          Assuming splitlog failed, the delete of the zk-task-node is queued up,
          splitlog is retried, createTaskIfAbsent is called, the following piece of
          code in createTaskIfAbsent() will be hit (because oldtask status is
          neither IN_PROGRESS nor SUCCESS. Oldtask status is FAILED)

          LOG.warn("Transient problem. Failure because previously failed task" +
          " state still present. Waiting for znode delete callback" +
          " path=" + path);
          return oldtask;

          The splitlog retry will fail immediately with IOException("duplicate log
          split scheduled for "). The caller (master) will wait and retry again.

          On 1/3/12 7:02 PM, "Jimmy Xiang (Commented) (JIRA)" <jira@apache.org>

          Show
          Prakash Khemani added a comment - Assuming splitlog failed, the delete of the zk-task-node is queued up, splitlog is retried, createTaskIfAbsent is called, the following piece of code in createTaskIfAbsent() will be hit (because oldtask status is neither IN_PROGRESS nor SUCCESS. Oldtask status is FAILED) LOG.warn("Transient problem. Failure because previously failed task" + " state still present. Waiting for znode delete callback" + " path=" + path); return oldtask; The splitlog retry will fail immediately with IOException("duplicate log split scheduled for "). The caller (master) will wait and retry again. On 1/3/12 7:02 PM, "Jimmy Xiang (Commented) (JIRA)" <jira@apache.org>
          Hide
          Ted Yu added a comment - - edited

          In the above scenario, would there be many "duplicate log split scheduled for " messages in master log file ?

          I think the latest patch should go through verification in a real cluster.

          Show
          Ted Yu added a comment - - edited In the above scenario, would there be many "duplicate log split scheduled for " messages in master log file ? I think the latest patch should go through verification in a real cluster.
          Hide
          Jimmy Xiang added a comment -

          @Prakash, cool, that's great.

          splitlog is called by master when it starts up, and the server shutdownhandler when a rs dies.
          The master does wait then retry. However, server shutdownhandler doesn't wait. Can we make it
          wait as the master does? ServerShutdownHandler.process().

          Another thing is the resubmit() method, it's called by multiple threads: the monitor chore thread
          and the ZK event thread. Access to task's member fields should be synchronized, or make them volatile.

          Show
          Jimmy Xiang added a comment - @Prakash, cool, that's great. splitlog is called by master when it starts up, and the server shutdownhandler when a rs dies. The master does wait then retry. However, server shutdownhandler doesn't wait. Can we make it wait as the master does? ServerShutdownHandler.process(). Another thing is the resubmit() method, it's called by multiple threads: the monitor chore thread and the ZK event thread. Access to task's member fields should be synchronized, or make them volatile.
          Hide
          Prakash Khemani added a comment -

          implement feedback

          Show
          Prakash Khemani added a comment - implement feedback
          Hide
          Jimmy Xiang added a comment -

          Cool, thanks. Looks good to me.

          Show
          Jimmy Xiang added a comment - Cool, thanks. Looks good to me.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12509441/0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 8 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated -151 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 77 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.mapreduce.TestImportTsv
          org.apache.hadoop.hbase.mapred.TestTableMapReduce
          org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/665//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/665//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/665//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12509441/0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 8 new or modified tests. -1 javadoc. The javadoc tool appears to have generated -151 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 77 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.mapreduce.TestImportTsv org.apache.hadoop.hbase.mapred.TestTableMapReduce org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/665//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/665//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/665//console This message is automatically generated.
          Hide
          stack added a comment -

          These failures seem unrelated to this patch – they happened on previous hadoopqa build for unrelated patch.

          Show
          stack added a comment - These failures seem unrelated to this patch – they happened on previous hadoopqa build for unrelated patch.
          Hide
          stack added a comment -

          I'll commit this unless objection.

          Show
          stack added a comment - I'll commit this unless objection.
          Hide
          Ted Yu added a comment -
          +        if (oldtask.status == FAILURE) {
          +          // wait for status to change to DELETED
          +          try {
          +            oldtask.wait();
          +          } catch (InterruptedException e) {
          +            LOG.warn("Interrupted when waiting for znode delete callback");
          +            // fall through to return failure
          

          Should a loop be introduced wrapping oldtask.wait() ? JVM sometimes produces spurious notification.

          +        LOG.fatal("Logic error. Deleted task still present in tasks map");
          +        assert false : "Deleted task still present in tasks map";
          

          The assertion wouldn't be effective at runtime, right ?
          I think throwing exception would be better.

          Show
          Ted Yu added a comment - + if (oldtask.status == FAILURE) { + // wait for status to change to DELETED + try { + oldtask.wait(); + } catch (InterruptedException e) { + LOG.warn( "Interrupted when waiting for znode delete callback" ); + // fall through to return failure Should a loop be introduced wrapping oldtask.wait() ? JVM sometimes produces spurious notification. + LOG.fatal( "Logic error. Deleted task still present in tasks map" ); + assert false : "Deleted task still present in tasks map" ; The assertion wouldn't be effective at runtime, right ? I think throwing exception would be better.
          Show
          Ted Yu added a comment - See http://stackoverflow.com/questions/1050592/do-spurious-wakeups-actually-happen
          Hide
          Ted Yu added a comment -

          w.r.t. master retry after the following exception is raised:

                  throw new IOException("duplicate log split scheduled for "
                      + lf.getPath());
          

          I see this code in MasterFileSystem.splitLog():

                try {
                  splitLogSize = splitLogManager.splitLogDistributed(logDirs);
                } catch (OrphanHLogAfterSplitException e) {
                  LOG.warn("Retrying distributed splitting for " + serverNames, e);
                  splitLogManager.splitLogDistributed(logDirs);
                }
          

          Would the retry be carried out as described ?

          Show
          Ted Yu added a comment - w.r.t. master retry after the following exception is raised: throw new IOException( "duplicate log split scheduled for " + lf.getPath()); I see this code in MasterFileSystem.splitLog(): try { splitLogSize = splitLogManager.splitLogDistributed(logDirs); } catch (OrphanHLogAfterSplitException e) { LOG.warn( "Retrying distributed splitting for " + serverNames, e); splitLogManager.splitLogDistributed(logDirs); } Would the retry be carried out as described ?
          Hide
          Ted Yu added a comment -

          About the test suite, from https://builds.apache.org/job/PreCommit-HBASE-Build/665/console:

          Running org.apache.hadoop.hbase.master.TestSplitLogManager
          Running org.apache.hadoop.hbase.regionserver.TestSeekOptimizations
          Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.055 sec
          

          It is a pity that the hung test was masked by known failures.

          MAPREDUCE-3583 seems to have gone into oblivion.

          Show
          Ted Yu added a comment - About the test suite, from https://builds.apache.org/job/PreCommit-HBASE-Build/665/console: Running org.apache.hadoop.hbase.master.TestSplitLogManager Running org.apache.hadoop.hbase.regionserver.TestSeekOptimizations Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.055 sec It is a pity that the hung test was masked by known failures. MAPREDUCE-3583 seems to have gone into oblivion.
          Hide
          Prakash Khemani added a comment -

          ...

          If createTaskIfAbsent() returns a non-null oldtask then that is treated as
          an error by the caller. The caller (splitLogDistributed() will throw an
          IOException)

          If the caller thread is interrupted while waiting for status to change to
          DELETED then createTaskIfAbsent returns errror (I.e. Oldtask) to the
          caller. It is better to return with error than to loop. I missed setting
          the interrupt flag on the thread ... Will do that in the next diff.

          If a task is found in the tasks map even after oldtask's state had changed
          to DELETED then the found task is returned as an error. So it is OK even
          if asserts are not enabled at run time. (This case can actually happen if
          another thread sneaks in and create another task with the same name, but
          it shouldn't happen for log-splitting tasks.)

          On 1/4/12 11:47 AM, "Zhihong Yu (Commented) (JIRA)" <jira@apache.org>
          wrote:

          Show
          Prakash Khemani added a comment - ... If createTaskIfAbsent() returns a non-null oldtask then that is treated as an error by the caller. The caller (splitLogDistributed() will throw an IOException) If the caller thread is interrupted while waiting for status to change to DELETED then createTaskIfAbsent returns errror (I.e. Oldtask) to the caller. It is better to return with error than to loop. I missed setting the interrupt flag on the thread ... Will do that in the next diff. If a task is found in the tasks map even after oldtask's state had changed to DELETED then the found task is returned as an error. So it is OK even if asserts are not enabled at run time. (This case can actually happen if another thread sneaks in and create another task with the same name, but it shouldn't happen for log-splitting tasks.) On 1/4/12 11:47 AM, "Zhihong Yu (Commented) (JIRA)" <jira@apache.org> wrote:
          Hide
          Prakash Khemani added a comment -

          The retry logic is in HMaster.splitLogAfterStartup(). I will remove the
          OrphanLogException handling from MasterFileSystem.

          On 1/4/12 12:03 PM, "Zhihong Yu (Commented) (JIRA)" <jira@apache.org>

          Show
          Prakash Khemani added a comment - The retry logic is in HMaster.splitLogAfterStartup(). I will remove the OrphanLogException handling from MasterFileSystem. On 1/4/12 12:03 PM, "Zhihong Yu (Commented) (JIRA)" <jira@apache.org>
          Hide
          Prakash Khemani added a comment -

          Will look into the test failure. I am not sure I know where to find the
          test run's output logs.

          On 1/4/12 12:35 PM, "Zhihong Yu (Commented) (JIRA)" <jira@apache.org>

          Show
          Prakash Khemani added a comment - Will look into the test failure. I am not sure I know where to find the test run's output logs. On 1/4/12 12:35 PM, "Zhihong Yu (Commented) (JIRA)" <jira@apache.org>
          Hide
          Prakash Khemani added a comment -

          set the interrupt flag on InterruptedException. Remove OrphanLogException handling for distributed log splitting.

          Show
          Prakash Khemani added a comment - set the interrupt flag on InterruptedException. Remove OrphanLogException handling for distributed log splitting.
          Hide
          Ted Yu added a comment -

          If the caller thread is interrupted while waiting for status to change to DELETED ...

          My comment @ 04/Jan/12 19:46 was about oldtask.wait() being awakened due to spurious wakeup from JVM.

          You can use the following to mimic the environment of hadoopx build machines when you run unit tests:

          -            <argLine>-enableassertions -Xmx1900m -Djava.security.egd=file:/dev/./urandom</argLine>
          +            <argLine>-d32 -enableassertions -Xmx1900m -Djava.security.egd=file:/dev/./urandom</argLine>
          

          Looking at https://builds.apache.org/job/PreCommit-HBASE-Build/665/testReport/org.apache.hadoop.hbase.master/TestSplitLogManager/, it seems all the tests passed.

          Maybe N Keywal would have some idea about this.

          Show
          Ted Yu added a comment - If the caller thread is interrupted while waiting for status to change to DELETED ... My comment @ 04/Jan/12 19:46 was about oldtask.wait() being awakened due to spurious wakeup from JVM. You can use the following to mimic the environment of hadoopx build machines when you run unit tests: - <argLine>-enableassertions -Xmx1900m -Djava.security.egd=file:/dev/./urandom</argLine> + <argLine>-d32 -enableassertions -Xmx1900m -Djava.security.egd=file:/dev/./urandom</argLine> Looking at https://builds.apache.org/job/PreCommit-HBASE-Build/665/testReport/org.apache.hadoop.hbase.master/TestSplitLogManager/ , it seems all the tests passed. Maybe N Keywal would have some idea about this.
          Hide
          Prakash Khemani added a comment -

          If there is a spurious wakeup before the status has changed to DELETE then
          the code will return error (oldtask) to the caller.

          Regarding the hung TestSplitLogManager test in
          https://builds.apache.org/job/PreCommit-HBASE-Build/665/console - I
          couldn't find what failed or what hung.
          https://builds.apache.org/job/PreCommit-HBASE-Build/665//testReport/org.apa
          che.hadoop.hbase.master/TestSplitLogManager/ shows that everything passed.

          Show
          Prakash Khemani added a comment - If there is a spurious wakeup before the status has changed to DELETE then the code will return error (oldtask) to the caller. Regarding the hung TestSplitLogManager test in https://builds.apache.org/job/PreCommit-HBASE-Build/665/console - I couldn't find what failed or what hung. https://builds.apache.org/job/PreCommit-HBASE-Build/665//testReport/org.apa che.hadoop.hbase.master/TestSplitLogManager/ shows that everything passed.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12509461/0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 8 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated -151 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 77 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in .

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/668//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/668//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/668//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12509461/0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 8 new or modified tests. -1 javadoc. The javadoc tool appears to have generated -151 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 77 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/668//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/668//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/668//console This message is automatically generated.
          Hide
          stack added a comment -

          I'm running tests locally.

          Show
          stack added a comment - I'm running tests locally.
          Hide
          stack added a comment -

          All tests but TestClassLoading (unrelated) pass for me locally:

          ...
          Results :
          
          Failed tests:   testClassLoadingFromHDFS(org.apache.hadoop.hbase.coprocessor.TestClassLoading): Class TestCP1 was missing on a region
          
          Tests in error: 
            testEnableTableRoundRobinAssignment(org.apache.hadoop.hbase.client.TestAdmin): org.apache.hadoop.hbase.TableNotEnabledException: testEnableTableAssignment
          
          Tests run: 792, Failures: 1, Errors: 1, Skipped: 10
          
          [INFO] ------------------------------------------------------------------------
          [ERROR] BUILD FAILURE
          [INFO] ------------------------------------------------------------------------
          [INFO] There are test failures.
          
          Show
          stack added a comment - All tests but TestClassLoading (unrelated) pass for me locally: ... Results : Failed tests: testClassLoadingFromHDFS(org.apache.hadoop.hbase.coprocessor.TestClassLoading): Class TestCP1 was missing on a region Tests in error: testEnableTableRoundRobinAssignment(org.apache.hadoop.hbase.client.TestAdmin): org.apache.hadoop.hbase.TableNotEnabledException: testEnableTableAssignment Tests run: 792, Failures: 1, Errors: 1, Skipped: 10 [INFO] ------------------------------------------------------------------------ [ERROR] BUILD FAILURE [INFO] ------------------------------------------------------------------------ [INFO] There are test failures.
          Hide
          stack added a comment -

          You down w/ my committing Ted?

          Show
          stack added a comment - You down w/ my committing Ted?
          Hide
          Ted Yu added a comment -
          +          try {
          +            oldtask.wait();
          +          } catch (InterruptedException e) {
          

          I think the wait() above should be enclosed in a loop. The rationale is that spurious wakeup may mask successful deletion of old task.

          I changed the if statement to while:

                  while (oldtask.status == FAILURE) {
          

          The following two tests passed:

            850  mt -Dtest=TestSplitLogManager
            851  mt -Dtest=TestDistributedLogSplitting
          

          Also, it would be nice if someone can verify that the problem is fixed in a real cluster.

          I am waiting for a real cluster to perform testing because our 60-node cluster is currently being used for hadoop testing.

          Show
          Ted Yu added a comment - + try { + oldtask.wait(); + } catch (InterruptedException e) { I think the wait() above should be enclosed in a loop. The rationale is that spurious wakeup may mask successful deletion of old task. I changed the if statement to while: while (oldtask.status == FAILURE) { The following two tests passed: 850 mt -Dtest=TestSplitLogManager 851 mt -Dtest=TestDistributedLogSplitting Also, it would be nice if someone can verify that the problem is fixed in a real cluster. I am waiting for a real cluster to perform testing because our 60-node cluster is currently being used for hadoop testing.
          Hide
          stack added a comment -

          Put up your version of the patch Ted and lets commit it. I'll roll a new RC and lets test that rather than individual patch.

          Show
          stack added a comment - Put up your version of the patch Ted and lets commit it. I'll roll a new RC and lets test that rather than individual patch.
          Hide
          Ted Yu added a comment -

          Patch based on Prakash's latest patch.
          Changed 'if (oldtask.status == FAILURE)' to a while loop.
          Also restored @param for the first SplitLogManager ctor

          Show
          Ted Yu added a comment - Patch based on Prakash's latest patch. Changed 'if (oldtask.status == FAILURE)' to a while loop. Also restored @param for the first SplitLogManager ctor
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12509487/5081-deleteNode-with-while-loop.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated -151 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 78 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.mapreduce.TestImportTsv
          org.apache.hadoop.hbase.mapred.TestTableMapReduce
          org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/669//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/669//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/669//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12509487/5081-deleteNode-with-while-loop.txt against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. -1 javadoc. The javadoc tool appears to have generated -151 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 78 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.mapreduce.TestImportTsv org.apache.hadoop.hbase.mapred.TestTableMapReduce org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/669//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/669//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/669//console This message is automatically generated.
          Hide
          Prakash Khemani added a comment -
          +        while (oldtask.status == FAILURE) {
          +          // wait for status to change to DELETED
          +          try {
          +            oldtask.wait();
          +          } catch (InterruptedException e) {
          +            Thread.currentThread().interrupt();
          +            LOG.warn("Interrupted when waiting for znode delete
          callback");
          +            // fall through to return failure
                     }
          -          oldtask.setBatch(batch);
                   }
          
          

          Changing the 'if' to 'while' is OK. But in case of interruptedexception
          you should exit the while loop and fall through and return.

          If you don't return on interrupt then there is a good possibility of
          deadlock when the process is trying to exit.

          Also there is no point calling oldtask.wait() with the thread's interrupt
          set. It will immediately throw InterruptedException again.

          Show
          Prakash Khemani added a comment - + while (oldtask.status == FAILURE) { + // wait for status to change to DELETED + try { + oldtask.wait(); + } catch (InterruptedException e) { + Thread .currentThread().interrupt(); + LOG.warn("Interrupted when waiting for znode delete callback"); + // fall through to return failure } - oldtask.setBatch(batch); } Changing the 'if' to 'while' is OK. But in case of interruptedexception you should exit the while loop and fall through and return. If you don't return on interrupt then there is a good possibility of deadlock when the process is trying to exit. Also there is no point calling oldtask.wait() with the thread's interrupt set. It will immediately throw InterruptedException again.
          Hide
          Ted Yu added a comment -

          @Prakash:
          Thanks for your sharp eyes. I will upload another patch when I get back home.

          Show
          Ted Yu added a comment - @Prakash: Thanks for your sharp eyes. I will upload another patch when I get back home.
          Hide
          stack added a comment -

          Changing the 'if' to 'while' is OK. But in case of interruptedexception...you should exit the while loop and fall through and return.

          @Ted This would seem to say that the while should just stay an if?

          Otherwise, the same old test failures; it don't seem to be an issue w/ this patch

          Show
          stack added a comment - Changing the 'if' to 'while' is OK. But in case of interruptedexception...you should exit the while loop and fall through and return. @Ted This would seem to say that the while should just stay an if? Otherwise, the same old test failures; it don't seem to be an issue w/ this patch
          Hide
          Ted Yu added a comment -

          I plan to introduce a break statement in the catch block.
          I think that's what Prakash was hinting.

          Show
          Ted Yu added a comment - I plan to introduce a break statement in the catch block. I think that's what Prakash was hinting.
          Hide
          Ted Yu added a comment -

          Patch with break statement in catch block.

          TestDistributedLogSplitting and TestSplitLogManager pass with this patch.

          Show
          Ted Yu added a comment - Patch with break statement in catch block. TestDistributedLogSplitting and TestSplitLogManager pass with this patch.
          Hide
          Lars Hofhansl added a comment -

          Couldn't you just move the try/catch outside the while loop like so:

          try {
            while (oldtask.status == FAILURE) {
              // wait for status to change to DELETED
              oldtask.wait();
            }
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            LOG.warn("Interrupted when waiting for znode delete callback");
            // fall through to return failure
          }
          

          ?

          Show
          Lars Hofhansl added a comment - Couldn't you just move the try/catch outside the while loop like so: try { while (oldtask.status == FAILURE) { // wait for status to change to DELETED oldtask.wait(); } } catch (InterruptedException e) { Thread .currentThread().interrupt(); LOG.warn( "Interrupted when waiting for znode delete callback" ); // fall through to return failure } ?
          Hide
          Prakash Khemani added a comment -

          Latest patch uploaded by Ted looks good. I will try to develop a test case
          for the delayed delete handling.

          On 1/4/12 9:44 PM, "Lars Hofhansl (Commented) (JIRA)" <jira@apache.org>

          Show
          Prakash Khemani added a comment - Latest patch uploaded by Ted looks good. I will try to develop a test case for the delayed delete handling. On 1/4/12 9:44 PM, "Lars Hofhansl (Commented) (JIRA)" <jira@apache.org>
          Hide
          Ted Yu added a comment -

          @Lars:
          I think that should do as well.

          Show
          Ted Yu added a comment - @Lars: I think that should do as well.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12509507/5081-deleteNode-with-while-loop.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated -151 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 78 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.mapreduce.TestImportTsv
          org.apache.hadoop.hbase.mapred.TestTableMapReduce
          org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/670//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/670//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/670//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12509507/5081-deleteNode-with-while-loop.txt against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. -1 javadoc. The javadoc tool appears to have generated -151 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 78 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.mapreduce.TestImportTsv org.apache.hadoop.hbase.mapred.TestTableMapReduce org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/670//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/670//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/670//console This message is automatically generated.
          Hide
          Prakash Khemani added a comment -

          a test added on top of Ted's last change.

          Show
          Prakash Khemani added a comment - a test added on top of Ted's last change.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12509619/0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 8 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated -151 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 79 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.mapreduce.TestImportTsv
          org.apache.hadoop.hbase.mapred.TestTableMapReduce
          org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/674//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/674//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/674//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12509619/0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 8 new or modified tests. -1 javadoc. The javadoc tool appears to have generated -151 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 79 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.mapreduce.TestImportTsv org.apache.hadoop.hbase.mapred.TestTableMapReduce org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/674//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/674//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/674//console This message is automatically generated.
          Hide
          Jimmy Xiang added a comment -

          I tested the patch with the while loop on my 7 nodes cluster and I haven't got any deadlock so far.

          Show
          Jimmy Xiang added a comment - I tested the patch with the while loop on my 7 nodes cluster and I haven't got any deadlock so far.
          Hide
          Ted Yu added a comment -

          The new test Prakash added passed.
          I found an unrelated change in pom.xml

          This patch removes that change.

          Show
          Ted Yu added a comment - The new test Prakash added passed. I found an unrelated change in pom.xml This patch removes that change.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12509628/HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 8 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated -151 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 79 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in .

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/676//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/676//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/676//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12509628/HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 8 new or modified tests. -1 javadoc. The javadoc tool appears to have generated -151 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 79 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/676//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/676//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/676//console This message is automatically generated.
          Hide
          stack added a comment -

          Whoopdee! Lets commit!

          Show
          stack added a comment - Whoopdee! Lets commit!
          Hide
          Ted Yu added a comment -

          Integrated to 0.92 and TRUNK.

          Thanks for the discovery, review and verification, Jimmy.

          Thanks for the patch, Prakash.

          Thanks for the review Stack and Lars.

          Show
          Ted Yu added a comment - Integrated to 0.92 and TRUNK. Thanks for the discovery, review and verification, Jimmy. Thanks for the patch, Prakash. Thanks for the review Stack and Lars.
          Hide
          Jimmy Xiang added a comment -

          With the latest patch, I got something funny again. Please see the screen shot attached.

          Show
          Jimmy Xiang added a comment - With the latest patch, I got something funny again. Please see the screen shot attached.
          Hide
          Ted Yu added a comment -

          Was the log splitting for 1325810744985 ever done ?
          In case of retry, the following code would be executed multiple times:

            public long splitLogDistributed(final List<Path> logDirs) throws IOException {
              MonitoredTask status = TaskMonitor.get().createStatus(
                    "Doing distributed log split in " + logDirs);
          

          leading to multiple MonitoredTask instances.

          The redundant MonitoredTask instances can be handled in a different JIRA.

          Show
          Ted Yu added a comment - Was the log splitting for 1325810744985 ever done ? In case of retry, the following code would be executed multiple times: public long splitLogDistributed( final List<Path> logDirs) throws IOException { MonitoredTask status = TaskMonitor.get().createStatus( "Doing distributed log split in " + logDirs); leading to multiple MonitoredTask instances. The redundant MonitoredTask instances can be handled in a different JIRA.
          Hide
          Jimmy Xiang added a comment -

          It hangs again. In the region server log, I saw some DFS issue. Let me restart the cluster. Hopefully, it will move on.

          Show
          Jimmy Xiang added a comment - It hangs again. In the region server log, I saw some DFS issue. Let me restart the cluster. Hopefully, it will move on.
          Hide
          Hudson added a comment -

          Integrated in HBase-TRUNK #2614 (See https://builds.apache.org/job/HBase-TRUNK/2614/)
          HBASE-5081 Distributed log splitting deleteNode races against splitLog retry (Prakash)

          tedyu :
          Files :

          • /hbase/trunk/CHANGES.txt
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKSplitLog.java
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestDistributedLogSplitting.java
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java
          Show
          Hudson added a comment - Integrated in HBase-TRUNK #2614 (See https://builds.apache.org/job/HBase-TRUNK/2614/ ) HBASE-5081 Distributed log splitting deleteNode races against splitLog retry (Prakash) tedyu : Files : /hbase/trunk/CHANGES.txt /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKSplitLog.java /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestDistributedLogSplitting.java /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java
          Hide
          Ted Yu added a comment -

          We experienced log splitting weirdness due to hdfs problem before (on an earlier version of 0.92)
          I think handling hdfs failure should be done in a separate JIRA.

          Show
          Ted Yu added a comment - We experienced log splitting weirdness due to hdfs problem before (on an earlier version of 0.92) I think handling hdfs failure should be done in a separate JIRA.
          Hide
          Hudson added a comment -

          Integrated in HBase-0.92 #226 (See https://builds.apache.org/job/HBase-0.92/226/)
          HBASE-5081 Distributed log splitting deleteNode races against splitLog retry (Prakash)

          tedyu :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKSplitLog.java
          • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestDistributedLogSplitting.java
          • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java
          Show
          Hudson added a comment - Integrated in HBase-0.92 #226 (See https://builds.apache.org/job/HBase-0.92/226/ ) HBASE-5081 Distributed log splitting deleteNode races against splitLog retry (Prakash) tedyu : Files : /hbase/branches/0.92/CHANGES.txt /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKSplitLog.java /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestDistributedLogSplitting.java /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java
          Hide
          Hudson added a comment -

          Integrated in HBase-TRUNK-security #65 (See https://builds.apache.org/job/HBase-TRUNK-security/65/)
          HBASE-5081 Distributed log splitting deleteNode races against splitLog retry (Prakash)

          tedyu :
          Files :

          • /hbase/trunk/CHANGES.txt
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKSplitLog.java
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestDistributedLogSplitting.java
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java
          Show
          Hudson added a comment - Integrated in HBase-TRUNK-security #65 (See https://builds.apache.org/job/HBase-TRUNK-security/65/ ) HBASE-5081 Distributed log splitting deleteNode races against splitLog retry (Prakash) tedyu : Files : /hbase/trunk/CHANGES.txt /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKSplitLog.java /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestDistributedLogSplitting.java /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java
          Hide
          Hudson added a comment -

          Integrated in HBase-0.92-security #63 (See https://builds.apache.org/job/HBase-0.92-security/63/)
          HBASE-5081 Distributed log splitting deleteNode races against splitLog retry (Prakash)

          tedyu :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKSplitLog.java
          • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestDistributedLogSplitting.java
          • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java
          Show
          Hudson added a comment - Integrated in HBase-0.92-security #63 (See https://builds.apache.org/job/HBase-0.92-security/63/ ) HBASE-5081 Distributed log splitting deleteNode races against splitLog retry (Prakash) tedyu : Files : /hbase/branches/0.92/CHANGES.txt /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogSplitter.java /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKSplitLog.java /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestDistributedLogSplitting.java /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java
          Hide
          stack added a comment -

          @Jimmy After restart, can you get it to hang? Lets open new issue if you can. How are you testing out of interest so I can try it over here. Thanks.

          Show
          stack added a comment - @Jimmy After restart, can you get it to hang? Lets open new issue if you can. How are you testing out of interest so I can try it over here. Thanks.
          Hide
          Jimmy Xiang added a comment -

          After restart, it still doesn't work. See the attached 3rd screen shot.

          Probably we should commit this one and open a new Jira.

          @Stack, to reproduce it, you can set these properties and run bigtop TestLoadAndVerify:

          <property>
          <name>hbase.hregion.max.filesize</name>
          <value>1048576</value>
          </property>
          <property>
          <name>hbase.master.distributed.log.splitting</name>
          <value>true</value>
          </property>

          <property>
          <name>io.file.buffer.size</name>
          <value>131072</value>
          <description>Hadoop setting </description>
          </property>
          <property>
          <name>hbase.balancer.period
          </name>
          <value>2000</value>
          <description>Period at which the region balancer runs in the Master.
          </description>
          </property>
          <property>
          <name>hbase.hregion.memstore.flush.size</name>
          <value>262144</value> <!-- 256KB -->
          </property>

          Show
          Jimmy Xiang added a comment - After restart, it still doesn't work. See the attached 3rd screen shot. Probably we should commit this one and open a new Jira. @Stack, to reproduce it, you can set these properties and run bigtop TestLoadAndVerify: <property> <name>hbase.hregion.max.filesize</name> <value>1048576</value> </property> <property> <name>hbase.master.distributed.log.splitting</name> <value>true</value> </property> <property> <name>io.file.buffer.size</name> <value>131072</value> <description>Hadoop setting </description> </property> <property> <name>hbase.balancer.period </name> <value>2000</value> <description>Period at which the region balancer runs in the Master. </description> </property> <property> <name>hbase.hregion.memstore.flush.size</name> <value>262144</value> <!-- 256KB --> </property>
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12509690/distributed_log_splitting_screenshot3.png
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/685//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12509690/distributed_log_splitting_screenshot3.png against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/685//console This message is automatically generated.
          Hide
          Jimmy Xiang added a comment -

          It turns out all my region servers died. I restarted them (rs) all, and things are looking better now. One folder is completed. Two more to go.

          Show
          Jimmy Xiang added a comment - It turns out all my region servers died. I restarted them (rs) all, and things are looking better now. One folder is completed. Two more to go.
          Hide
          stack added a comment -

          @Jimmy And can you make it hang subsequently? Thanks for the prescription on how to repo. Will try it...

          Show
          stack added a comment - @Jimmy And can you make it hang subsequently? Thanks for the prescription on how to repo. Will try it...
          Hide
          Jimmy Xiang added a comment -

          Now, all logs are split. I am happy with the patch.

          Show
          Jimmy Xiang added a comment - Now, all logs are split. I am happy with the patch.
          Hide
          Jimmy Xiang added a comment -

          @Stack, yes, it will screw up the cluster (7 nodes).

          Show
          Jimmy Xiang added a comment - @Stack, yes, it will screw up the cluster (7 nodes).
          Hide
          Ted Yu added a comment -

          Will open new issue if other distributed log splitting bug is discovered.

          Show
          Ted Yu added a comment - Will open new issue if other distributed log splitting bug is discovered.

            People

            • Assignee:
              Prakash Khemani
              Reporter:
              Jimmy Xiang
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development