      Description

      After HDFS-5583, TestBlockRecovery.testRaceBetweenReplicaRecoveryAndFinalizeBlock started failing. It seems HDFS-5583 exposed a bug.

      When a receiver thread is interrupted, it is supposed to interrupt the responder and join on it, with a configurable join timeout. Before HDFS-5583 this is not what actually happened. HDFS-5583 fixed it, and the test case that depended on the broken behavior is now failing.

        Activity

        Kihwal Lee added a comment - edited

        Before HDFS-5583, the interrupted flag was not consumed before join(), so join() always threw InterruptedException right away and it never actually worked. I noticed unexpected early termination of threads and found the uncleared flag to be the cause.
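
        The flag behavior described above can be reproduced outside the DataNode. The following is a minimal sketch, not the actual receiver code; the class and method names are hypothetical. It relies only on standard Java behavior: Thread.join() throws InterruptedException right away if the caller's interrupted flag is already set, and Thread.interrupted() consumes the flag so that a later join() really waits.

```java
class JoinInterruptDemo {
    /**
     * Hypothetical helper: returns true if join() threw InterruptedException
     * instead of waiting for the other thread.
     */
    static boolean joinThrowsImmediately(boolean clearFlagFirst) throws InterruptedException {
        // Stand-in for the responder: runs briefly, then exits on its own.
        Thread sleeper = new Thread(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException ignored) {
                // not expected in this sketch
            }
        });
        sleeper.start();

        // Simulate the receiver having been interrupted: the flag is now set.
        Thread.currentThread().interrupt();
        if (clearFlagFirst) {
            // Consume the flag before joining.
            Thread.interrupted();
        }
        try {
            sleeper.join(1000);
            return false; // join() actually waited for the sleeper
        } catch (InterruptedException e) {
            return true;  // flag was still set, so join() bailed out at once
        } finally {
            Thread.interrupted(); // make sure the flag is clear before cleanup
            sleeper.join();       // let the sleeper finish
        }
    }
}
```

        With the flag left set, the call returns true and the join never waits; with the flag cleared first, it returns false.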

        There are two flaws.

        1) In the failing test case, the responder thread is blocked on a synchronized method because the test calls another synchronized method on the same object first and holds the lock. Since a thread waiting to enter a synchronized method cannot be interrupted, the responder does not terminate. Before the uncleared flag issue was fixed, the receiver would blow up right away and the synchronized method being called by the test case would return (its join on the receiver returns). The blocked responder was not in the critical path, since join() on the responder was never actually performed. The responder eventually unblocked and terminated on its own later.

        The correct test would either increase the test timeout to be longer than the join timeout ("dfs.datanode.xceiver.stop.timeout.millis") or set the join timeout to be shorter.

        2) stopWriter() uses the same join() timeout as the receiver uses when joining on the responder. This means that if join() on the responder runs until it times out, stopWriter() will likely time out as well. A shorter timeout should be used when joining on the responder.
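
        The situation in flaw 1 can be sketched in isolation. This is an illustrative example only, with hypothetical names, and not the test or DataNode code: a thread waiting to acquire a monitor ignores interrupt(), so a timed join() on it simply runs out its timeout while the lock is held.

```java
import java.util.concurrent.CountDownLatch;

class MonitorBlockDemo {
    private static final Object LOCK = new Object();

    /**
     * Hypothetical helper: returns true if the "responder" is still alive after
     * being interrupted and joined with a timeout while the caller holds LOCK.
     * joinTimeoutMs must be greater than 0 (join(0) would wait forever).
     */
    static boolean responderSurvivesInterrupt(long joinTimeoutMs) throws InterruptedException {
        CountDownLatch started = new CountDownLatch(1);
        Thread responder = new Thread(() -> {
            started.countDown();
            synchronized (LOCK) {
                // Stand-in for the synchronized method the responder needs.
            }
        });

        boolean alive;
        synchronized (LOCK) {          // plays the role of the test holding the lock
            responder.start();
            started.await();           // responder is running and will block on LOCK
            responder.interrupt();     // has no effect on monitor acquisition
            responder.join(joinTimeoutMs);
            alive = responder.isAlive();
        }
        responder.join();              // lock released, so the responder can finish
        return alive;
    }
}
```

        In this sketch the join always runs to its full timeout, which is why the test timeout has to exceed the join timeout, or the join timeout has to be made shorter.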

        Kihwal Lee added a comment -

        The attached patch fixes the two issues listed above.

        Chris Nauroth added a comment -

        +1 for the patch. Good finds, Kihwal.

        Kihwal Lee added a comment -

        Thanks for the review, Chris. I've committed this to the RU branch.


          People

          • Assignee:
            Kihwal Lee
            Reporter:
            Kihwal Lee