Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-415

Unchecked exception thrown inside of BlockReceiver cause some threads hang

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels:
      None

      Description

      One is able to inject all sorts of faults into Hadoop's classes using new fault injection framework (HADOOP-6003).
      I've been injecting unchecked exception (RuntimeException) into BlockReceiver.receivePacket() method before any
      of write() operations (e.g. line 401, 449, 463, 529) and running some of the existing HDFS tests. The injection of unchecked exceptions causes DataXceiver to die silently and without any traces.

      From a debugger run it seems like some threads are being left alive or not notified about the exception.

      1. copy.txt.log
        28 kB
        Konstantin Boudnik
      2. HADOOP-6073.patch
        2 kB
        Konstantin Boudnik
      3. HADOOP-6073.patch
        0.8 kB
        Konstantin Boudnik
      4. x2
        29 kB
        Konstantin Boudnik

        Activity

        Hide
        Konstantin Boudnik added a comment -

        A unit test log file and jstack dump of the VM, running the tests

        Show
        Konstantin Boudnik added a comment - A unit test log file and jstack dump of the VM, running the tests
        Hide
        Raghu Angadi added a comment -


        BlockReceiver.receiveBlock() does not clean up properly in case of runtime exception. It should interrupt the responder inside finally clause rather than just when IOException is caught. DN functions normally for other IO requests.

        Show
        Raghu Angadi added a comment - BlockReceiver.receiveBlock() does not clean up properly in case of runtime exception. It should interrupt the responder inside finally clause rather than just when IOException is caught. DN functions normally for other IO requests.
        Hide
        Konstantin Boudnik added a comment -

        I also suggest to change line 555 to
        } catch (Exception ioe) {

        to make sure that an affected block is properly cleaned.

        Show
        Konstantin Boudnik added a comment - I also suggest to change line 555 to } catch (Exception ioe) { to make sure that an affected block is properly cleaned.
        Hide
        Konstantin Boudnik added a comment -

        M.b. something along these lines (see attachment)? Moving responder.interrupt() invocation to the finally

        {...}

        doesn't make us much good, because responder thread has to be stopped only in case of error.

        Catching Throwable and wrapping it into IOException (as Raghu suggestion) seems to do the trick. I can confirm that I don't see any more of the 'hanging' behavior in the tests with this patch applied.

        Show
        Konstantin Boudnik added a comment - M.b. something along these lines (see attachment)? Moving responder.interrupt() invocation to the finally {...} doesn't make us much good, because responder thread has to be stopped only in case of error. Catching Throwable and wrapping it into IOException (as Raghu suggestion) seems to do the trick. I can confirm that I don't see any more of the 'hanging' behavior in the tests with this patch applied.
        Hide
        Konstantin Boudnik added a comment -

        This version of the patch seems to be less mystique and won't rise the question why we need to catch Throwable in this try-catch block (thanks for the suggestion, Raghu!)

        I can confirm that with the patch in place the original problem is gone and tests are failing gracefully in case of RuntimeException being thrown inside of the BlockReceiver class.

        All tests have passed with FI stuff turned off.

        Show
        Konstantin Boudnik added a comment - This version of the patch seems to be less mystique and won't rise the question why we need to catch Throwable in this try-catch block (thanks for the suggestion, Raghu!) I can confirm that with the patch in place the original problem is gone and tests are failing gracefully in case of RuntimeException being thrown inside of the BlockReceiver class. All tests have passed with FI stuff turned off.
        Hide
        Konstantin Boudnik added a comment -

        The results of test-patch run. The mentioned JavaDoc warning belongs to DFSClient class which isn't a consideration of this patch.

        There appear to be 283 release audit warnings before the patch and 283 release audit warnings after applying the patch.

        -1 overall.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        Show
        Konstantin Boudnik added a comment - The results of test-patch run. The mentioned JavaDoc warning belongs to DFSClient class which isn't a consideration of this patch. There appear to be 283 release audit warnings before the patch and 283 release audit warnings after applying the patch. -1 overall. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 javadoc. The javadoc tool appears to have generated 1 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings.
        Hide
        Raghu Angadi added a comment -

        +1 for the patch.

        Show
        Raghu Angadi added a comment - +1 for the patch.
        Hide
        Konstantin Boudnik added a comment -

        The above mentioned java warnings are unrelated and seems to belong to a totally different class. Thus, the Hudson's comment could be disregarded.

        Show
        Konstantin Boudnik added a comment - The above mentioned java warnings are unrelated and seems to belong to a totally different class. Thus, the Hudson's comment could be disregarded.
        Hide
        Raghu Angadi added a comment -

        I just committed this. Thanks Konstantin.

        Show
        Raghu Angadi added a comment - I just committed this. Thanks Konstantin.
        Hide
        Konstantin Boudnik added a comment -

        Late comment:

        • this patch doesn't have new unit tests for its correctness is verifiable by the existing test TestDirectoryScanner when BlockReceiver's fault throws RuntimeException.

        Test should fail in this case rather than merely hang forever. This method has been used to verify the correctness of the patch.

        Show
        Konstantin Boudnik added a comment - Late comment: this patch doesn't have new unit tests for its correctness is verifiable by the existing test TestDirectoryScanner when BlockReceiver's fault throws RuntimeException. Test should fail in this case rather than merely hang forever. This method has been used to verify the correctness of the patch.

          People

          • Assignee:
            Konstantin Boudnik
            Reporter:
            Konstantin Boudnik
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development