Hadoop HDFS / HDFS-1203

DataNode should sleep before reentering service loop after an exception

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.22.0
    • Fix Version/s: 0.22.0
    • Component/s: datanode
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      When the DN gets an exception in response to a heartbeat, it logs it and continues, but there is no sleep. I've occasionally seen bugs produce a case where heartbeats continuously produce exceptions, and thus the DN floods the NN with bad heartbeats. Adding a 1 second sleep at least throttles the error messages for easier debugging and error isolation.
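
      The idea can be sketched as follows. This is a minimal illustration, not the attached patch: the class ThrottledServiceLoop, the Heartbeat interface, and the run method are hypothetical names invented here, and the real DataNode service loop is more involved. The only point shown is that a failed heartbeat is logged and followed by a short sleep before the loop is re-entered.

```java
import java.io.IOException;

/** Hypothetical stand-in for a DataNode-style service loop; not the actual Hadoop code. */
final class ThrottledServiceLoop {

    /** Hypothetical heartbeat call that may fail with an IOException. */
    interface Heartbeat {
        void send() throws IOException;
    }

    /**
     * Sends up to maxIterations heartbeats. After each failure the error is
     * logged and the thread sleeps for sleepMs before re-entering the loop.
     * Returns the number of failed heartbeats.
     */
    static int run(Heartbeat heartbeat, int maxIterations, long sleepMs) {
        int failures = 0;
        for (int i = 0; i < maxIterations; i++) {
            try {
                heartbeat.send();
            } catch (IOException e) {
                failures++;
                System.err.println("heartbeat failed: " + e.getMessage());
                try {
                    // Throttle: without this sleep the loop retries immediately.
                    Thread.sleep(sleepMs);
                } catch (InterruptedException ie) {
                    // Stop looping if the thread is asked to shut down.
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return failures;
    }
}
```

      With sleepMs set to 1000, a heartbeat that fails on every attempt would produce at most about one error message per second instead of a continuous stream.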

      1. hdfs-1203.txt
        0.6 kB
        Todd Lipcon
      2. hdfs-1203.txt
        0.7 kB
        Todd Lipcon

        Activity

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12446925/hdfs-1203.txt
        against trunk revision 957669.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/197/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/197/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/197/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/197/console

        This message is automatically generated.

        Jakob Homan added a comment -

        +1. This sounds reasonable.

        Jakob Homan added a comment -

        Re-submitting to Hudson to get another run of the tests, just for completeness, since the original run has expired. However, Hudson's not been around a lot lately, and so it may be more expedient for Todd to run the tests locally and report the results here, if he wishes.

        Todd Lipcon added a comment -

        Ran commit-tests locally and they passed (Hudson seems dead).

        Hairong Kuang added a comment -

        It would be better if the sleep interval took the configured heartbeat interval into account, for example by setting it to the minimum of 1000 ms and the heartbeat interval.

        Todd Lipcon added a comment -

        Thanks for the suggestion. Attached a new patch that uses Hairong's logic and also changes the interruption behavior to re-interrupt the current thread (usually a better idea).
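
        As an illustration of the two changes discussed here, a rough sketch might look like the following. It is not the contents of hdfs-1203.txt: the class HeartbeatBackoff, its method names, and the MAX_SLEEP_MS constant are hypothetical, and the heartbeat interval is assumed to be expressed in milliseconds.

```java
/** Hypothetical helper illustrating the revised back-off; names are not from the patch. */
final class HeartbeatBackoff {

    /** Upper bound on the sleep after a failed heartbeat (assumed to be 1000 ms). */
    static final long MAX_SLEEP_MS = 1000L;

    /** Sleep no longer than one second and no longer than the heartbeat interval. */
    static long sleepIntervalMs(long heartbeatIntervalMs) {
        return Math.min(MAX_SLEEP_MS, heartbeatIntervalMs);
    }

    /**
     * Sleeps for the computed interval. If the sleep is interrupted, the
     * thread's interrupt flag is set again so that callers can still observe
     * the interruption, and false is returned.
     */
    static boolean sleepAfterError(long heartbeatIntervalMs) {
        try {
            Thread.sleep(sleepIntervalMs(heartbeatIntervalMs));
            return true;
        } catch (InterruptedException ie) {
            // Thread.sleep clears the flag when it throws; restore it.
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

        Capping the sleep at the heartbeat interval means a DataNode configured with a very short interval never waits longer after an error than it would between normal heartbeats. Restoring the interrupt flag lets the enclosing loop notice a shutdown request instead of silently swallowing it.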

        Todd Lipcon added a comment -

        Looks like this accidentally got incorporated in the commit for HDFS-881 back in September.


          People

          • Assignee: Todd Lipcon
          • Reporter: Todd Lipcon
          • Votes: 0
          • Watchers: 2
