Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-1850

DN should transmit absolute failed volume count rather than increments to the NN

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.22.0
    • Component/s: datanode, namenode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      The API added in HDFS-811 for the DN to report volume failures to the NN is "inc(DN)". However the given sequence of events will result in the NN forgetting about reported failed volumes:

      1. DN loses a volume and reports it
      2. NN restarts
      3. DN re-registers to the new NN

      A more robust interface would be to have the DN report the total number of volume failures to the NN each heart beat (the same way other volume state is transmitted).

      1. hdfs-1850-1.patch
        42 kB
        Eli Collins
      2. hdfs-1850-2.patch
        49 kB
        Eli Collins
      3. hdfs-1850-3.patch
        49 kB
        Eli Collins
      4. hdfs-1850-4.patch
        49 kB
        Eli Collins
      5. hdfs-1850-5.patch
        48 kB
        Eli Collins
      6. hdfs-1850-6.patch
        48 kB
        Eli Collins
      7. hdfs-1850-7.patch
        48 kB
        Eli Collins

        Activity

        Hide
        Todd Lipcon added a comment -

        I don't think changes to DN->NN RPC are considered "incompatible changes" going from eg 0.22 to 0.23, since they don't affect users. Given that we don't purport to allow rolling upgrade between releases like this, it's not particularly an issue, right?

        Show
        Todd Lipcon added a comment - I don't think changes to DN->NN RPC are considered "incompatible changes" going from eg 0.22 to 0.23, since they don't affect users. Given that we don't purport to allow rolling upgrade between releases like this, it's not particularly an issue, right?
        Hide
        Eli Collins added a comment -

        Good point. Flag removed.

        Show
        Eli Collins added a comment - Good point. Flag removed.
        Hide
        Eli Collins added a comment -

        Patch attached.

        1. Modifies FSDataset to track and report volume failures like other capacity etc. Adds the test listed in the description, makes TestDataNodeVolumeFailureReporting more robust.

        2. Renames the volumesFailed metric to volumeFailures to accurately reflect what it's tracking. This doesn't break compatibility because this metric (added in HDFS-811) has not yet been released.

        Show
        Eli Collins added a comment - Patch attached. 1. Modifies FSDataset to track and report volume failures like other capacity etc. Adds the test listed in the description, makes TestDataNodeVolumeFailureReporting more robust. 2. Renames the volumesFailed metric to volumeFailures to accurately reflect what it's tracking. This doesn't break compatibility because this metric (added in HDFS-811 ) has not yet been released.
        Hide
        Eli Collins added a comment -

        Updated patch attached. Breaks out the testing of DFS_DATANODE_FAILED_VOLUMES_TOLERATED to a new test so it's easier to add new tests of this functionality.

        Show
        Eli Collins added a comment - Updated patch attached. Breaks out the testing of DFS_DATANODE_FAILED_VOLUMES_TOLERATED to a new test so it's easier to add new tests of this functionality.
        Hide
        Todd Lipcon added a comment -

        Some small comments:

        • getNumFailedVols - probably best not to abbreviate vols, since most of the time we use the full word: getNumFailedVolumes
        • in errorReport(...) we seem to log twice in the case that there is a disk error. Maybe the first LOG.info should get moved inside the if statement, and "msg" should be included in the warns?

        Also, should make this Patch Available to run through Hudson.

        Show
        Todd Lipcon added a comment - Some small comments: getNumFailedVols - probably best not to abbreviate vols, since most of the time we use the full word: getNumFailedVolumes in errorReport(...) we seem to log twice in the case that there is a disk error. Maybe the first LOG.info should get moved inside the if statement, and "msg" should be included in the warns? Also, should make this Patch Available to run through Hudson.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12477428/hdfs-1850-2.patch
        against trunk revision 1097252.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 39 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these core unit tests:
        org.apache.hadoop.hdfs.TestFileConcurrentReader

        +1 contrib tests. The patch passed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/425//testReport/
        Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/425//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/425//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12477428/hdfs-1850-2.patch against trunk revision 1097252. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 39 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these core unit tests: org.apache.hadoop.hdfs.TestFileConcurrentReader +1 contrib tests. The patch passed contrib unit tests. +1 system test framework. The patch passed system test framework compile. Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/425//testReport/ Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/425//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/425//console This message is automatically generated.
        Hide
        Eli Collins added a comment -

        Thanks Todd. Patch attached addresses your comments (renames and ensures each call to errorReport gets only one message).

        Show
        Eli Collins added a comment - Thanks Todd. Patch attached addresses your comments (renames and ensures each call to errorReport gets only one message).
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12477622/hdfs-1850-3.patch
        against trunk revision 1097329.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 39 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these core unit tests:
        org.apache.hadoop.hdfs.TestFileAppend4
        org.apache.hadoop.hdfs.TestLargeBlock
        org.apache.hadoop.hdfs.TestWriteConfigurationToDFS

        +1 contrib tests. The patch passed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/428//testReport/
        Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/428//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/428//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12477622/hdfs-1850-3.patch against trunk revision 1097329. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 39 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these core unit tests: org.apache.hadoop.hdfs.TestFileAppend4 org.apache.hadoop.hdfs.TestLargeBlock org.apache.hadoop.hdfs.TestWriteConfigurationToDFS +1 contrib tests. The patch passed contrib unit tests. +1 system test framework. The patch passed system test framework compile. Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/428//testReport/ Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/428//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/428//console This message is automatically generated.
        Hide
        Todd Lipcon added a comment -

        hm, I still see getNumFailedVols in hdfs-1850-3.patch.

        Show
        Todd Lipcon added a comment - hm, I still see getNumFailedVols in hdfs-1850-3.patch.
        Hide
        Eli Collins added a comment -

        Oops, just renamed the variable, not the method as well. Fixed. Patch attached.

        Show
        Eli Collins added a comment - Oops, just renamed the variable, not the method as well. Fixed. Patch attached.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12477703/hdfs-1850-4.patch
        against trunk revision 1097329.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 39 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these core unit tests:
        org.apache.hadoop.hdfs.TestFileAppend4
        org.apache.hadoop.hdfs.TestLargeBlock
        org.apache.hadoop.hdfs.TestWriteConfigurationToDFS

        +1 contrib tests. The patch passed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/432//testReport/
        Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/432//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/432//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12477703/hdfs-1850-4.patch against trunk revision 1097329. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 39 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these core unit tests: org.apache.hadoop.hdfs.TestFileAppend4 org.apache.hadoop.hdfs.TestLargeBlock org.apache.hadoop.hdfs.TestWriteConfigurationToDFS +1 contrib tests. The patch passed contrib unit tests. +1 system test framework. The patch passed system test framework compile. Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/432//testReport/ Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/432//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/432//console This message is automatically generated.
        Hide
        Eli Collins added a comment -

        Patch attached, rebased on trunk.

        Show
        Eli Collins added a comment - Patch attached, rebased on trunk.
        Hide
        Todd Lipcon added a comment -

        hdfs-1850-5.patch seems to have some conflict markers stuck in it (line 206 of the patch)

        Show
        Todd Lipcon added a comment - hdfs-1850-5.patch seems to have some conflict markers stuck in it (line 206 of the patch)
        Hide
        Eli Collins added a comment -

        Arg. Right patch this time.

        Show
        Eli Collins added a comment - Arg. Right patch this time.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12477990/hdfs-1850-6.patch
        against trunk revision 1098781.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 38 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these core unit tests:
        org.apache.hadoop.hdfs.server.namenode.TestBackupNode
        org.apache.hadoop.hdfs.TestDatanodeBlockScanner
        org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
        org.apache.hadoop.hdfs.TestFileConcurrentReader

        +1 contrib tests. The patch passed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/443//testReport/
        Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/443//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/443//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12477990/hdfs-1850-6.patch against trunk revision 1098781. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 38 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these core unit tests: org.apache.hadoop.hdfs.server.namenode.TestBackupNode org.apache.hadoop.hdfs.TestDatanodeBlockScanner org.apache.hadoop.hdfs.TestDFSStorageStateRecovery org.apache.hadoop.hdfs.TestFileConcurrentReader +1 contrib tests. The patch passed contrib unit tests. +1 system test framework. The patch passed system test framework compile. Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/443//testReport/ Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/443//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/443//console This message is automatically generated.
        Hide
        Todd Lipcon added a comment -

        Is this TestDatanodeBlockScanner failure related to the patch? It has a strange error message:
        java.util.concurrent.TimeoutException: Timed out waiting for corrupt replicas. Waiting for 1, but only found 1

        Show
        Todd Lipcon added a comment - Is this TestDatanodeBlockScanner failure related to the patch? It has a strange error message: java.util.concurrent.TimeoutException: Timed out waiting for corrupt replicas. Waiting for 1, but only found 1
        Hide
        Eli Collins added a comment -

        TestDatanodeBlockScanner passes for me in eclipse and when looped from the command line. I think this is related to an earlier change. The error message indicates we need to bump the number of attempts (ie it did see 1 corrupt replica after 20 attempts, but it also bails on the 20th attempt).

        java.util.concurrent.TimeoutException: Timed out waiting for corrupt replicas. Waiting for 1, but only found 1

        I'll bump the # attempts to 50 so we're more tolerant.

        Updated patch attached.

        Show
        Eli Collins added a comment - TestDatanodeBlockScanner passes for me in eclipse and when looped from the command line. I think this is related to an earlier change. The error message indicates we need to bump the number of attempts (ie it did see 1 corrupt replica after 20 attempts, but it also bails on the 20th attempt). java.util.concurrent.TimeoutException: Timed out waiting for corrupt replicas. Waiting for 1, but only found 1 I'll bump the # attempts to 50 so we're more tolerant. Updated patch attached.
        Hide
        Todd Lipcon added a comment -

        +1

        Show
        Todd Lipcon added a comment - +1
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12478000/hdfs-1850-7.patch
        against trunk revision 1098781.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 38 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these core unit tests:
        org.apache.hadoop.hdfs.server.namenode.TestBackupNode
        org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
        org.apache.hadoop.hdfs.TestFileConcurrentReader

        +1 contrib tests. The patch passed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/444//testReport/
        Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/444//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/444//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12478000/hdfs-1850-7.patch against trunk revision 1098781. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 38 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these core unit tests: org.apache.hadoop.hdfs.server.namenode.TestBackupNode org.apache.hadoop.hdfs.TestDFSStorageStateRecovery org.apache.hadoop.hdfs.TestFileConcurrentReader +1 contrib tests. The patch passed contrib unit tests. +1 system test framework. The patch passed system test framework compile. Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/444//testReport/ Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/444//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/444//console This message is automatically generated.
        Hide
        Eli Collins added a comment -

        Test failures are unrelated. I've committed this to trunk and branch 22. Thanks Todd!

        Show
        Eli Collins added a comment - Test failures are unrelated. I've committed this to trunk and branch 22. Thanks Todd!
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-22-branch #41 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-22-branch/41/)

        Show
        Hudson added a comment - Integrated in Hadoop-Hdfs-22-branch #41 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-22-branch/41/ )
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #673 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/673/)

        Show
        Hudson added a comment - Integrated in Hadoop-Hdfs-trunk #673 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/673/ )

          People

          • Assignee:
            Eli Collins
            Reporter:
            Eli Collins
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development