Hadoop HDFS / HDFS-1850

DN should transmit absolute failed volume count rather than increments to the NN

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.22.0
    • Component/s: datanode, namenode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      The API added in HDFS-811 for the DN to report volume failures to the NN is "inc(DN)". However, the following sequence of events will result in the NN forgetting about reported failed volumes:

      1. DN loses a volume and reports it
      2. NN restarts
      3. DN re-registers to the new NN

      A more robust interface would be to have the DN report the total number of volume failures to the NN on each heartbeat (the same way other volume state is transmitted).
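
      The difference between the two reporting styles can be illustrated with a minimal sketch. The class and method names below (SketchNameNode, SketchDataNode, incVolumeFailure, heartbeat) are hypothetical and are not the actual HDFS classes or the DatanodeProtocol; the NN restart is modeled simply by creating a fresh object with empty state.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical NN-side bookkeeping; an illustration, not the real NameNode. */
class SketchNameNode {
    // Per-DN failed volume count, held only in memory and lost on restart.
    private final Map<String, Integer> failedVolumes = new HashMap<>();

    /** Incremental style: the DN says "one more volume failed". */
    void incVolumeFailure(String dnId) {
        failedVolumes.merge(dnId, 1, Integer::sum);
    }

    /** Absolute style: each heartbeat carries the DN's current total. */
    void heartbeat(String dnId, int totalFailedVolumes) {
        failedVolumes.put(dnId, totalFailedVolumes);
    }

    int getFailedVolumes(String dnId) {
        return failedVolumes.getOrDefault(dnId, 0);
    }
}

/** Hypothetical DN that remembers how many of its volumes have failed. */
class SketchDataNode {
    final String id;
    private int failedVolumes = 0;

    SketchDataNode(String id) {
        this.id = id;
    }

    void loseVolume() {
        failedVolumes++;
    }

    int getNumFailedVolumes() {
        return failedVolumes;
    }
}

class HeartbeatCountDemo {
    public static void main(String[] args) {
        SketchDataNode dn = new SketchDataNode("dn1");
        SketchNameNode nn = new SketchNameNode();

        // 1. DN loses a volume and reports it incrementally.
        dn.loseVolume();
        nn.incVolumeFailure(dn.id);

        // 2. NN restarts: its in-memory state is gone.
        nn = new SketchNameNode();

        // 3. DN re-registers; with increments alone the failure is forgotten.
        System.out.println("incremental, after restart: " + nn.getFailedVolumes(dn.id));

        // With absolute counts, the next heartbeat restores the correct value.
        nn.heartbeat(dn.id, dn.getNumFailedVolumes());
        System.out.println("absolute, after heartbeat: " + nn.getFailedVolumes(dn.id));
    }
}
```

      In this sketch the absolute count is idempotent: resending the same heartbeat never changes the NN's view, which is why it survives restarts and re-registration.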

      1. hdfs-1850-1.patch
        42 kB
        Eli Collins
      2. hdfs-1850-2.patch
        49 kB
        Eli Collins
      3. hdfs-1850-3.patch
        49 kB
        Eli Collins
      4. hdfs-1850-4.patch
        49 kB
        Eli Collins
      5. hdfs-1850-5.patch
        48 kB
        Eli Collins
      6. hdfs-1850-6.patch
        48 kB
        Eli Collins
      7. hdfs-1850-7.patch
        48 kB
        Eli Collins

        Activity

        Eli Collins created issue -
        Todd Lipcon added a comment -

        I don't think changes to DN->NN RPC are considered "incompatible changes" going from e.g. 0.22 to 0.23, since they don't affect users. Given that we don't purport to allow rolling upgrade between releases like this, it's not particularly an issue, right?

        Eli Collins added a comment -

        Good point. Flag removed.

        Eli Collins made changes -
        Hadoop Flags: [Incompatible change] removed
        Eli Collins added a comment -

        Patch attached.

        1. Modifies FSDataset to track and report volume failures the same way it does capacity and other volume state. Adds a test for the scenario listed in the description and makes TestDataNodeVolumeFailureReporting more robust.

        2. Renames the volumesFailed metric to volumeFailures to accurately reflect what it's tracking. This doesn't break compatibility because this metric (added in HDFS-811) has not yet been released.

        Eli Collins made changes -
        Attachment hdfs-1850-1.patch [ 12477358 ]
        Eli Collins made changes -
        Fix Version/s 0.22.0 [ 12314241 ]
        Eli Collins added a comment -

        Updated patch attached. Breaks out the testing of DFS_DATANODE_FAILED_VOLUMES_TOLERATED to a new test so it's easier to add new tests of this functionality.

        Eli Collins made changes -
        Attachment hdfs-1850-2.patch [ 12477428 ]
        Todd Lipcon added a comment -

        Some small comments:

        • getNumFailedVols - probably best not to abbreviate vols, since most of the time we use the full word: getNumFailedVolumes
        • in errorReport(...) we seem to log twice in the case that there is a disk error. Maybe the first LOG.info should get moved inside the if statement, and "msg" should be included in the warns?
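
        The second suggestion could look roughly like the sketch below. The patch itself is not shown in this thread, so everything here is an assumption: the class name, the DISK_ERROR constant, the parameter list, and the in-memory list standing in for the real logger are all hypothetical. The only point it illustrates is that each call emits exactly one message, and that the message includes "msg".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the NN's error-report handling. */
class ErrorReportSketch {
    // Assumed error code for a disk failure; the real constant is not shown in the thread.
    static final int DISK_ERROR = 1;

    // Stand-in for the real logger so the emitted messages can be inspected.
    final List<String> log = new ArrayList<>();

    void errorReport(String dnName, int errorCode, String msg) {
        if (errorCode == DISK_ERROR) {
            // Disk errors are logged once, as a warning that carries the DN's message.
            log.add("WARN: disk error on " + dnName + ": " + msg);
        } else {
            // Everything else gets the single informational message.
            log.add("INFO: error report from " + dnName + ": " + msg);
        }
    }
}
```

        Moving the informational message into the else branch is what removes the duplicate line in the disk-error case.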

        Also, should make this Patch Available to run through Hudson.

        Eli Collins made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12477428/hdfs-1850-2.patch
        against trunk revision 1097252.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 39 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these core unit tests:
        org.apache.hadoop.hdfs.TestFileConcurrentReader

        +1 contrib tests. The patch passed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/425//testReport/
        Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/425//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/425//console

        This message is automatically generated.

        Eli Collins added a comment -

        Thanks Todd. Patch attached addresses your comments (renames and ensures each call to errorReport gets only one message).

        Eli Collins made changes -
        Attachment hdfs-1850-3.patch [ 12477622 ]
        Eli Collins made changes -
        Fix Version/s 0.23.0 [ 12315571 ]
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12477622/hdfs-1850-3.patch
        against trunk revision 1097329.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 39 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these core unit tests:
        org.apache.hadoop.hdfs.TestFileAppend4
        org.apache.hadoop.hdfs.TestLargeBlock
        org.apache.hadoop.hdfs.TestWriteConfigurationToDFS

        +1 contrib tests. The patch passed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/428//testReport/
        Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/428//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/428//console

        This message is automatically generated.

        Todd Lipcon added a comment -

        hm, I still see getNumFailedVols in hdfs-1850-3.patch.

        Eli Collins added a comment -

        Oops, just renamed the variable, not the method as well. Fixed. Patch attached.

        Eli Collins made changes -
        Attachment hdfs-1850-4.patch [ 12477703 ]
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12477703/hdfs-1850-4.patch
        against trunk revision 1097329.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 39 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these core unit tests:
        org.apache.hadoop.hdfs.TestFileAppend4
        org.apache.hadoop.hdfs.TestLargeBlock
        org.apache.hadoop.hdfs.TestWriteConfigurationToDFS

        +1 contrib tests. The patch passed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/432//testReport/
        Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/432//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/432//console

        This message is automatically generated.

        Eli Collins added a comment -

        Patch attached, rebased on trunk.

        Eli Collins made changes -
        Attachment hdfs-1850-5.patch [ 12477863 ]
        Eli Collins made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Eli Collins made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Todd Lipcon added a comment -

        hdfs-1850-5.patch seems to have some conflict markers stuck in it (line 206 of the patch)

        Eli Collins added a comment -

        Arg. Right patch this time.

        Eli Collins made changes -
        Attachment hdfs-1850-6.patch [ 12477990 ]
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12477990/hdfs-1850-6.patch
        against trunk revision 1098781.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 38 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these core unit tests:
        org.apache.hadoop.hdfs.server.namenode.TestBackupNode
        org.apache.hadoop.hdfs.TestDatanodeBlockScanner
        org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
        org.apache.hadoop.hdfs.TestFileConcurrentReader

        +1 contrib tests. The patch passed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/443//testReport/
        Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/443//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/443//console

        This message is automatically generated.

        Todd Lipcon added a comment -

        Is this TestDatanodeBlockScanner failure related to the patch? It has a strange error message:
        java.util.concurrent.TimeoutException: Timed out waiting for corrupt replicas. Waiting for 1, but only found 1

        Eli Collins added a comment -

        TestDatanodeBlockScanner passes for me in Eclipse and when looped from the command line. I think this is related to an earlier change. The error message indicates we need to bump the number of attempts (i.e. it did see 1 corrupt replica after 20 attempts, but it also bails on the 20th attempt).

        java.util.concurrent.TimeoutException: Timed out waiting for corrupt replicas. Waiting for 1, but only found 1

        I'll bump the number of attempts to 50 so we're more tolerant.

        Updated patch attached.
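
        The "Waiting for 1, but only found 1" message is what a polling loop produces when it gives up on the same attempt that finally sees the expected value. The test helper itself is not shown in this thread, so the sketch below is only an assumption about its general shape, with hypothetical names (WaitSketch, waitForCount). It is also a different remedy from the one in the patch, which raises the attempt limit to 50: here the loop re-checks the condition after the last poll and only then decides whether to time out.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.IntSupplier;

/** Hypothetical polling helper; not the helper used by TestDatanodeBlockScanner. */
class WaitSketch {
    /**
     * Polls current up to maxAttempts times and returns the observed count once it
     * reaches expected. Times out only if the final poll is still below expected.
     */
    static int waitForCount(IntSupplier current, int expected, int maxAttempts, long sleepMillis)
            throws TimeoutException, InterruptedException {
        int found = current.getAsInt(); // poll number 1
        for (int attempt = 1; found < expected && attempt < maxAttempts; attempt++) {
            Thread.sleep(sleepMillis);
            found = current.getAsInt();
        }
        // The decision is based on the value from the last poll, so a success on
        // the final attempt is returned rather than reported as a timeout.
        if (found < expected) {
            throw new TimeoutException("Timed out waiting for corrupt replicas."
                + " Waiting for " + expected + ", but only found " + found);
        }
        return found;
    }
}
```

        Raising the limit, as the patch does, makes the test more tolerant of slow runs; checking after the last poll only removes the contradictory failure on the boundary attempt.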

        Eli Collins made changes -
        Attachment hdfs-1850-7.patch [ 12478000 ]
        Eli Collins made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Eli Collins made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Todd Lipcon added a comment -

        +1

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12478000/hdfs-1850-7.patch
        against trunk revision 1098781.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 38 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these core unit tests:
        org.apache.hadoop.hdfs.server.namenode.TestBackupNode
        org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
        org.apache.hadoop.hdfs.TestFileConcurrentReader

        +1 contrib tests. The patch passed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/444//testReport/
        Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/444//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/444//console

        This message is automatically generated.

        Eli Collins added a comment -

        Test failures are unrelated. I've committed this to trunk and branch 22. Thanks Todd!

        Eli Collins made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Hadoop Flags [Reviewed]
        Resolution Fixed [ 1 ]
        Eli Collins made changes -
        Description: removed the final sentence, "This will likely be an incompatible change since it requires changing the Datanode protocol." The rest of the description is unchanged.
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-22-branch #41 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-22-branch/41/)

        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #673 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/673/)

        Konstantin Shvachko made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Transition                    Time In Source Status   Execution Times   Last Executer         Last Execution Date
        Patch Available -> Open       4d 20h 35m              2                 Eli Collins           03/May/11 00:15
        Open -> Patch Available       8d 1h 2m                3                 Eli Collins           03/May/11 00:15
        Patch Available -> Resolved   2h 19m                  1                 Eli Collins           03/May/11 02:34
        Resolved -> Closed            223d 4h 44m             1                 Konstantin Shvachko   12/Dec/11 06:19

          People

          • Assignee:
            Eli Collins
            Reporter:
            Eli Collins
          • Votes:
            0
            Watchers:
            1