Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-811

Add metrics, failure reporting and additional tests for HDFS-457

    Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 0.22.0
    • Fix Version/s: 0.22.0
    • Component/s: test
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      HDFS-457 introduced a improvement which allows datanode to continue if a volume for replica storage fails. Previously a datanode resigned if any volume failed.

      Description of HDFS-457

      Current implementation shuts DataNode down completely when one of the configured volumes of the storage fails.
      This is rather wasteful behavior because it decreases utilization (good storage becomes unavailable) and imposes extra load on the system (replication of the blocks from the good volumes). These problems will become even more prominent when we move to mixed (heterogeneous) clusters with many more volumes per Data Node.

      I suggest following additional tests for this improvement.

      #1 Test successive volume failures ( Minimum 4 volumes )
      #2 Test if each volume failure reports reduction in available DFS space and remaining space.
      #3 Test if failure of all volumes on a data nodes leads to the data node failure.
      #4 Test if correcting failed storage disk brings updates and increments available DFS space.

      1. hdfs-811-1.patch
        23 kB
        Eli Collins
      2. hdfs-811-2.patch
        26 kB
        Eli Collins
      3. hdfs-811-3.patch
        34 kB
        Eli Collins
      4. hdfs-811-4.patch
        34 kB
        Eli Collins
      5. hdfs-811-5.patch
        37 kB
        Eli Collins
      6. hdfs-811-6.patch
        37 kB
        Eli Collins

        Issue Links

          Activity

          Hide
          dhruba borthakur added a comment -

          another useful thing for a system adminstrator would be:

          #5 Test that the WebUI tells the system adminstrator that one volume of a datanode is unavailable.

          Thsi is needed so that the administrator can monitor (and fix) failed disks in the cluster.

          Show
          dhruba borthakur added a comment - another useful thing for a system adminstrator would be: #5 Test that the WebUI tells the system adminstrator that one volume of a datanode is unavailable. Thsi is needed so that the administrator can monitor (and fix) failed disks in the cluster.
          Hide
          Ravi Phulari added a comment -

          Dhruba thanks for comment,

          #5 Test that the WebUI tells the system adminstrator that one volume of a datanode is unavailable.

          AFIK currently WebUI does not support functionality of explicitly indicating volume failure in data node, so we can not include this test case.

          This is good to have functionality, opening new Jira for it.

          Show
          Ravi Phulari added a comment - Dhruba thanks for comment, #5 Test that the WebUI tells the system adminstrator that one volume of a datanode is unavailable. AFIK currently WebUI does not support functionality of explicitly indicating volume failure in data node, so we can not include this test case. This is good to have functionality, opening new Jira for it.
          Hide
          Boris Shkolnik added a comment -

          I think there is already JIRA on this - HDFS-556

          Show
          Boris Shkolnik added a comment - I think there is already JIRA on this - HDFS-556
          Hide
          Eli Collins added a comment -

          Patch attached.

          • Adds a DN metric to track volume failures
          • Adds DN descriptor state to track failed volumes (used for HDFS-556)
          • Refactors TestDiskError and adds test coverage for individual volume failure behavior, that capacity and liveliness is adjust accordingly, etc.
          Show
          Eli Collins added a comment - Patch attached. Adds a DN metric to track volume failures Adds DN descriptor state to track failed volumes (used for HDFS-556 ) Refactors TestDiskError and adds test coverage for individual volume failure behavior, that capacity and liveliness is adjust accordingly, etc.
          Hide
          Eli Collins added a comment -

          The automated test covers the following but I also verified by hand on a local cluster that:

          1. Making a datadir inaccessible causes the volume to be blacklisted, the capacities and failed volume count adjusted accordingly on the web UI stats (using HDFS-556).
          2. Making the datadir accessible again does not cause it to be used until the datanode is restarted, at which point the failed volume count and capacities are back to their original values
          3. The above but made all datadirs inaccessible, confirmed the datanode is not used and is eventually is reported as dead.
          Show
          Eli Collins added a comment - The automated test covers the following but I also verified by hand on a local cluster that: Making a datadir inaccessible causes the volume to be blacklisted, the capacities and failed volume count adjusted accordingly on the web UI stats (using HDFS-556 ). Making the datadir accessible again does not cause it to be used until the datanode is restarted, at which point the failed volume count and capacities are back to their original values The above but made all datadirs inaccessible, confirmed the datanode is not used and is eventually is reported as dead.
          Hide
          Eli Collins added a comment -

          This blocks HDFS-1161 and we want that in 21 so updating this fixVersion as well.

          Show
          Eli Collins added a comment - This blocks HDFS-1161 and we want that in 21 so updating this fixVersion as well.
          Hide
          Ravi Phulari added a comment -

          Reviewed, Patch looks good.
          +1

          Show
          Ravi Phulari added a comment - Reviewed, Patch looks good. +1
          Hide
          Eli Collins added a comment -

          Patch attached. Adds the tests from HDFS-1161 and incorporates Dhruba's suggestions in that jira.

          Show
          Eli Collins added a comment - Patch attached. Adds the tests from HDFS-1161 and incorporates Dhruba's suggestions in that jira.
          Hide
          Eli Collins added a comment -

          Thanks for taking a look Ravi. Konst asked that the patch for HDFS-1161 apply against trunk instead of depending on this one so I moved the test from that jira to this jira, and also made the tests waits indefinitely for the required failures instead of just a computed amount of time which Dhruba suggested in HDFS-1161.

          Show
          Eli Collins added a comment - Thanks for taking a look Ravi. Konst asked that the patch for HDFS-1161 apply against trunk instead of depending on this one so I moved the test from that jira to this jira, and also made the tests waits indefinitely for the required failures instead of just a computed amount of time which Dhruba suggested in HDFS-1161 .
          Hide
          Konstantin Boudnik added a comment -

          A couple of comments:

          • JavaDoc for the parameter is missing
            + public void updateRegInfo(DatanodeID nodeReg) {
          • same for exception throw in
            + synchronized public void incVolumeFailure(DatanodeID nodeID)
            + throws IOException {
          • Some of new tests have JavaDocs and some aren't. It's better be consistent.
          • it usually a good style to have meaningful messages for the assertions.

          Seems to be good otherwise.

          Show
          Konstantin Boudnik added a comment - A couple of comments: JavaDoc for the parameter is missing + public void updateRegInfo(DatanodeID nodeReg) { same for exception throw in + synchronized public void incVolumeFailure(DatanodeID nodeID) + throws IOException { Some of new tests have JavaDocs and some aren't. It's better be consistent. it usually a good style to have meaningful messages for the assertions. Seems to be good otherwise.
          Hide
          Eli Collins added a comment -

          Thanks for taking a look Kos. Attached patch addresses your feedback. I also moved the new tests from TestDiskError to TestDataNodeVolumeFailure2 since they were getting big and to be consistent with TestDataNodeVolumeFailure, which I modified to be a junit4 style test. Doing a full test-core run with HDFS-1161 applied.

          Show
          Eli Collins added a comment - Thanks for taking a look Kos. Attached patch addresses your feedback. I also moved the new tests from TestDiskError to TestDataNodeVolumeFailure2 since they were getting big and to be consistent with TestDataNodeVolumeFailure, which I modified to be a junit4 style test. Doing a full test-core run with HDFS-1161 applied.
          Hide
          Eli Collins added a comment -

          Updating the summary to reflect that the change also adds volume failure DN metrics and plumbing to report and reset failure counts in the NN.

          Show
          Eli Collins added a comment - Updating the summary to reflect that the change also adds volume failure DN metrics and plumbing to report and reset failure counts in the NN.
          Hide
          Eli Collins added a comment -

          Patch attached. Merges with trunk.

          Show
          Eli Collins added a comment - Patch attached. Merges with trunk.
          Hide
          Konstantin Boudnik added a comment -

          patch is no longer applicable. Could you please refit the patch and run it through test-patch if possible?
          +1 'cause it looks good otherwise.

          Show
          Konstantin Boudnik added a comment - patch is no longer applicable. Could you please refit the patch and run it through test-patch if possible? +1 'cause it looks good otherwise.
          Hide
          Eli Collins added a comment -

          Hey Cos, thanks for taking a look! hdfs-811-4.patch applies on trunk for me. Kicked hudson to test it. You probably tried the previous patch. For some reason jira didn't put that one on top even though it's the latest patch.

          Show
          Eli Collins added a comment - Hey Cos, thanks for taking a look! hdfs-811-4.patch applies on trunk for me. Kicked hudson to test it. You probably tried the previous patch. For some reason jira didn't put that one on top even though it's the latest patch.
          Hide
          Jakob Homan added a comment -

          Before this goes in, there are still quite a few undescribed asserts and a fair amount of unnecessary white space changes that should be removed. Also, TestDataNodeVolumeFailure2? Oy. It that the sequel? Would a more descriptive name be reasonable?

          Show
          Jakob Homan added a comment - Before this goes in, there are still quite a few undescribed asserts and a fair amount of unnecessary white space changes that should be removed. Also, TestDataNodeVolumeFailure2? Oy. It that the sequel? Would a more descriptive name be reasonable?
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12445732/hdfs-811-4.patch
          against trunk revision 952861.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 19 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/403/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/403/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/403/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/403/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12445732/hdfs-811-4.patch against trunk revision 952861. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 19 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/403/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/403/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/403/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/403/console This message is automatically generated.
          Hide
          Eli Collins added a comment -

          Thanks for taking a look Jakob. The latest patch mostly adds new code. Two big blocks of test code got shifted left due to removing the surrounding try blocks, but that's not really a white space change. I moved the whitespace changes in an earlier patch over to HDFS-1160, lemme know if I missed any.

          Do you feel it's necessary to add descriptions like "Ensure DN 0 is up" to asserts like the following? I usually only add descriptions where the assert itself isn't clear. Happy to make the change for any asserts you find unclear (always more clear to the author than readers).

          assertTrue(DataNode.isDatanodeUp(dns.get(0)));
          

          The 2nd test class just adds additional DN volume failure test, but needs different startup and teardown methods which is why it's in a different class, so I used TestLeaseRecovery and TestFileAppend as examples which just sequentially number the test classes. Both test classes cover various aspects of volume failure so hard to come up with a name that's not generic. Open to suggestions of course.

          Btw the previous test run was bogus (181 tests failed due to hudson issues).

          Show
          Eli Collins added a comment - Thanks for taking a look Jakob. The latest patch mostly adds new code. Two big blocks of test code got shifted left due to removing the surrounding try blocks, but that's not really a white space change. I moved the whitespace changes in an earlier patch over to HDFS-1160 , lemme know if I missed any. Do you feel it's necessary to add descriptions like "Ensure DN 0 is up" to asserts like the following? I usually only add descriptions where the assert itself isn't clear. Happy to make the change for any asserts you find unclear (always more clear to the author than readers). assertTrue(DataNode.isDatanodeUp(dns.get(0))); The 2nd test class just adds additional DN volume failure test, but needs different startup and teardown methods which is why it's in a different class, so I used TestLeaseRecovery and TestFileAppend as examples which just sequentially number the test classes. Both test classes cover various aspects of volume failure so hard to come up with a name that's not generic. Open to suggestions of course. Btw the previous test run was bogus (181 tests failed due to hudson issues).
          Hide
          Jakob Homan added a comment -

          The test failure lines are most useful when a test fails and you're looking through the output in the test directory, or on Hudson, and may not have the context of the failure, so I'd prefer more than less, but if you're ok with the current level I guess it's fine. I had run a search and saw the TestLeaseRecovery and TestFileAppend files, as well. Don't you think those are poor choices for explaining what's going on in those tests? If there is really no better description, then I guess we'll just have to live with it. As for white-space changes, I'm referring to lines such as 35, 124, 159, 185, 206, 207, 222, 227, 252, 260, etc.

          Show
          Jakob Homan added a comment - The test failure lines are most useful when a test fails and you're looking through the output in the test directory, or on Hudson, and may not have the context of the failure, so I'd prefer more than less, but if you're ok with the current level I guess it's fine. I had run a search and saw the TestLeaseRecovery and TestFileAppend files, as well. Don't you think those are poor choices for explaining what's going on in those tests? If there is really no better description, then I guess we'll just have to live with it. As for white-space changes, I'm referring to lines such as 35, 124, 159, 185, 206, 207, 222, 227, 252, 260, etc.
          Hide
          Konstantin Boudnik added a comment -

          +1 on Jacob point. It isn't about how obvious the assertion is. It is about time one needs to spend on a failure analysis. If there's no message one needs to pull up the source code, find the line, read the context, etc. With meaningful diagnostics message it is usually enough to just glance at the log file.

          Show
          Konstantin Boudnik added a comment - +1 on Jacob point. It isn't about how obvious the assertion is. It is about time one needs to spend on a failure analysis. If there's no message one needs to pull up the source code, find the line, read the context, etc. With meaningful diagnostics message it is usually enough to just glance at the log file.
          Hide
          Eli Collins added a comment -

          I'll update the patch to add descriptions to the asserts and remove the white-space you listed (arg!). I can't think of a better description for the class, I would put the tests in TestDataNodeVolumeFailure if they could easily share the setup, teardown, etc. but open to renaming if you've got a suggestion.

          Show
          Eli Collins added a comment - I'll update the patch to add descriptions to the asserts and remove the white-space you listed (arg!). I can't think of a better description for the class, I would put the tests in TestDataNodeVolumeFailure if they could easily share the setup, teardown, etc. but open to renaming if you've got a suggestion.
          Hide
          Jakob Homan added a comment -

          Also, it looks like the new test takes about 75 sec to run. Since it's going into src/test/hdfs rather than src/test/unit (which is still a loney, vacant neighborhood...), this is ok.

          Show
          Jakob Homan added a comment - Also, it looks like the new test takes about 75 sec to run. Since it's going into src/test/hdfs rather than src/test/unit (which is still a loney, vacant neighborhood...), this is ok.
          Hide
          Jakob Homan added a comment -

          Canceling patch pending update...

          Show
          Jakob Homan added a comment - Canceling patch pending update...
          Hide
          Eli Collins added a comment -

          Patch attached.

          Merges in trunk, adds descriptions to the asserts, and gives the new test class a better name (TestDataNodeVolumeFailureReporting).

          Show
          Eli Collins added a comment - Patch attached. Merges in trunk, adds descriptions to the asserts, and gives the new test class a better name (TestDataNodeVolumeFailureReporting).
          Hide
          Eli Collins added a comment -
               [exec] 
               [exec] -1 overall.  
               [exec] 
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec] 
               [exec]     +1 tests included.  The patch appears to include 19 new or modified tests.
               [exec] 
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec] 
               [exec]     -1 javac.  The applied patch generated 22 javac compiler warnings (more than the trunk's current 21 warnings).
               [exec] 
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
               [exec] 
               [exec]     -1 release audit.  The applied patch generated 103 release audit warnings (more than the trunk's current 1 warnings).
               [exec] 
               [exec]     +1 system test framework.  The patch passed system test framework compile.
               [exec] 
          

          The release audit warning is bogus, looking at patchReleaseAuditWarnings.txt it's warning about files that are not modified by this patch (eg xml files not having an AL header).

          Show
          Eli Collins added a comment - [exec] [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 19 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] -1 javac. The applied patch generated 22 javac compiler warnings (more than the trunk's current 21 warnings). [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] -1 release audit. The applied patch generated 103 release audit warnings (more than the trunk's current 1 warnings). [exec] [exec] +1 system test framework. The patch passed system test framework compile. [exec] The release audit warning is bogus, looking at patchReleaseAuditWarnings.txt it's warning about files that are not modified by this patch (eg xml files not having an AL header).
          Hide
          Eli Collins added a comment -

          Forgot to mention I don't see any javac warnings with the patch either.

          Show
          Eli Collins added a comment - Forgot to mention I don't see any javac warnings with the patch either.
          Hide
          Konstantin Boudnik added a comment -

          +1 patch looks good. I like that test were changed to be JUnit4 too.

          It seems that HDFS patch queue isn't connected to the test-patch on Hudson. Thus you might want to run the validation manually and post here.

          Show
          Konstantin Boudnik added a comment - +1 patch looks good. I like that test were changed to be JUnit4 too. It seems that HDFS patch queue isn't connected to the test-patch on Hudson. Thus you might want to run the validation manually and post here.
          Hide
          Jakob Homan added a comment -

          Forgot to mention I don't see any javac warnings with the patch either.

          The testpatch javac warning flag is probably from the call to the deprecated MiniDFSCluster constructor on line 411 of the patch...

          Show
          Jakob Homan added a comment - Forgot to mention I don't see any javac warnings with the patch either. The testpatch javac warning flag is probably from the call to the deprecated MiniDFSCluster constructor on line 411 of the patch...
          Hide
          Eli Collins added a comment -

          Thanks Cos and Jakob. New patch fixes the warning. I'll commit this.

              [exec] 
               [exec] -1 overall.  
               [exec] 
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec] 
               [exec]     +1 tests included.  The patch appears to include 19 new or modified tests.
               [exec] 
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec] 
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec] 
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
               [exec] 
               [exec]     -1 release audit.  The applied patch generated 103 release audit warnings (more than the trunk's current 1 warnings).
               [exec] 
               [exec]     +1 system test framework.  The patch passed system test framework compile.
               [exec] 
          
          Show
          Eli Collins added a comment - Thanks Cos and Jakob. New patch fixes the warning. I'll commit this. [exec] [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 19 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] -1 release audit. The applied patch generated 103 release audit warnings (more than the trunk's current 1 warnings). [exec] [exec] +1 system test framework. The patch passed system test framework compile. [exec]
          Hide
          Eli Collins added a comment -

          I've committed this.

          Show
          Eli Collins added a comment - I've committed this.

            People

            • Assignee:
              Eli Collins
              Reporter:
              Ravi Phulari
            • Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development