Details

    • Type: Test
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.22.0
    • Fix Version/s: 0.22.0
    • Component/s: namenode, test
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      The existing replication tests (TestBlocksWithNotEnoughRacks, TestPendingReplication, TestOverReplicatedBlocks, TestReplicationPolicy, TestUnderReplicatedBlocks, and TestReplication) are missing tests for rack policy violations. This jira adds the following tests which I created when generating a new patch for HDFS-15.

      • Test that blocks that have a sufficient number of total replicas, but are not replicated cross rack, get replicated cross rack when a rack becomes available.
      • Test that new blocks for an underreplicated file will get replicated cross rack.
      • Mark a block as corrupt, and test that when it is re-replicated it is still replicated across racks.
      • Reduce the replication factor of a file, making sure that the only cross-rack replica is not removed when deleting replicas.
      • Test that when a block is re-replicated because a replica is lost due to host failure, the rack policy is preserved.
      • Test that when the excess replicas of a block are reduced due to a node re-joining the cluster, the rack policy is not violated.
      • Test that rack policy is still respected when blocks are replicated due to node decommissioning.
      • Test that rack policy is still respected when blocks are replicated due to node decommissioning, even when the blocks are over-replicated.
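      The invariant all of these tests check can be stated as a small predicate. The sketch below is a simplification of HDFS's actual block placement policy, written as standalone Java purely for illustration (it is not code from the patch):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RackPolicy {
    /**
     * Simplified rack policy for a block: if a block has more than one
     * replica, the replicas should span at least two racks, so that losing
     * an entire rack cannot lose the block. A single-replica block
     * trivially satisfies the policy.
     */
    public static boolean satisfied(List<String> replicaRacks) {
        if (replicaRacks.size() <= 1) {
            return true;
        }
        Set<String> distinctRacks = new HashSet<>(replicaRacks);
        return distinctRacks.size() >= 2;
    }
}
```

      For example, three replicas all on /rack1 violate the policy; once a node on /rack2 becomes available and one replica is moved there, the policy is restored, which is the scenario the first test above exercises.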
      1. hdfs-1562-1.patch
        26 kB
        Eli Collins
      2. hdfs-1562-2.patch
        38 kB
        Eli Collins
      3. hdfs-1562-3.patch
        58 kB
        Eli Collins
      4. hdfs-1562-4.patch
        59 kB
        Eli Collins
      5. hdfs-1562-5.patch
        59 kB
        Eli Collins

        Issue Links

          Activity

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-22-branch #41 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-22-branch/41/)

          Eli Collins added a comment -

          I merged this to 0.22.

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #650 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/650/)

          Matt Foley added a comment -

          It was moved to MiniDFSCluster. Same signature. Now it is an API!

          Tsz Wo Nicholas Sze added a comment -

          This is an incompatible change! See MAPREDUCE-2448.

          Just kidding. MapReduce tests should not use it in the first place. TestDatanodeBlockScanner is never an API.

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #601 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/601/)
          HDFS-1562. Add rack policy tests. Contributed by Eli Collins

          Eli Collins added a comment -

          Test failures are unrelated. I've committed this.

          Thanks Todd and Matt!

          Matt Foley added a comment -

          +1 Looks great! Thanks, Eli.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12477070/hdfs-1562-5.patch
          against trunk revision 1095830.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 49 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.hdfs.TestFileAppend4
          org.apache.hadoop.hdfs.TestFileConcurrentReader
          org.apache.hadoop.hdfs.TestLargeBlock
          org.apache.hadoop.hdfs.TestWriteConfigurationToDFS

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/408//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/408//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/408//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          +1, looks good to me. Will let Matt take another swing through before commit.

          Eli Collins added a comment -

          Forgot to address point #5. Modified corruptReplica to return boolean and assert if it ever sees more than one copy of the block file on a datanode, which shouldn't happen.
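          A minimal sketch of that revised contract, with the List standing in for scanning a datanode's storage directory (names and structure are illustrative, not the actual MiniDFSCluster code):

```java
import java.util.List;

public class CorruptReplicaSketch {
    /**
     * Mimics the revised corruptReplica contract: corrupt at most one block
     * file per datanode, return whether one was found, and fail loudly if a
     * datanode unexpectedly holds more than one copy of the same block.
     */
    public static boolean corruptReplica(String blockName, List<String> blockFilesOnDn) {
        int found = 0;
        for (String f : blockFilesOnDn) {
            if (f.equals(blockName)) {
                found++;
                // the real code would flip bytes in the block file here
            }
        }
        if (found > 1) {
            throw new AssertionError("Datanode holds " + found
                + " copies of " + blockName + "; expected at most 1");
        }
        return found == 1;
    }
}
```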

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12477058/hdfs-1562-4.patch
          against trunk revision 1095830.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 49 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.hdfs.TestFileConcurrentReader

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/404//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/404//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/404//console

          This message is automatically generated.

          Eli Collins added a comment -

          Thanks Matt.

          1. DFSTestUtil#readFile reads as much data as the file has; 1024 is the buffer size used by copyBytes, not the amount to read.

          2. Good point, renamed.

          3. Good suggestions. Incorporated them all.

          4. The current timeout is sufficient for this test as I lowered the necessary intervals in the cluster config (e.g. to heartbeat more quickly), but now that these are generic test functions I think you're right: they should have more slack. I'll double it to 20s.

          6. Changed the message to "All replicas not corrupted".

          7. Good point. I modified this API to just return needed replications for the given block. Originally these were separate checks that got coalesced into one API via the NameNodeAdapter, which is why it wasn't per-block, but checking per-block is better anyway.

          Patch attached.

          Matt Foley added a comment -

          This is a fairly large patch, so I put it on ReviewBoard to see a clear diff: https://reviews.apache.org/r/645/

          I really like the nice clean new TestBlocksWithNotEnoughRacks, using the methods Eli added to DFSTestUtil, and I commend the work he put into refactoring those methods so they can be re-usable.

          Since the DFSTestUtil methods are intended for re-use, I went over them in some detail, and have the following comments:

          1. DFSTestUtil#readFile(FileSystem fs, Path fileName)
          The previously existing readFile(File f) reads as much data as the file has. This routine only reads 1KB at most. Should they both do the same, whichever is better?

          2. isBlockCorrupt() will only return true if ALL replicas of the block are corrupt, so I suggest naming it something like "isAllBlockReplicasCorrupt()"

          3. I strongly feel that timeouts should (i) throw TimeoutException, and (ii) report ALL the values being waited on, not just one or none. The following "wait for condition" methods don't follow these rules:

          (a) waitForReplication(MiniDFSCluster cluster, Block b, int racks, int replicas, int neededReplicas)
          The current code does a println that only says a timeout occurred, then falls into three asserts, only one of which will trigger. Asserts imply a logic error, while TimeoutException implies a timeout occurred, and the exception message should include the current values of all three waited-on values, which are likely inter-related for debug purposes.

          Also, "&& count < 10" should be "&& count < ATTEMPTS"

          (b) waitCorruptReplicas() should use TimeoutException, and the error msg should say the failure state of the value being waited on (repls).

          (c) same for waitForDecommission()

          4. In the above wait loops, 10 seconds is the timeout. Is that long enough? It is barely three 3-sec cycles. Would 20sec. be better?

          5. corruptReplica() - The original code in TestDatanodeBlockScanner searched for and corrupted all corresponding block files on a datanode, but returned binary true/false. This proposed code returns the actual number of such block files found. That might not be a good idea, as clients (such as corruptBlockOnDataNodes()) probably assume the sum over all datanodes will equal the number of replicas.

          Perhaps it would be best for corruptReplica() itself to throw an exception if count > 1, rather than rely on every client to do so? Then it could just return a boolean for the two valid counts (0 and 1). IIRC, every client except corruptBlockOnDataNodes() is in fact testing that the result == 1.

          6. doTestEntirelyCorruptFile() line 261 asserts "equals" but the error message says "too few blocks", implying "less than". One or the other should be changed to agree.

          7. NameNodeAdapter#getReplicaInfo(NameNode namenode, Block b) is supposed to return info about specific block b,
          but the 3rd element of the tuple is "ns.blockManager.neededReplications.size()", which is for all blocks.
          Wouldn't it be appropriate to check ns.blockManager.neededReplications.contains(b), and if so iterate to find the count?

          Sorry this is long. Overall it's a great piece of work.
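          Point 3(a) can be sketched in plain Java. `ReplicaState`, the supplier, and the parameters below are illustrative stand-ins for the NameNode state the real helper queries; the point is the shape of the loop: named attempt budget, exact replica check, and a TimeoutException that reports all three waited-on values:

```java
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class WaitUtil {
    /** Illustrative snapshot of the three values waited on in point 3(a). */
    public static class ReplicaState {
        public final int racks, replicas, neededReplicas;
        public ReplicaState(int racks, int replicas, int neededReplicas) {
            this.racks = racks;
            this.replicas = replicas;
            this.neededReplicas = neededReplicas;
        }
    }

    private static boolean satisfied(ReplicaState s, int racks, int replicas, int needed) {
        // exact match on replica count, so transient over-replication keeps waiting
        return s.racks >= racks && s.replicas == replicas && s.neededReplicas <= needed;
    }

    /**
     * Polls until the observed state matches the expected values or the
     * attempt budget runs out. On timeout, throws TimeoutException whose
     * message reports ALL three current values, not just one.
     */
    public static ReplicaState waitForReplication(Supplier<ReplicaState> observe,
            int racks, int replicas, int neededReplicas,
            int attempts, long sleepMs) throws TimeoutException, InterruptedException {
        ReplicaState cur = observe.get();
        int count = 0;
        while (!satisfied(cur, racks, replicas, neededReplicas) && count < attempts) {
            Thread.sleep(sleepMs);
            cur = observe.get();
            count++;
        }
        if (!satisfied(cur, racks, replicas, neededReplicas)) {
            throw new TimeoutException("Timed out waiting for replication:"
                + " racks=" + cur.racks + " (want >= " + racks + ")"
                + ", replicas=" + cur.replicas + " (want " + replicas + ")"
                + ", neededReplicas=" + cur.neededReplicas + " (want <= " + neededReplicas + ")");
        }
        return cur;
    }
}
```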

          Matt Foley added a comment -

          After discussion with Eli, I moved the bug fix for TestDatanodeBlockScanner to HDFS-1855, and the code improvements to HDFS-1856. These changes are remarkably non-intersecting with the refactoring offered in HDFS-1562.

          Matt Foley added a comment -

          Hi Eli, didn't realize you were going to look at TestDatanodeBlockScanner too. I did extensive mods to it as part of HDFS-1295. Let's take a look and compare.

          Eli Collins added a comment -

          The contrib test failures are unrelated. I think the javadoc warning is bogus, here's what test-patch run with hdfs-1562-3.patch yields on my machine:

               [exec] 
               [exec] +1 overall.  
               [exec] 
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec] 
               [exec]     +1 tests included.  The patch appears to include 49 new or modified tests.
               [exec] 
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec] 
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec] 
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.
               [exec] 
               [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
               [exec] 
               [exec]     +1 system test framework.  The patch passed system test framework compile.
               [exec] 
          
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12476661/hdfs-1562-3.patch
          against trunk revision 1094748.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 49 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/384//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/384//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/384//console

          This message is automatically generated.

          Eli Collins added a comment -

          Hey Matt,

          Thanks for reviewing! Updated patch attached.

          • Addresses HDFS-1828 by making waitForReplication check for exact values
          • Added a comment by each config option being set with rationale
          • Folds all utility methods into DFSTestUtil. I used the NameNodeAdapter for waitForReplication since it uses protected methods. This method is needed in addition to waitReplication because it checks for specific values of neededReplications not exposed via the FileSystem API (the test is more fine-grained).
          • Good point WRT waitForCorruptReplicas. The test actually has the opposite problem: it explicitly attempts to report the corrupt replica from the client (via file access) because the datanode checking takes so long (the DataBlockScanner period is measured in hours, so it doesn't execute during the test runs). In the test, after the client reports the corrupt block to the Namenode, it immediately queries the namenode state to check that a corrupt replica has been identified so it can wait for replication. After looping this test, however, I discovered a problem with this approach too: sometimes the client only accesses the non-corrupt block location and therefore doesn't trigger the detection of the corrupt replica. The code for testing corrupt replicas in TestDatanodeBlockScanner (restart the DN, which will trigger block scanning) looks sound, so I refactored it out to a new method (DFSTestUtil#waitCorruptReplicas) and used it here.
          • Also refactored TestDatanodeBlockScanner to use waitReplication and new methods waitCorruptReplicas and isBlockCorrupt.
          • Removes TestDataNodeBlockScanner#corruptReplica in favor of MiniDFSCluster#corruptReplica (same implementation)

          I've looped the test using this patch and so far have seen no failures.

          Thanks,
          Eli

          Matt Foley added a comment -

          Hi Eli, thanks for pointing out the relationship between these two bugs. If you wish, since you are effectively rewriting the whole unit test file, I'll withdraw my patch except for a one-line fix which should not interfere with auto-merge when you submit.

          That said, I think your patch may be subject to the same problem as I fix in HDFS-1828:

          The primary problem in HDFS-1828 was that testSufficientlyReplicatedBlocksWithNotEnoughRacks() waited "while ((numRacks < 2) || (curReplicas != REPLICATION_FACTOR) || (neededReplicationSize > 0))" [line 79], and then asserted "(curReplicas == REPLICATION_FACTOR)" [line 95]; when in fact it was appropriate under the circumstances of the test, to expect curReplicas == REPLICATION_FACTOR+1, transiently.

          It looks like the same issue remains in your patch, because in method waitForReplication(), it waits "while ((curRacks < racks || curReplicas < replicas || curNeededReplicas > neededReplicas) && count < 10)", and then does "assertEquals(replicas, curReplicas)". So it will have the same problem, unless you never use this in a context where curReplicas > replicas might occur.

          A couple additional suggestions:

          1. You added a waitForReplication() method. Can you instead use DFSTestUtil.waitReplication()? (and BTW, this method correctly checks for replication being != rather than < expected value.) Or if you need the block-oriented signature of your version, can you consider adding it to DFSTestUtil instead of leaving it just in the one unit test module?

          2. I'm concerned about waitForCorruptReplicas(), because it is polling for a problematic condition that is supposed to be self-healing, and uses a fairly coarse poll frequency (a whole second). It is possible for such a test to "miss" the condition it is trying to catch. See HDFS-1806, where I just fixed such a problem by changing a polling frequency from 100ms to 5ms.

          Now, I haven't had time to fully understand the tests in your new version. It may be that you are controlling for other parameters, such as the values of DFS_HEARTBEAT_INTERVAL, DFS_BLOCKREPORT_INTERVAL, and DFS_NAMENODE_REPLICATION_INTERVAL, that would prevent the condition from self-healing in the time period over which you are waiting for it. But I have seen corrupt replicas be recognized and eliminated in less than a second, on a tiny cluster under proper intersection of events. Since such issues become long-lived intermittent false positives for lots of people on Hudson I hope you don't mind my asking you to reason through an explanation that this construct can't miss its condition. Thanks.
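          The HDFS-1828 hazard described above boils down to the shape of the wait predicate itself. A toy illustration (not the patch code):

```java
public class WaitPredicate {
    /**
     * Hazardous form: treats transient over-replication (cur > want) as
     * done, so a later assertEquals(want, cur) can fail intermittently.
     * Equivalently, the loop keeps waiting only while cur < want.
     */
    public static boolean doneIfAtLeast(int cur, int want) {
        return cur >= want;
    }

    /**
     * Exact form, as in DFSTestUtil.waitReplication: an over-replicated
     * block keeps the loop waiting until the excess replica is pruned.
     */
    public static boolean doneIfExact(int cur, int want) {
        return cur == want;
    }
}
```

          With REPLICATION_FACTOR = 3 and a transient fourth replica, `doneIfAtLeast(4, 3)` exits the loop while `doneIfExact(4, 3)` keeps waiting, which is the difference at issue.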

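The polling pitfall Matt Foley describes — waiting only while `curReplicas < replicas` and then asserting equality, which trips when the count transiently overshoots — can be illustrated with a minimal, self-contained sketch. This is not the actual Hadoop test code; `waitForReplicas` and the simulated count sequence are hypothetical stand-ins. The fix is to keep waiting while the count is merely *unequal* to the target, so transient over-replication is waited out rather than asserted against:

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.TimeoutException;

public class WaitReplication {
    // Poll until the observed replica count EQUALS the expected value
    // (the "!=" semantics Matt recommends), rather than stopping as soon
    // as it is no longer below target. A transient over-replication
    // (e.g. 3 replicas when 2 are expected) then does not end the wait
    // early and trip the subsequent equality assert.
    static int waitForReplicas(Iterator<Integer> observed, int expected, int maxPolls)
            throws TimeoutException {
        int cur = -1;
        for (int i = 0; i < maxPolls && observed.hasNext(); i++) {
            cur = observed.next();
            if (cur == expected) {
                return cur;
            }
        }
        throw new TimeoutException("still at " + cur + " replicas, expected " + expected);
    }

    public static void main(String[] args) throws Exception {
        // Simulated cluster: re-replication briefly yields 3 replicas
        // before the excess one is pruned back down to the target of 2.
        List<Integer> counts = List.of(1, 3, 3, 2);
        int result = waitForReplicas(counts.iterator(), 2, 10);
        if (result != 2) {
            throw new AssertionError("got " + result);
        }
        System.out.println("settled at " + result + " replicas");
    }
}
```

With a `<`-based wait, the same sequence would have exited at the transient count of 3 and the equality assert would have failed intermittently.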
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12476149/hdfs-1562-2.patch
          against trunk revision 1091515.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 33 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.hdfs.TestFileConcurrentReader

          -1 contrib tests. The patch failed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/350//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/350//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/350//console

          This message is automatically generated.

          Eli Collins added a comment -

          I think this patch addresses HDFS-1828.

          Eli Collins added a comment -

          Updated patch attached.

          Addresses Todd's feedback. The final assert in testCorruptBlockRereplicatedAcrossRacks is commented out because the block is over-replicated (has 3 replicas, not 2); I think this is because the corrupted replica gets deleted and then replaced by a newly scheduled replica even though another replica was already scheduled. Need to verify.

          Note that I modified the MiniDFSCluster API for corrupting a block to indicate whether it actually corrupted a block; currently it does not indicate this, so a test could silently fail to exercise the corruption path. I modified TestCrcCorruption and TestReplication to assert that blocks were in fact corrupted.
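The rationale for returning a boolean from the corruption helper can be sketched generically. This is a hypothetical, self-contained illustration (not the MiniDFSCluster code): a corruption helper that reports whether it actually flipped any bytes lets the caller assert the precondition instead of silently testing nothing:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CorruptReplica {
    // Hypothetical sketch of the API change described above: the helper
    // returns whether it actually modified a block file, so a test that
    // depends on corruption having happened can fail loudly when the
    // file was missing or empty, instead of passing vacuously.
    static boolean corruptBlockFile(Path blockFile) throws IOException {
        if (!Files.exists(blockFile) || Files.size(blockFile) == 0) {
            return false; // nothing was corrupted
        }
        byte[] data = Files.readAllBytes(blockFile);
        data[0] ^= 0xFF; // flip the bits of the first byte
        Files.write(blockFile, data);
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("blk_", ".data");
        Files.write(tmp, new byte[] {1, 2, 3});
        boolean corrupted = corruptBlockFile(tmp);
        if (!corrupted) {
            throw new AssertionError("expected the replica to be corrupted");
        }
        System.out.println("corrupted=" + corrupted);
        Files.delete(tmp);
    }
}
```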

          Konstantin Boudnik added a comment -

          This patch needs to be rebased against trunk.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12467146/hdfs-1562-1.patch
          against trunk revision 1056206.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 17 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:

          -1 contrib tests. The patch failed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/88//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/88//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/88//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          Few notes:

          • in TestBlocksWithNotEnoughRacks.waitForReplication, I think you need to take the FSN lock before calling things like blockManager.countNodes, neededReplications.size, etc, since those functions aren't inherently threadsafe.
          • While you're at it, maybe change the while loop to do { ... } while () so you don't need to duplicate the assignments of curRacks, curReplicas, curNeededReplicas?

          • waitForDecomission will wait forever - it should have a timeout after some number of seconds like waitReplication does.
          • testUnderReplicatedUsesNewRacks: would be good to add another test like this with replication 1, increase to 2 - then it would be a substantially different test than the one above, because it doesn't start out in neededReplication, right?
          • testCorruptBlockRereplicatedAcrossRacks: any way to verify here that it ended up with a clean copy of the block? ie that it's not just claiming "oh yea, I'm on 2 racks" even though rack2 is a corrupt copy. Perhaps this test could be done by actually corrupting the block on disk (there's some code for this in one of the various test utils), then triggering the DN's block scanner. Then wait for it to fix itself.
          • "block that is across racks is not removed when deleting replicas" - s/block/replica/
          • "Test that when the execss replicas of" - typo "excess"
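Todd's first three points — take the lock around the state reads, use do { ... } while () to avoid duplicating the assignments, and bound the wait — can be combined in one sketch. This is a hypothetical, self-contained illustration, not the patch's actual code: the `getCur*` methods stand in for the FSNamesystem queries, and the lock is represented only by a comment since there is no real namesystem here:

```java
public class WaitLoop {
    // Simulated cluster state that converges over a few polls.
    static int racks = 1, replicas = 1, needed = 2;

    // Stand-ins for blockManager.countNodes(), countRacks(), and
    // neededReplications.size(); each poll moves the cluster one step
    // closer to its target state.
    static int getCurRacks()    { racks = Math.min(2, racks + 1); return racks; }
    static int getCurReplicas() { replicas = Math.min(2, replicas + 1); return replicas; }
    static int getCurNeeded()   { needed = Math.max(0, needed - 1); return needed; }

    static void waitForReplication(int wantRacks, int wantReplicas, int wantNeeded)
            throws InterruptedException {
        int curRacks, curReplicas, curNeeded;
        int count = 0;
        do {
            // In the real test, these reads should be done while holding
            // the FSNamesystem lock, since they are not threadsafe.
            curRacks = getCurRacks();
            curReplicas = getCurReplicas();
            curNeeded = getCurNeeded();
            Thread.sleep(1); // poll interval, shortened for the sketch
        } while ((curRacks < wantRacks || curReplicas < wantReplicas
                  || curNeeded > wantNeeded) && ++count < 10); // bounded wait
        if (curReplicas != wantReplicas) {
            throw new AssertionError("expected " + wantReplicas
                + " replicas, got " + curReplicas);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        waitForReplication(2, 2, 0);
        System.out.println("replication reached target");
    }
}
```

The do-while form assigns the three state variables exactly once per iteration, and the retry bound keeps a never-satisfied condition from hanging the test forever.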
          Eli Collins added a comment -

          The patch also converts several checks for negative values of the # of replicas of a block to asserts as this value should never be negative.

          Eli Collins added a comment -

          Patch attached.


            People

            • Assignee:
              Eli Collins
              Reporter:
              Eli Collins
            • Votes:
              0
              Watchers:
              7
