Hadoop HDFS / HDFS-4633

TestDFSClientExcludedNodes fails sporadically if excluded nodes cache expires too quickly

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.0, 3.0.0-alpha1
    • Fix Version/s: 2.3.0
    • Component/s: hdfs-client, test
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      TestDFSClientExcludedNodes simulates failures of individual data nodes in the client's write pipeline and checks the client's ability to recover. HDFS-4246 added support for periodic "forgiveness": the client caches the list of known bad data nodes and evicts entries periodically. The test uses a 1-second cache expiration, which sometimes causes failed nodes to be forgiven too quickly, violating the assumptions of the test.
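To make the forgiveness mechanism concrete, here is a minimal sketch of a time-based exclusion cache. The class `ExpiringExclusionCache` and its methods are hypothetical names chosen for illustration; this is not the HDFS client's actual implementation. The caller passes the current time explicitly so the behavior is deterministic.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for an excluded-nodes cache; not the HDFS implementation. */
final class ExpiringExclusionCache {
    private final long expiryMillis;
    // Node name -> simulated time (ms) at which the node was marked bad.
    private final Map<String, Long> excludedAt = new HashMap<>();

    ExpiringExclusionCache(long expiryMillis) {
        this.expiryMillis = expiryMillis;
    }

    /** Marks a node as bad at the given time. */
    void exclude(String node, long nowMillis) {
        excludedAt.put(node, nowMillis);
    }

    /** Evicts expired entries, then reports whether the node is still excluded. */
    boolean isExcluded(String node, long nowMillis) {
        excludedAt.values().removeIf(markedAt -> nowMillis - markedAt >= expiryMillis);
        return excludedAt.containsKey(node);
    }

    public static void main(String[] args) {
        ExpiringExclusionCache cache = new ExpiringExclusionCache(1_000);
        cache.exclude("dn1", 0);
        System.out.println(cache.isExcluded("dn1", 500));   // within the expiry window
        System.out.println(cache.isExcluded("dn1", 1_500)); // past the expiry window
    }
}
```

With a 1-second expiry, a node that failed only moments ago becomes eligible for the write pipeline again, which is the behavior the test trips over.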

      1. HDFS-4633.3.patch
        3 kB
        Chris Nauroth
      2. HDFS-4633.2.patch
        3 kB
        Chris Nauroth
      3. HDFS-4633.1.patch
        3 kB
        Chris Nauroth

          Activity

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #1595 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1595/)
          HDFS-4633. Change attribution in CHANGES.txt to version 2.2.1. (cnauroth: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1537345)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk #1569 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1569/)
          HDFS-4633. Change attribution in CHANGES.txt to version 2.2.1. (cnauroth: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1537345)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Yarn-trunk #379 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/379/)
          HDFS-4633. Change attribution in CHANGES.txt to version 2.2.1. (cnauroth: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1537345)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-trunk-Commit #4679 (See https://builds.apache.org/job/Hadoop-trunk-Commit/4679/)
          HDFS-4633. Change attribution in CHANGES.txt to version 2.2.1. (cnauroth: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1537345)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          cnauroth Chris Nauroth added a comment -

          I've merged this patch to branch-2 and branch-2.2. I've also updated attribution in CHANGES.txt so that it's listed under release 2.2.1.

          cnauroth Chris Nauroth added a comment -

          I'm going to merge this down to branch-2 and branch-2.2.

          hudson Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #1386 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1386/)
          HDFS-4633 TestDFSClientExcludedNodes fails sporadically if excluded nodes cache expires too quickly (Chris Nauroth via Sanjay) (Revision 1461846)

          Result = SUCCESS
          sradia : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1461846
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSClientExcludedNodes.java
          hudson Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #1358 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1358/)
          HDFS-4633 TestDFSClientExcludedNodes fails sporadically if excluded nodes cache expires too quickly (Chris Nauroth via Sanjay) (Revision 1461846)

          Result = FAILURE
          sradia : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1461846
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSClientExcludedNodes.java
          hudson Hudson added a comment -

          Integrated in Hadoop-Yarn-trunk #169 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/169/)
          HDFS-4633 TestDFSClientExcludedNodes fails sporadically if excluded nodes cache expires too quickly (Chris Nauroth via Sanjay) (Revision 1461846)

          Result = FAILURE
          sradia : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1461846
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSClientExcludedNodes.java
          hudson Hudson added a comment -

          Integrated in Hadoop-trunk-Commit #3537 (See https://builds.apache.org/job/Hadoop-trunk-Commit/3537/)
          HDFS-4633 TestDFSClientExcludedNodes fails sporadically if excluded nodes cache expires too quickly (Chris Nauroth via Sanjay) (Revision 1461846)

          Result = SUCCESS
          sradia : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1461846
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSClientExcludedNodes.java
          sanjay.radia Sanjay Radia added a comment -

          +1
          Thanks Chris. Committed.

          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12575419/HDFS-4633.3.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4143//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4143//console

          This message is automatically generated.

          cnauroth Chris Nauroth added a comment -

          Sanjay, thanks for the catch. Here is a new patch that fixes the comments.

          sanjay.radia Sanjay Radia added a comment -

          In the patch, two comments do not match the code.
          > + // Forgive nodes in under 10s for this test case.
          Here you changed the comment but forgot to change it again when you lowered the time based on Arpit's feedback.

          >// [Sleeping just in case the restart of the DNs completed < 2s cause
          2s is the older time; you have changed it to 5s.

          arpitagarwal Arpit Agarwal added a comment -

          Patch looks good, thank you for reducing the timeouts.

          +1

          cnauroth Chris Nauroth added a comment -

          Thanks, Arpit. Here is an updated patch that uses the timing values you recommended. I retested this on multiple machines, and it passed consistently, so I think these settings will work fine.

          arpitagarwal Arpit Agarwal added a comment -

          Good find in testExcludedNodesForgiveness. I verified the fix works on Windows and OS X.

          I wonder if we can reduce the test execution time a little by reducing these delays?

              // Forgive nodes in under 10s for this test case.
              conf.setLong(
                  DFSConfigKeys.DFS_CLIENT_WRITE_EXCLUDE_NODES_CACHE_EXPIRY_INTERVAL,
                  10000);
          
              ThreadUtil.sleepAtLeastIgnoreInterrupts(15000);
          

          The test passes reliably on my Windows machine with the former set to 2500ms and the latter to 5000ms. I'm fine if you prefer leaving the higher values to rule out spurious failures though.

          cnauroth Chris Nauroth added a comment -

          BTW, big thanks to Arpit Agarwal for spotting the problem of failing to shut down MiniDFSCluster.

          cnauroth Chris Nauroth added a comment -

          With this patch, the test passes consistently on every machine I've tried. The changes are:

          1. Guarantee that each test properly shuts down its MiniDFSCluster.
          2. Increase timeouts in test annotations from 10s to 60s. These timeouts were too tight and even caused sporadic failures on my fastest machine.
          3. Increase excluded nodes cache expiry from 1s to 10s. I expect this is plenty of time for any machine to make it through the loop in DFSOutputStream#DataStreamer#nextBlockOutputStream.
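For the first change, one common way to guarantee shutdown is a try/finally block around the test body (a JUnit `@After` method is another). The thread does not say which form the patch uses, so the following is only a sketch of the pattern. `FakeCluster` is a hypothetical stand-in that tracks a running flag; it is not `MiniDFSCluster`.

```java
/** Sketch of the shutdown-in-finally pattern, using a hypothetical stand-in cluster. */
final class ShutdownInFinallySketch {

    /** Hypothetical stand-in for a mini cluster; it only tracks whether it is running. */
    static final class FakeCluster {
        private boolean running = true;

        boolean isRunning() {
            return running;
        }

        void shutdown() {
            running = false;
        }
    }

    /** Runs the test body and shuts the cluster down even if the body throws. */
    static void runWithCluster(FakeCluster cluster, Runnable testBody) {
        try {
            testBody.run();
        } finally {
            cluster.shutdown(); // runs on success and on failure
        }
    }

    public static void main(String[] args) {
        FakeCluster cluster = new FakeCluster();
        try {
            runWithCluster(cluster, () -> {
                throw new IllegalStateException("simulated test failure");
            });
        } catch (IllegalStateException expected) {
            // The failure still propagates to the caller.
        }
        System.out.println("still running: " + cluster.isRunning());
    }
}
```

The point of the pattern is that a failing test cannot leave a cluster running to interfere with later tests in the same JVM.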
          cnauroth Chris Nauroth added a comment -

          Here are some additional details. There is a bad interaction between the 1-second cache expiration used by TestDFSClientExcludedNodes#testExcludedNodesForgiveness and the exclusion/retry logic within DFSOutputStream#DataStreamer#nextBlockOutputStream. Here is the sequence of events I observed during a failed test run. Assume 3 data nodes named dn1, dn2, and dn3.

          1. DFSOutputStream writes first block to [dn1, dn2, dn3].
          2. Test stops data nodes [dn1, dn2].
          3. DFSOutputStream attempts writing second block to [dn1, dn2, dn3]. The write to dn1 fails, so the client marks dn1 excluded.
          4. DFSOutputStream retries and attempts writing second block to [dn2, dn3]. The write to dn2 fails, so the client marks dn2 excluded.
          5. DFSOutputStream retries, but by now, > 1 second has elapsed since dn1 failed. dn1 gets evicted from the cache, and the client attempts writing second block to [dn1, dn3]. This fails again, so the client marks dn1 excluded again.
          6. DFSOutputStream retries, but by now, > 1 second has elapsed since dn2 failed. dn2 gets evicted from the cache, and the client attempts writing second block to [dn2, dn3]. This fails again, so the client marks dn2 excluded again.
          7. At this point, DFSOutputStream#DataStreamer#nextBlockOutputStream has exceeded max block write retries (3). It aborts and throws IOException with "Unable to create new block.".
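The sequence above can be replayed with a small simulation. This is a toy model, not the `DFSOutputStream` code: the class and method names are hypothetical, the clock is simulated, and it assumes that every failed attempt costs a fixed amount of time (1,100 ms here) and that the client makes one initial attempt plus up to three retries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of the exclusion/retry interaction; not the DFSOutputStream implementation. */
final class ForgivenessRaceSketch {

    /**
     * Returns the 1-based attempt number that succeeded, or -1 if the initial
     * attempt and all retries failed.
     */
    static int writeBlock(long expiryMillis, long failedAttemptCostMillis, int maxRetries) {
        List<String> allNodes = Arrays.asList("dn1", "dn2", "dn3");
        Set<String> stopped = new HashSet<>(Arrays.asList("dn1", "dn2"));
        Map<String, Long> excludedAt = new HashMap<>(); // node -> time it was marked bad
        long now = 0; // simulated clock in milliseconds

        for (int attempt = 1; attempt <= maxRetries + 1; attempt++) {
            // Forgive nodes whose exclusion is at least as old as the expiry interval.
            final long t = now;
            excludedAt.values().removeIf(markedAt -> t - markedAt >= expiryMillis);

            // Build the pipeline from every node that is not currently excluded.
            List<String> pipeline = new ArrayList<>();
            for (String node : allNodes) {
                if (!excludedAt.containsKey(node)) {
                    pipeline.add(node);
                }
            }

            // The attempt fails at the first stopped node in the pipeline.
            String failed = null;
            for (String node : pipeline) {
                if (stopped.contains(node)) {
                    failed = node;
                    break;
                }
            }
            if (failed == null && !pipeline.isEmpty()) {
                return attempt; // every node in the pipeline is alive
            }

            now += failedAttemptCostMillis; // assumption: a failed attempt costs fixed time
            if (failed != null) {
                excludedAt.put(failed, now);
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("1s expiry:  " + writeBlock(1_000, 1_100, 3));
        System.out.println("10s expiry: " + writeBlock(10_000, 1_100, 3));
    }
}
```

Under these assumptions, the 1-second expiry exhausts all four attempts because dn1 and dn2 keep re-entering the pipeline, while the 10-second expiry succeeds on the third attempt with only dn3 in the pipeline.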

            People

            • Assignee: Chris Nauroth
            • Reporter: Chris Nauroth
            • Votes: 0
            • Watchers: 4
