Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-6631

TestPread#testHedgedReadLoopTooManyTimes fails intermittently.

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0, 2.5.0
    • Fix Version/s: 2.5.0
    • Component/s: hdfs-client, test
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      TestPread#testHedgedReadLoopTooManyTimes fails intermittently. It looks like a race condition on counting the expected number of loop iterations. I can repro the test failure more consistently on Windows.

      1. HDFS-6631.txt
        3 kB
        Liang Xie
      2. org.apache.hadoop.hdfs.TestPread-output.txt
        552 kB
        Chris Nauroth

        Issue Links

          Activity

          Chris Nauroth created issue -
          Chris Nauroth made changes -
          Field Original Value New Value
          Assignee Liang Xie [ xieliang007 ]
          Hide
          Chris Nauroth added a comment -

          Here is the stack trace from a failed test run.

          testHedgedReadLoopTooManyTimes(org.apache.hadoop.hdfs.TestPread)  Time elapsed:0.918 sec  <<< FAILURE!
          java.lang.AssertionError: expected:<3> but was:<2>
                  at org.junit.Assert.fail(Assert.java:88)
                  at org.junit.Assert.failNotEquals(Assert.java:743)
                  at org.junit.Assert.assertEquals(Assert.java:118)
                  at org.junit.Assert.assertEquals(Assert.java:555)
                  at org.junit.Assert.assertEquals(Assert.java:542)
                  at org.apache.hadoop.hdfs.TestPread.testHedgedReadLoopTooManyTimes(TestPread.java:329)
          
          Show
          Chris Nauroth added a comment - Here is the stack trace from a failed test run. testHedgedReadLoopTooManyTimes(org.apache.hadoop.hdfs.TestPread) Time elapsed:0.918 sec <<< FAILURE! java.lang.AssertionError: expected:<3> but was:<2> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.hdfs.TestPread.testHedgedReadLoopTooManyTimes(TestPread.java:329)
          Chris Nauroth made changes -
          Link This issue relates to HDFS-6591 [ HDFS-6591 ]
          Chris Nauroth made changes -
          Link This issue relates to HDFS-6614 [ HDFS-6614 ]
          Hide
          Liang Xie added a comment -

          I could not repro on my dev box so far. probably it related with this code in case "Thread.sleep(hedgedReadTimeoutMillis + 1);" ? if almost reach the hedged read threshold (50ms), then a slow down,e.g. gc occur, or scheduler related activity, so when this testing continues to run after 50+1 ms with cpu begin to be scheduled again, then the counter can be 2.
          It's just a guess, maybe a whole test log file will help us diagonise it. Chris, do you still have that log outfile copy?
          ps: if we enable the log level to ALL in this testing, i think it will very be easier to understand all stuff

          Show
          Liang Xie added a comment - I could not repro on my dev box so far. probably it related with this code in case "Thread.sleep(hedgedReadTimeoutMillis + 1);" ? if almost reach the hedged read threshold (50ms), then a slow down,e.g. gc occur, or scheduler related activity, so when this testing continues to run after 50+1 ms with cpu begin to be scheduled again, then the counter can be 2. It's just a guess, maybe a whole test log file will help us diagonise it. Chris, do you still have that log outfile copy? ps: if we enable the log level to ALL in this testing, i think it will very be easier to understand all stuff
          Hide
          Chris Nauroth added a comment -

          Here is a log file from running "mvn -o clean test -Dtest=TestPread#testHedgedReadLoopTooManyTimes" and seeing the test fail, with log level set to ALL.

          Show
          Chris Nauroth added a comment - Here is a log file from running "mvn -o clean test -Dtest=TestPread#testHedgedReadLoopTooManyTimes" and seeing the test fail, with log level set to ALL.
          Chris Nauroth made changes -
          Hide
          Liang Xie added a comment -

          I see, compared with my dev box logfile, found inside attached "org.apache.hadoop.hdfs.TestPread-output.txt" file, it did not trigger a real hedged read.
          I only could find log like "Waited 50ms to read from 127.0.0.1:xxxxx spawning hedged read" in my logfile.
          In your file, the execution sequence is :
          >read from 127.0.0.1:53908 < here the counter is 1
          ->throw Checksum Exception
          >read from 127.0.0.1:53919 < here the counter is 2
          ->return result , line 1127
          that means all two read path gone to the "if (futures.isEmpty()) {" flow (L1112)

          so the root question is if we set hedged.read.threshold = 50ms, and Mockito.doAnswer has a "Thread.sleep(50+1)", this statement:

                    Future<ByteBuffer> future = hedgedService.poll(
                        dfsClient.getHedgedReadTimeout(), TimeUnit.MILLISECONDS);
          

          In my dev box, it did just like Javadoc says:

          Retrieves and removes the Future representing the next completed task, waiting if necessary up to the 
           specified wait time if none are yet present.
          Parameters:
          	timeout how long to wait before giving up, in units of unit
          	unit a TimeUnit determining how to interpret the timeout parameter
          Returns:
          	 the Future representing the next completed task or null if the specified waiting time elapses before 
          	  one is present
          Throws:
          	InterruptedException - if interrupted while waiting
          

          so the future will be null.

          but in Chris's box, the exception from thread pool will jump out firstly, so gone to L1140 directly: "catch (ExecutionException e)"

          so per my current understanding, it should be related with os thread schedule (granularity) , we probably need to enlarge the Mockito sleep interval.

          Show
          Liang Xie added a comment - I see, compared with my dev box logfile, found inside attached "org.apache.hadoop.hdfs.TestPread-output.txt" file, it did not trigger a real hedged read. I only could find log like "Waited 50ms to read from 127.0.0.1:xxxxx spawning hedged read" in my logfile. In your file, the execution sequence is : >read from 127.0.0.1:53908 < here the counter is 1 ->throw Checksum Exception >read from 127.0.0.1:53919 < here the counter is 2 ->return result , line 1127 that means all two read path gone to the "if (futures.isEmpty()) {" flow (L1112) so the root question is if we set hedged.read.threshold = 50ms, and Mockito.doAnswer has a "Thread.sleep(50+1)", this statement: Future<ByteBuffer> future = hedgedService.poll( dfsClient.getHedgedReadTimeout(), TimeUnit.MILLISECONDS); In my dev box, it did just like Javadoc says: Retrieves and removes the Future representing the next completed task, waiting if necessary up to the specified wait time if none are yet present. Parameters: timeout how long to wait before giving up, in units of unit unit a TimeUnit determining how to interpret the timeout parameter Returns: the Future representing the next completed task or null if the specified waiting time elapses before one is present Throws: InterruptedException - if interrupted while waiting so the future will be null. but in Chris's box, the exception from thread pool will jump out firstly, so gone to L1140 directly: "catch (ExecutionException e)" so per my current understanding, it should be related with os thread schedule (granularity) , we probably need to enlarge the Mockito sleep interval.
          Liang Xie made changes -
          Attachment HDFS-6631.txt [ 12654772 ]
          Hide
          Liang Xie added a comment -

          I just uploaded a tentitive patch, Chris Nauroth, could you try it in your easy-repro env ? thank you very much!

          Show
          Liang Xie added a comment - I just uploaded a tentitive patch, Chris Nauroth , could you try it in your easy-repro env ? thank you very much!
          Liang Xie made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Liang Xie made changes -
          Assignee Liang Xie [ xieliang007 ]
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12654772/HDFS-6631.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          -1 javac. The patch appears to cause the build to fail.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7301//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12654772/HDFS-6631.txt against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. -1 javac . The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7301//console This message is automatically generated.
          Liang Xie made changes -
          Attachment HDFS-6631.txt [ 12654772 ]
          Hide
          Liang Xie added a comment -

          emm, why QA complain couldn't compile, i verified again it works. let me remove the old one and re-upload

          Show
          Liang Xie added a comment - emm, why QA complain couldn't compile, i verified again it works. let me remove the old one and re-upload
          Liang Xie made changes -
          Attachment HDFS-6631.txt [ 12654928 ]
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12654928/HDFS-6631.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7317//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7317//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12654928/HDFS-6631.txt against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . There were no new javadoc warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7317//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7317//console This message is automatically generated.
          Hide
          Chris Nauroth added a comment -

          Thanks, Liang, for the really nice analysis! The timings you chose in this patch make a lot of sense too when we consider that the default timer resolution on Windows is 15.6 ms. The old logic used just a 1 ms delta. Introducing a separate wait greater than 15.6 ms makes sense, which is what this patch does. I ran the test many times on both Mac and Windows, and I saw no failures.

          +1 for the patch. I'll commit this.

          Show
          Chris Nauroth added a comment - Thanks, Liang, for the really nice analysis! The timings you chose in this patch make a lot of sense too when we consider that the default timer resolution on Windows is 15.6 ms. The old logic used just a 1 ms delta. Introducing a separate wait greater than 15.6 ms makes sense, which is what this patch does. I ran the test many times on both Mac and Windows, and I saw no failures. +1 for the patch. I'll commit this.
          Chris Nauroth made changes -
          Hadoop Flags Reviewed [ 10343 ]
          Chris Nauroth made changes -
          Target Version/s 3.0.0, 2.6.0 [ 12320356, 12327181 ] 3.0.0, 2.5.0 [ 12320356, 12326264 ]
          Hide
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #5859 (See https://builds.apache.org/job/Hadoop-trunk-Commit/5859/)
          HDFS-6631. TestPread#testHedgedReadLoopTooManyTimes fails intermittently. Contributed by Liang Xie. (cnauroth: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1609534)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClientFaultInjector.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestPread.java
          Show
          Hudson added a comment - FAILURE: Integrated in Hadoop-trunk-Commit #5859 (See https://builds.apache.org/job/Hadoop-trunk-Commit/5859/ ) HDFS-6631 . TestPread#testHedgedReadLoopTooManyTimes fails intermittently. Contributed by Liang Xie. (cnauroth: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1609534 ) /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClientFaultInjector.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestPread.java
          Hide
          Chris Nauroth added a comment -

          I committed this to trunk, branch-2 and branch-2.5. Liang Xie, thank you for your diligence in fixing these issues related to hedged reads.

          Show
          Chris Nauroth added a comment - I committed this to trunk, branch-2 and branch-2.5. Liang Xie , thank you for your diligence in fixing these issues related to hedged reads.
          Chris Nauroth made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Fix Version/s 3.0.0 [ 12320356 ]
          Fix Version/s 2.5.0 [ 12326264 ]
          Resolution Fixed [ 1 ]
          Hide
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk #610 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/610/)
          HDFS-6631. TestPread#testHedgedReadLoopTooManyTimes fails intermittently. Contributed by Liang Xie. (cnauroth: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1609534)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClientFaultInjector.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestPread.java
          Show
          Hudson added a comment - FAILURE: Integrated in Hadoop-Yarn-trunk #610 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/610/ ) HDFS-6631 . TestPread#testHedgedReadLoopTooManyTimes fails intermittently. Contributed by Liang Xie. (cnauroth: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1609534 ) /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClientFaultInjector.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestPread.java
          Hide
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #1828 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1828/)
          HDFS-6631. TestPread#testHedgedReadLoopTooManyTimes fails intermittently. Contributed by Liang Xie. (cnauroth: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1609534)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClientFaultInjector.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestPread.java
          Show
          Hudson added a comment - FAILURE: Integrated in Hadoop-Mapreduce-trunk #1828 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1828/ ) HDFS-6631 . TestPread#testHedgedReadLoopTooManyTimes fails intermittently. Contributed by Liang Xie. (cnauroth: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1609534 ) /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClientFaultInjector.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestPread.java
          Hide
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk #1801 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1801/)
          HDFS-6631. TestPread#testHedgedReadLoopTooManyTimes fails intermittently. Contributed by Liang Xie. (cnauroth: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1609534)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClientFaultInjector.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestPread.java
          Show
          Hudson added a comment - FAILURE: Integrated in Hadoop-Hdfs-trunk #1801 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1801/ ) HDFS-6631 . TestPread#testHedgedReadLoopTooManyTimes fails intermittently. Contributed by Liang Xie. (cnauroth: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1609534 ) /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClientFaultInjector.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestPread.java
          Karthik Kambatla (Inactive) made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Allen Wittenauer made changes -
          Fix Version/s 3.0.0 [ 12320356 ]
          Transition Time In Source Status Execution Times Last Executer Last Execution Date
          Open Open Patch Available Patch Available
          1d 14h 46m 1 Liang Xie 09/Jul/14 10:06
          Patch Available Patch Available Resolved Resolved
          1d 9h 13m 1 Chris Nauroth 10/Jul/14 19:20
          Resolved Resolved Closed Closed
          35d 11h 20m 1 Karthik Kambatla (Inactive) 15/Aug/14 06:41

            People

            • Assignee:
              Liang Xie
              Reporter:
              Chris Nauroth
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development