Uploaded image for project: 'Hadoop HDFS'
  1. Hadoop HDFS
  2. HDFS-9466

TestShortCircuitCache#testDataXceiverCleansUpSlotsOnFailure is flaky

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: fs, hdfs-client
    • Labels:
      None
    • Target Version/s:

      Description

      This test is flaky and fails quite frequently in trunk.
      Error Message
      expected:<1> but was:<2>
      Stacktrace

      java.lang.AssertionError: expected:<1> but was:<2>
      	at org.junit.Assert.fail(Assert.java:88)
      	at org.junit.Assert.failNotEquals(Assert.java:743)
      	at org.junit.Assert.assertEquals(Assert.java:118)
      	at org.junit.Assert.assertEquals(Assert.java:555)
      	at org.junit.Assert.assertEquals(Assert.java:542)
      	at org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitCache$17.accept(TestShortCircuitCache.java:636)
      	at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.visit(ShortCircuitRegistry.java:395)
      	at org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitCache.checkNumberOfSegmentsAndSlots(TestShortCircuitCache.java:631)
      	at org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitCache.testDataXceiverCleansUpSlotsOnFailure(TestShortCircuitCache.java:684)
      

      Thanks to Xiao Chen for identifying the issue.

      1. org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitCache-output.txt
        100 kB
        Wei-Chiu Chuang
      2. HDFS-9466.002.patch
        4 kB
        Wei-Chiu Chuang
      3. HDFS-9466.001.patch
        3 kB
        Wei-Chiu Chuang

        Issue Links

          Activity

          Hide
          jojochuang Wei-Chiu Chuang added a comment -

          This is a regression of HDFS-7915.

          Show
          jojochuang Wei-Chiu Chuang added a comment - This is a regression of HDFS-7915 .
          Hide
          jojochuang Wei-Chiu Chuang added a comment -

          From what I can see, the assertion failed when the expected number of slots is 2, not 1.

          Show
          jojochuang Wei-Chiu Chuang added a comment - From what I can see, the assertion failed when the expected number of slots is 2, not 1.
          Hide
          jojochuang Wei-Chiu Chuang added a comment -

          Assigning this JIRA to me. I found it is due to race condition. The test checks the number of slots before the slots are removed.

          Show
          jojochuang Wei-Chiu Chuang added a comment - Assigning this JIRA to me. I found it is due to race condition. The test checks the number of slots before the slots are removed.
          Hide
          jojochuang Wei-Chiu Chuang added a comment -

          Rev01: changed checkNumberOfSegmentsAndSlots. Instead of assert the number of slots right away, use GenericTestUtils.waitFor() helper method to periodically check the number of slots. The time out of GenericTestUtils.waitFor is set to infinity, but the test case it self has a predefined timeout, so it shouldn't matter.

          Show
          jojochuang Wei-Chiu Chuang added a comment - Rev01: changed checkNumberOfSegmentsAndSlots. Instead of assert the number of slots right away, use GenericTestUtils.waitFor() helper method to periodically check the number of slots. The time out of GenericTestUtils.waitFor is set to infinity, but the test case it self has a predefined timeout, so it shouldn't matter.
          Hide
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 0s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 8m 1s trunk passed
          +1 compile 0m 45s trunk passed with JDK v1.8.0_66
          +1 compile 0m 42s trunk passed with JDK v1.7.0_91
          +1 checkstyle 0m 17s trunk passed
          +1 mvnsite 0m 55s trunk passed
          +1 mvneclipse 0m 14s trunk passed
          +1 findbugs 2m 3s trunk passed
          +1 javadoc 1m 9s trunk passed with JDK v1.8.0_66
          +1 javadoc 1m 55s trunk passed with JDK v1.7.0_91
          +1 mvninstall 0m 51s the patch passed
          +1 compile 0m 44s the patch passed with JDK v1.8.0_66
          +1 javac 0m 43s the patch passed
          +1 compile 0m 43s the patch passed with JDK v1.7.0_91
          +1 javac 0m 43s the patch passed
          +1 checkstyle 0m 16s the patch passed
          +1 mvnsite 0m 56s the patch passed
          +1 mvneclipse 0m 14s the patch passed
          +1 whitespace 0m 0s Patch has no whitespace issues.
          +1 findbugs 2m 8s the patch passed
          +1 javadoc 1m 10s the patch passed with JDK v1.8.0_66
          +1 javadoc 1m 49s the patch passed with JDK v1.7.0_91
          -1 unit 57m 45s hadoop-hdfs in the patch failed with JDK v1.8.0_66.
          -1 unit 54m 40s hadoop-hdfs in the patch failed with JDK v1.7.0_91.
          +1 asflicense 0m 25s Patch does not generate ASF License warnings.
          140m 29s



          Reason Tests
          JDK v1.8.0_66 Failed junit tests hadoop.hdfs.server.namenode.snapshot.TestSnapshotDeletion
            hadoop.hdfs.server.namenode.ha.TestDFSUpgradeWithHA
            hadoop.hdfs.server.datanode.TestBlockScanner
            hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots
            hadoop.hdfs.server.namenode.TestNNThroughputBenchmark
            hadoop.hdfs.server.blockmanagement.TestReplicationPolicyConsiderLoad
          JDK v1.7.0_91 Failed junit tests hadoop.hdfs.server.namenode.snapshot.TestSnapshot
            hadoop.hdfs.TestDFSShell
            hadoop.hdfs.server.namenode.TestNNThroughputBenchmark
            hadoop.hdfs.server.blockmanagement.TestReplicationPolicyConsiderLoad



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:0ca8df7
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12779232/HDFS-9466.001.patch
          JIRA Issue HDFS-9466
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 7057e9fe416d 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 8c180a1
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/13984/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_66.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/13984/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_91.txt
          unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/13984/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_66.txt https://builds.apache.org/job/PreCommit-HDFS-Build/13984/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_91.txt
          JDK v1.7.0_91 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/13984/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
          Max memory used 76MB
          Powered by Apache Yetus 0.2.0-SNAPSHOT http://yetus.apache.org
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/13984/console

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 0s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 1 new or modified test files. +1 mvninstall 8m 1s trunk passed +1 compile 0m 45s trunk passed with JDK v1.8.0_66 +1 compile 0m 42s trunk passed with JDK v1.7.0_91 +1 checkstyle 0m 17s trunk passed +1 mvnsite 0m 55s trunk passed +1 mvneclipse 0m 14s trunk passed +1 findbugs 2m 3s trunk passed +1 javadoc 1m 9s trunk passed with JDK v1.8.0_66 +1 javadoc 1m 55s trunk passed with JDK v1.7.0_91 +1 mvninstall 0m 51s the patch passed +1 compile 0m 44s the patch passed with JDK v1.8.0_66 +1 javac 0m 43s the patch passed +1 compile 0m 43s the patch passed with JDK v1.7.0_91 +1 javac 0m 43s the patch passed +1 checkstyle 0m 16s the patch passed +1 mvnsite 0m 56s the patch passed +1 mvneclipse 0m 14s the patch passed +1 whitespace 0m 0s Patch has no whitespace issues. +1 findbugs 2m 8s the patch passed +1 javadoc 1m 10s the patch passed with JDK v1.8.0_66 +1 javadoc 1m 49s the patch passed with JDK v1.7.0_91 -1 unit 57m 45s hadoop-hdfs in the patch failed with JDK v1.8.0_66. -1 unit 54m 40s hadoop-hdfs in the patch failed with JDK v1.7.0_91. +1 asflicense 0m 25s Patch does not generate ASF License warnings. 140m 29s Reason Tests JDK v1.8.0_66 Failed junit tests hadoop.hdfs.server.namenode.snapshot.TestSnapshotDeletion   hadoop.hdfs.server.namenode.ha.TestDFSUpgradeWithHA   hadoop.hdfs.server.datanode.TestBlockScanner   hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots   hadoop.hdfs.server.namenode.TestNNThroughputBenchmark   hadoop.hdfs.server.blockmanagement.TestReplicationPolicyConsiderLoad JDK v1.7.0_91 Failed junit tests hadoop.hdfs.server.namenode.snapshot.TestSnapshot   hadoop.hdfs.TestDFSShell   hadoop.hdfs.server.namenode.TestNNThroughputBenchmark   hadoop.hdfs.server.blockmanagement.TestReplicationPolicyConsiderLoad Subsystem Report/Notes Docker Image:yetus/hadoop:0ca8df7 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12779232/HDFS-9466.001.patch JIRA Issue HDFS-9466 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 7057e9fe416d 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 8c180a1 findbugs v3.0.0 unit https://builds.apache.org/job/PreCommit-HDFS-Build/13984/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_66.txt unit https://builds.apache.org/job/PreCommit-HDFS-Build/13984/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_91.txt unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/13984/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_66.txt https://builds.apache.org/job/PreCommit-HDFS-Build/13984/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_91.txt JDK v1.7.0_91 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/13984/testReport/ modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs Max memory used 76MB Powered by Apache Yetus 0.2.0-SNAPSHOT http://yetus.apache.org Console output https://builds.apache.org/job/PreCommit-HDFS-Build/13984/console This message was automatically generated.
          Hide
          jojochuang Wei-Chiu Chuang added a comment -

          Test failures appear unrelated.
          I've run the test locally and had no failures in 1000 runs. Previously the test fails frequently, about 1 in 5 runs.

          Show
          jojochuang Wei-Chiu Chuang added a comment - Test failures appear unrelated. I've run the test locally and had no failures in 1000 runs. Previously the test fails frequently, about 1 in 5 runs.
          Hide
          xiaochen Xiao Chen added a comment -

          Thanks Wei-Chiu Chuang for working on this. The analysis sounds reasonable to me. I have 1 question and 2 minor comments.
          Question:
          Do we know why this failed only for testDataXceiverCleansUpSlotsOnFailure, but not for similar checks in testDataXceiverHandlesRequestShortCircuitShmFailure or testPreReceiptVerificationDfsClientCanDoScr? I personally only saw testDataXceiverCleansUpSlotsOnFailure fail intermittently, but never met the others. I could've missed them, but just want to understand better.
          Comments:

          • Can we update testPreReceiptVerificationDfsClientCanDoScr to use the extracted method checkNumberOfSegmentsAndSlots?
          • Can we still set an estimated max for waitFor? The test case has its own timeouts, but IMHO wait for Integer.MAX_VALUE is not best practice.
          Show
          xiaochen Xiao Chen added a comment - Thanks Wei-Chiu Chuang for working on this. The analysis sounds reasonable to me. I have 1 question and 2 minor comments. Question: Do we know why this failed only for testDataXceiverCleansUpSlotsOnFailure , but not for similar checks in testDataXceiverHandlesRequestShortCircuitShmFailure or testPreReceiptVerificationDfsClientCanDoScr ? I personally only saw testDataXceiverCleansUpSlotsOnFailure fail intermittently, but never met the others. I could've missed them, but just want to understand better. Comments: Can we update testPreReceiptVerificationDfsClientCanDoScr to use the extracted method checkNumberOfSegmentsAndSlots ? Can we still set an estimated max for waitFor ? The test case has its own timeouts, but IMHO wait for Integer.MAX_VALUE is not best practice.
          Hide
          jojochuang Wei-Chiu Chuang added a comment -

          Thank you very much for the review and comments!
          (1)
          AFAIK, the three tests use different fault injectors. testDataXceiverCleansUpSlotsOnFailure injects a failure at BlockReaderFactory.requestFileDescriptors(); while testDataXceiverHandlesRequestShortCircuitShmFailure injects a failure at DataXceiver.sendShmSuccessResponse, and testPreReceiptVerificationDfsClientCanDoScr injects a failure at BlockReaderFactory.requestFileDescriptors().

          (2) testPreReceiptVerificationDfsClientCanDoScr should also call checkNumberOfSegmentsAndSlots to maintain consistency.

          (3) Definitely. I was just not sure what would be a reasonable number. Setting 1 second or 10 seconds should be more than sufficient though.

          Show
          jojochuang Wei-Chiu Chuang added a comment - Thank you very much for the review and comments! (1) AFAIK, the three tests use different fault injectors. testDataXceiverCleansUpSlotsOnFailure injects a failure at BlockReaderFactory.requestFileDescriptors(); while testDataXceiverHandlesRequestShortCircuitShmFailure injects a failure at DataXceiver.sendShmSuccessResponse, and testPreReceiptVerificationDfsClientCanDoScr injects a failure at BlockReaderFactory.requestFileDescriptors(). (2) testPreReceiptVerificationDfsClientCanDoScr should also call checkNumberOfSegmentsAndSlots to maintain consistency. (3) Definitely. I was just not sure what would be a reasonable number. Setting 1 second or 10 seconds should be more than sufficient though.
          Hide
          jojochuang Wei-Chiu Chuang added a comment -

          Rev02: set a time out of 10 seconds. use checkNumberOfSegmentsAndSlots() in testPreReceiptVerificationDfsClientCanDoScr() to check numbers of segments and slots.

          Show
          jojochuang Wei-Chiu Chuang added a comment - Rev02: set a time out of 10 seconds. use checkNumberOfSegmentsAndSlots() in testPreReceiptVerificationDfsClientCanDoScr() to check numbers of segments and slots.
          Hide
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 0s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 8m 33s trunk passed
          +1 compile 0m 52s trunk passed with JDK v1.8.0_66
          +1 compile 0m 50s trunk passed with JDK v1.7.0_91
          +1 checkstyle 0m 16s trunk passed
          +1 mvnsite 0m 59s trunk passed
          +1 mvneclipse 0m 15s trunk passed
          +1 findbugs 2m 4s trunk passed
          +1 javadoc 1m 12s trunk passed with JDK v1.8.0_66
          +1 javadoc 1m 56s trunk passed with JDK v1.7.0_91
          +1 mvninstall 0m 57s the patch passed
          +1 compile 0m 45s the patch passed with JDK v1.8.0_66
          +1 javac 0m 45s the patch passed
          +1 compile 0m 50s the patch passed with JDK v1.7.0_91
          +1 javac 0m 50s the patch passed
          +1 checkstyle 0m 19s the patch passed
          +1 mvnsite 1m 0s the patch passed
          +1 mvneclipse 0m 15s the patch passed
          +1 whitespace 0m 0s Patch has no whitespace issues.
          +1 findbugs 2m 19s the patch passed
          +1 javadoc 1m 15s the patch passed with JDK v1.8.0_66
          +1 javadoc 1m 59s the patch passed with JDK v1.7.0_91
          -1 unit 69m 20s hadoop-hdfs in the patch failed with JDK v1.8.0_66.
          -1 unit 63m 18s hadoop-hdfs in the patch failed with JDK v1.7.0_91.
          -1 asflicense 0m 25s Patch generated 1 ASF License warnings.
          162m 43s



          Reason Tests
          JDK v1.8.0_66 Failed junit tests hadoop.hdfs.server.datanode.TestBlockScanner
            hadoop.hdfs.server.blockmanagement.TestReplicationPolicyConsiderLoad
            hadoop.hdfs.server.datanode.TestBlockReplacement
            hadoop.hdfs.web.TestWebHDFSXAttr
            hadoop.hdfs.TestDFSStripedOutputStreamWithFailure190
          JDK v1.7.0_91 Failed junit tests hadoop.hdfs.server.blockmanagement.TestReplicationPolicyConsiderLoad
            hadoop.hdfs.TestRollingUpgrade
            hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:0ca8df7
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12779321/HDFS-9466.002.patch
          JIRA Issue HDFS-9466
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 413c1ad9dff9 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / bb5df27
          Default Java 1.7.0_91
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_66 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_91
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/13987/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_66.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/13987/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_91.txt
          unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/13987/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_66.txt https://builds.apache.org/job/PreCommit-HDFS-Build/13987/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_91.txt
          JDK v1.7.0_91 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/13987/testReport/
          asflicense https://builds.apache.org/job/PreCommit-HDFS-Build/13987/artifact/patchprocess/patch-asflicense-problems.txt
          modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
          Max memory used 75MB
          Powered by Apache Yetus 0.2.0-SNAPSHOT http://yetus.apache.org
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/13987/console

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 0s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 1 new or modified test files. +1 mvninstall 8m 33s trunk passed +1 compile 0m 52s trunk passed with JDK v1.8.0_66 +1 compile 0m 50s trunk passed with JDK v1.7.0_91 +1 checkstyle 0m 16s trunk passed +1 mvnsite 0m 59s trunk passed +1 mvneclipse 0m 15s trunk passed +1 findbugs 2m 4s trunk passed +1 javadoc 1m 12s trunk passed with JDK v1.8.0_66 +1 javadoc 1m 56s trunk passed with JDK v1.7.0_91 +1 mvninstall 0m 57s the patch passed +1 compile 0m 45s the patch passed with JDK v1.8.0_66 +1 javac 0m 45s the patch passed +1 compile 0m 50s the patch passed with JDK v1.7.0_91 +1 javac 0m 50s the patch passed +1 checkstyle 0m 19s the patch passed +1 mvnsite 1m 0s the patch passed +1 mvneclipse 0m 15s the patch passed +1 whitespace 0m 0s Patch has no whitespace issues. +1 findbugs 2m 19s the patch passed +1 javadoc 1m 15s the patch passed with JDK v1.8.0_66 +1 javadoc 1m 59s the patch passed with JDK v1.7.0_91 -1 unit 69m 20s hadoop-hdfs in the patch failed with JDK v1.8.0_66. -1 unit 63m 18s hadoop-hdfs in the patch failed with JDK v1.7.0_91. -1 asflicense 0m 25s Patch generated 1 ASF License warnings. 162m 43s Reason Tests JDK v1.8.0_66 Failed junit tests hadoop.hdfs.server.datanode.TestBlockScanner   hadoop.hdfs.server.blockmanagement.TestReplicationPolicyConsiderLoad   hadoop.hdfs.server.datanode.TestBlockReplacement   hadoop.hdfs.web.TestWebHDFSXAttr   hadoop.hdfs.TestDFSStripedOutputStreamWithFailure190 JDK v1.7.0_91 Failed junit tests hadoop.hdfs.server.blockmanagement.TestReplicationPolicyConsiderLoad   hadoop.hdfs.TestRollingUpgrade   hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot Subsystem Report/Notes Docker Image:yetus/hadoop:0ca8df7 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12779321/HDFS-9466.002.patch JIRA Issue HDFS-9466 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 413c1ad9dff9 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / bb5df27 Default Java 1.7.0_91 Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_66 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_91 findbugs v3.0.0 unit https://builds.apache.org/job/PreCommit-HDFS-Build/13987/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_66.txt unit https://builds.apache.org/job/PreCommit-HDFS-Build/13987/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_91.txt unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/13987/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_66.txt https://builds.apache.org/job/PreCommit-HDFS-Build/13987/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_91.txt JDK v1.7.0_91 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/13987/testReport/ asflicense https://builds.apache.org/job/PreCommit-HDFS-Build/13987/artifact/patchprocess/patch-asflicense-problems.txt modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs Max memory used 75MB Powered by Apache Yetus 0.2.0-SNAPSHOT http://yetus.apache.org Console output https://builds.apache.org/job/PreCommit-HDFS-Build/13987/console This message was automatically generated.
          Hide
          xiaochen Xiao Chen added a comment -

          Thanks Wei-Chiu. +1 (non-binding).

          Show
          xiaochen Xiao Chen added a comment - Thanks Wei-Chiu. +1 (non-binding).
          Hide
          cmccabe Colin P. McCabe added a comment -

          Thanks for looking at this, Wei-Chiu Chuang. The idea behind this test is that the slots should never be removed by a timeout. That's why this line is there:

              conf.setLong(
                  HdfsClientConfigKeys.Read.ShortCircuit.STREAMS_CACHE_EXPIRY_MS_KEY,
                  1000000000L);
          

          Unless I'm missing something, adding a waitFor will not have any effect here, since the streams expiry timeout is longer than the test timeout. I'm not completely sure why we're getting 2 slots here instead of 1... we might need to see if we can get log files from a test run that failed.

          Show
          cmccabe Colin P. McCabe added a comment - Thanks for looking at this, Wei-Chiu Chuang . The idea behind this test is that the slots should never be removed by a timeout. That's why this line is there: conf.setLong( HdfsClientConfigKeys.Read.ShortCircuit.STREAMS_CACHE_EXPIRY_MS_KEY, 1000000000L); Unless I'm missing something, adding a waitFor will not have any effect here, since the streams expiry timeout is longer than the test timeout. I'm not completely sure why we're getting 2 slots here instead of 1... we might need to see if we can get log files from a test run that failed.
          Hide
          xiaochen Xiao Chen added a comment -

          Hi Colin P. McCabe,
          Thanks for looking into this patch.
          IIUC, the patch is trying to fix the race where the check happens before the slot is removed before the failed call (The following is injected to be failing). So the expiry should be fine, and the waitFor is just for testing purpose on the check. Wei-Chiu please correct me if I'm wrong.

              // The second read should fail, and we should only have 1 segment and 1 slot
              // left.
              fs.getClient().getConf().brfFailureInjector =
                  new TestCleanupFailureInjector();
              try {
                DFSTestUtil.readFileBuffer(fs, TEST_PATH2);
              } catch (Throwable t) {
                GenericTestUtils.assertExceptionContains("TCP reads were disabled for " +
                    "testing, but we failed to do a non-TCP read.", t);
              }
          
          Show
          xiaochen Xiao Chen added a comment - Hi Colin P. McCabe , Thanks for looking into this patch. IIUC, the patch is trying to fix the race where the check happens before the slot is removed before the failed call (The following is injected to be failing). So the expiry should be fine, and the waitFor is just for testing purpose on the check. Wei-Chiu please correct me if I'm wrong. // The second read should fail, and we should only have 1 segment and 1 slot // left. fs.getClient().getConf().brfFailureInjector = new TestCleanupFailureInjector(); try { DFSTestUtil.readFileBuffer(fs, TEST_PATH2); } catch (Throwable t) { GenericTestUtils.assertExceptionContains( "TCP reads were disabled for " + "testing, but we failed to do a non-TCP read." , t); }
          Hide
          jojochuang Wei-Chiu Chuang added a comment -

          Colin P. McCabe Xiao is right about what I thought. It does appear there is a race. From your perspective, do you think that's by design, or some unintended bugs in the code?

          Thanks for the reviews!

          Show
          jojochuang Wei-Chiu Chuang added a comment - Colin P. McCabe Xiao is right about what I thought. It does appear there is a race. From your perspective, do you think that's by design, or some unintended bugs in the code? Thanks for the reviews!
          Hide
          cmccabe Colin P. McCabe added a comment -

          Hmm. Can you be clearer on what the race condition is here?

          Show
          cmccabe Colin P. McCabe added a comment - Hmm. Can you be clearer on what the race condition is here?
          Hide
          jojochuang Wei-Chiu Chuang added a comment -

          Attaching the log of a failed run.

          Colin P. McCabe sorry for the long delay ...

          The hypothesis here is that after TestCleanupFailureInjector#injectRequestFileDescriptorsFailure throws an exception to inject failure, it takes some time to propagate that exception, and we occasionally test and verify if the number of slots is one, before it catches the exception and remove the slot. This race between removing the slot and checking the slot size failed the test.

          Adding a watiFor seem to remove the race.

          Show
          jojochuang Wei-Chiu Chuang added a comment - Attaching the log of a failed run. Colin P. McCabe sorry for the long delay ... The hypothesis here is that after TestCleanupFailureInjector#injectRequestFileDescriptorsFailure throws an exception to inject failure, it takes some time to propagate that exception, and we occasionally test and verify if the number of slots is one, before it catches the exception and remove the slot. This race between removing the slot and checking the slot size failed the test. Adding a watiFor seem to remove the race.
          Hide
          jojochuang Wei-Chiu Chuang added a comment -

          Hello Colin P. McCabe, would you mind to review it again? Thank you very much!

          Show
          jojochuang Wei-Chiu Chuang added a comment - Hello Colin P. McCabe , would you mind to review it again? Thank you very much!
          Hide
          cmccabe Colin P. McCabe added a comment - - edited

          Thanks for the explanation. It sounds like the race condition is that the ShortCircuitRegistry on the DN needs to be informed about the client's decision that short-circuit is not working for the block, and this RPC takes time to arrive. That background process races with completing the TCP read successfully and checking the number of slots in the unit test.

             public static interface Visitor {
          -    void accept(HashMap<ShmId, RegisteredShm> segments,
          +    boolean accept(HashMap<ShmId, RegisteredShm> segments,
                           HashMultimap<ExtendedBlockId, Slot> slots);
             }
          

          I don't think it makes sense to change the return type of the visitor. While you might find a boolean convenient, some other potential users of the interface might not find it useful. Instead, just have your closure modify a final MutableBoolean declared nearby.

          +    }, 100, 10000);
          

          It seems like we could lower the latency here (perhaps check every 10 ms) and lengthen the timeout. Since the test timeouts are generally 60s, I don't think it makes sense to make this timeout shorter than that.

          +1 once that's addressed. Thanks, Wei-Chiu Chuang. Sorry for the delay in reviews.

          Show
          cmccabe Colin P. McCabe added a comment - - edited Thanks for the explanation. It sounds like the race condition is that the ShortCircuitRegistry on the DN needs to be informed about the client's decision that short-circuit is not working for the block, and this RPC takes time to arrive. That background process races with completing the TCP read successfully and checking the number of slots in the unit test. public static interface Visitor { - void accept(HashMap<ShmId, RegisteredShm> segments, + boolean accept(HashMap<ShmId, RegisteredShm> segments, HashMultimap<ExtendedBlockId, Slot> slots); } I don't think it makes sense to change the return type of the visitor. While you might find a boolean convenient, some other potential users of the interface might not find it useful. Instead, just have your closure modify a final MutableBoolean declared nearby. + }, 100, 10000); It seems like we could lower the latency here (perhaps check every 10 ms) and lengthen the timeout. Since the test timeouts are generally 60s, I don't think it makes sense to make this timeout shorter than that. +1 once that's addressed. Thanks, Wei-Chiu Chuang . Sorry for the delay in reviews.
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-trunk-Commit #9891 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9891/)
          HDFS-9466. TestShortCircuitCache#testDataXceiverCleansUpSlotsOnFailure (cmccabe: rev c7921c9bddb79c9db5059b6c3f7a3a586a3cd95b)

          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/ShortCircuitRegistry.java
          • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/shortcircuit/TestShortCircuitCache.java
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in Hadoop-trunk-Commit #9891 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9891/ ) HDFS-9466 . TestShortCircuitCache#testDataXceiverCleansUpSlotsOnFailure (cmccabe: rev c7921c9bddb79c9db5059b6c3f7a3a586a3cd95b) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/ShortCircuitRegistry.java hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/shortcircuit/TestShortCircuitCache.java
          Hide
          cmccabe Colin P. McCabe added a comment -

          Committed to 2.8. Thanks, Wei-Chiu Chuang.

          Show
          cmccabe Colin P. McCabe added a comment - Committed to 2.8. Thanks, Wei-Chiu Chuang .

            People

            • Assignee:
              jojochuang Wei-Chiu Chuang
              Reporter:
              jojochuang Wei-Chiu Chuang
            • Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development