Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.8.0, 3.0.0-alpha1
    • Fix Version/s: 2.8.0, 2.7.4, 3.0.0-alpha1
    • Component/s: test
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      It fails with java.util.concurrent.TimeoutException. Actually the problem here is that we expect 2 as NumOpenConnections metric but it is only 1. So the test waits 60 sec then fails.

      Please find maven output so the stack trace attached (TestJMXGetFails.log).

      1. HDFS-10270.001.patch
        0.9 kB
        Gergely Novák
      2. TestJMXGet.log
        95 kB
        Kihwal Lee
      3. TestJMXGetFails.log
        60 kB
        Andras Bokor

        Activity

        Hide
        kihwal Kihwal Lee added a comment -

        The test sets up two datanodes and creates a file. Then it wait for NumOpenConnections to reach 2. According to the log, registerDatanode() was called twice, one for each datanode.
        Both datanodes are receiving blocks and sending IBRs. But when it checks for NumOpenConnections, it stays at 1. The test times out and fails.

        It looks like an issue in the metrics.

        Show
        kihwal Kihwal Lee added a comment - The test sets up two datanodes and creates a file. Then it wait for NumOpenConnections to reach 2. According to the log, registerDatanode() was called twice, one for each datanode. Both datanodes are receiving blocks and sending IBRs. But when it checks for NumOpenConnections, it stays at 1. The test times out and fails. It looks like an issue in the metrics.
        Hide
        kihwal Kihwal Lee added a comment -

        Attaching the test log. This is from a 2.7 build but fails the same way.

        Show
        kihwal Kihwal Lee added a comment - Attaching the test log. This is from a 2.7 build but fails the same way.
        Hide
        boky01 Andras Bokor added a comment -

        I found that it fails when I run the two tests together so testDataNode() runs before testNameNode(). If I run only testNameNode or I change the order the tests pass.

        I debugged a bit the failed run. It opens two RPC connections but later on it closes one of them in Server class doRead(SelectionKey key) method:

          if (count < 0) {
            closeConnection(c);
            c = null;
          }
        

        Here the count is -1. We get -1 in Server:channelRead(ReadableByteChannel channel, ByteBuffer buffer) method:

        int count = (buffer.remaining() <= NIO_BUFFER_LIMIT) ?
                        channel.read(buffer) : channelIO(channel, null, buffer);
        

        In this condition the true branch will run.

        Can somebody help us on this?

        Show
        boky01 Andras Bokor added a comment - I found that it fails when I run the two tests together so testDataNode() runs before testNameNode(). If I run only testNameNode or I change the order the tests pass. I debugged a bit the failed run. It opens two RPC connections but later on it closes one of them in Server class doRead(SelectionKey key) method: if (count < 0) { closeConnection(c); c = null ; } Here the count is -1. We get -1 in Server:channelRead(ReadableByteChannel channel, ByteBuffer buffer) method: int count = (buffer.remaining() <= NIO_BUFFER_LIMIT) ? channel.read(buffer) : channelIO(channel, null , buffer); In this condition the true branch will run. Can somebody help us on this?
        Hide
        GergelyNovak Gergely Novák added a comment -

        We observed that the fail is caused by this assert:

        DFSTestUtil.waitForMetric(jmx, "NumOpenConnections", numDatanodes);
        

        This checks if the number of open connections equals to the number of data nodes. But the number of open connections has absolutely no dependency from the data nodes: it's either 0, 1 (DataNodeProtocol) or 2 (DataNodeProtocol and ClientProtocol). The test passes in those rare cases when the ClientProtocol hasn't timed out when the TestNameNode runs (this can only happen if the tests are run separately). If we increment the number of data nodes (to 3, or so) the test will always fail. Contrarily if we increase the client timeout (ipc.client.connection.maxidletime) the test will always pass.

        Our suggestion is to remove this assert.

        Show
        GergelyNovak Gergely Novák added a comment - We observed that the fail is caused by this assert: DFSTestUtil.waitForMetric(jmx, "NumOpenConnections" , numDatanodes); This checks if the number of open connections equals to the number of data nodes. But the number of open connections has absolutely no dependency from the data nodes: it's either 0, 1 (DataNodeProtocol) or 2 (DataNodeProtocol and ClientProtocol). The test passes in those rare cases when the ClientProtocol hasn't timed out when the TestNameNode runs (this can only happen if the tests are run separately). If we increment the number of data nodes (to 3, or so) the test will always fail. Contrarily if we increase the client timeout ( ipc.client.connection.maxidletime ) the test will always pass. Our suggestion is to remove this assert.
        Hide
        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 24s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
        +1 mvninstall 9m 25s trunk passed
        +1 compile 1m 7s trunk passed with JDK v1.8.0_77
        +1 compile 1m 2s trunk passed with JDK v1.7.0_95
        +1 checkstyle 0m 32s trunk passed
        +1 mvnsite 1m 17s trunk passed
        +1 mvneclipse 0m 20s trunk passed
        +1 findbugs 2m 42s trunk passed
        +1 javadoc 1m 52s trunk passed with JDK v1.8.0_77
        +1 javadoc 1m 56s trunk passed with JDK v1.7.0_95
        +1 mvninstall 0m 52s the patch passed
        +1 compile 0m 51s the patch passed with JDK v1.8.0_77
        +1 javac 0m 51s the patch passed
        +1 compile 0m 46s the patch passed with JDK v1.7.0_95
        +1 javac 0m 46s the patch passed
        +1 checkstyle 0m 19s the patch passed
        +1 mvnsite 0m 56s the patch passed
        +1 mvneclipse 0m 13s the patch passed
        +1 whitespace 0m 0s Patch has no whitespace issues.
        +1 findbugs 2m 16s the patch passed
        +1 javadoc 1m 17s the patch passed with JDK v1.8.0_77
        +1 javadoc 2m 1s the patch passed with JDK v1.7.0_95
        -1 unit 103m 25s hadoop-hdfs in the patch failed with JDK v1.8.0_77.
        -1 unit 95m 16s hadoop-hdfs in the patch failed with JDK v1.7.0_95.
        +1 asflicense 0m 22s Patch does not generate ASF License warnings.
        231m 44s



        Reason Tests
        JDK v1.8.0_77 Failed junit tests hadoop.hdfs.TestDFSUpgradeFromImage
          hadoop.hdfs.server.datanode.TestDataNodeMultipleRegistrations
          hadoop.hdfs.server.datanode.TestDataNodeMetrics
          hadoop.hdfs.security.TestDelegationTokenForProxyUser
          hadoop.hdfs.server.datanode.TestDataNodeMXBean
          hadoop.hdfs.server.namenode.ha.TestRequestHedgingProxyProvider
          hadoop.hdfs.server.datanode.TestDirectoryScanner
          hadoop.fs.contract.hdfs.TestHDFSContractSeek
        JDK v1.7.0_95 Failed junit tests hadoop.hdfs.TestDFSUpgradeFromImage
          hadoop.hdfs.server.namenode.ha.TestEditLogTailer
          hadoop.hdfs.server.namenode.TestSecureNameNode
          hadoop.hdfs.shortcircuit.TestShortCircuitCache
          hadoop.hdfs.server.datanode.TestDataNodeMultipleRegistrations
          hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure
          hadoop.hdfs.server.namenode.TestReconstructStripedBlocks
          hadoop.hdfs.server.namenode.ha.TestHAAppend
          hadoop.hdfs.server.namenode.TestFileTruncate
          hadoop.hdfs.TestEncryptionZones
          hadoop.fs.contract.hdfs.TestHDFSContractSeek



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:fbe3e86
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12797999/HDFS-10270.patch
        JIRA Issue HDFS-10270
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 682c387374e8 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 1ff27f9
        Default Java 1.7.0_95
        Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_77 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
        findbugs v3.0.0
        unit https://builds.apache.org/job/PreCommit-HDFS-Build/15132/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_77.txt
        unit https://builds.apache.org/job/PreCommit-HDFS-Build/15132/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
        unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/15132/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_77.txt https://builds.apache.org/job/PreCommit-HDFS-Build/15132/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
        JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/15132/testReport/
        modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
        Console output https://builds.apache.org/job/PreCommit-HDFS-Build/15132/console
        Powered by Apache Yetus 0.2.0 http://yetus.apache.org

        This message was automatically generated.

        Show
        hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 24s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 1 new or modified test files. +1 mvninstall 9m 25s trunk passed +1 compile 1m 7s trunk passed with JDK v1.8.0_77 +1 compile 1m 2s trunk passed with JDK v1.7.0_95 +1 checkstyle 0m 32s trunk passed +1 mvnsite 1m 17s trunk passed +1 mvneclipse 0m 20s trunk passed +1 findbugs 2m 42s trunk passed +1 javadoc 1m 52s trunk passed with JDK v1.8.0_77 +1 javadoc 1m 56s trunk passed with JDK v1.7.0_95 +1 mvninstall 0m 52s the patch passed +1 compile 0m 51s the patch passed with JDK v1.8.0_77 +1 javac 0m 51s the patch passed +1 compile 0m 46s the patch passed with JDK v1.7.0_95 +1 javac 0m 46s the patch passed +1 checkstyle 0m 19s the patch passed +1 mvnsite 0m 56s the patch passed +1 mvneclipse 0m 13s the patch passed +1 whitespace 0m 0s Patch has no whitespace issues. +1 findbugs 2m 16s the patch passed +1 javadoc 1m 17s the patch passed with JDK v1.8.0_77 +1 javadoc 2m 1s the patch passed with JDK v1.7.0_95 -1 unit 103m 25s hadoop-hdfs in the patch failed with JDK v1.8.0_77. -1 unit 95m 16s hadoop-hdfs in the patch failed with JDK v1.7.0_95. +1 asflicense 0m 22s Patch does not generate ASF License warnings. 231m 44s Reason Tests JDK v1.8.0_77 Failed junit tests hadoop.hdfs.TestDFSUpgradeFromImage   hadoop.hdfs.server.datanode.TestDataNodeMultipleRegistrations   hadoop.hdfs.server.datanode.TestDataNodeMetrics   hadoop.hdfs.security.TestDelegationTokenForProxyUser   hadoop.hdfs.server.datanode.TestDataNodeMXBean   hadoop.hdfs.server.namenode.ha.TestRequestHedgingProxyProvider   hadoop.hdfs.server.datanode.TestDirectoryScanner   hadoop.fs.contract.hdfs.TestHDFSContractSeek JDK v1.7.0_95 Failed junit tests hadoop.hdfs.TestDFSUpgradeFromImage   hadoop.hdfs.server.namenode.ha.TestEditLogTailer   hadoop.hdfs.server.namenode.TestSecureNameNode   hadoop.hdfs.shortcircuit.TestShortCircuitCache   hadoop.hdfs.server.datanode.TestDataNodeMultipleRegistrations   hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure   hadoop.hdfs.server.namenode.TestReconstructStripedBlocks   hadoop.hdfs.server.namenode.ha.TestHAAppend   hadoop.hdfs.server.namenode.TestFileTruncate   hadoop.hdfs.TestEncryptionZones   hadoop.fs.contract.hdfs.TestHDFSContractSeek Subsystem Report/Notes Docker Image:yetus/hadoop:fbe3e86 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12797999/HDFS-10270.patch JIRA Issue HDFS-10270 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 682c387374e8 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 1ff27f9 Default Java 1.7.0_95 Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_77 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95 findbugs v3.0.0 unit https://builds.apache.org/job/PreCommit-HDFS-Build/15132/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_77.txt unit https://builds.apache.org/job/PreCommit-HDFS-Build/15132/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/15132/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_77.txt https://builds.apache.org/job/PreCommit-HDFS-Build/15132/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/15132/testReport/ modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs Console output https://builds.apache.org/job/PreCommit-HDFS-Build/15132/console Powered by Apache Yetus 0.2.0 http://yetus.apache.org This message was automatically generated.
        Hide
        kihwal Kihwal Lee added a comment -

        It looks like a race was introduced after the jmx caching was "fixed". Certain metrics values in JMX is now using the cached value, which expires in 10 seconds by default. Since the ipc client idle timeout is also 10 seconds, The ipc connection may get closed before the jmx refresh. So the number of actual connections reaches to 2, but can drop to 1 just before a new jmx data is fetched. We could fix this by either lowering the jmx cache expiration or increasing ipc client idle timeout. Since it is a JMX test, it might be better to leave the jmx setting as default and change the ipc timeout to, say, 15 seconds.

        Show
        kihwal Kihwal Lee added a comment - It looks like a race was introduced after the jmx caching was "fixed". Certain metrics values in JMX is now using the cached value, which expires in 10 seconds by default. Since the ipc client idle timeout is also 10 seconds, The ipc connection may get closed before the jmx refresh. So the number of actual connections reaches to 2, but can drop to 1 just before a new jmx data is fetched. We could fix this by either lowering the jmx cache expiration or increasing ipc client idle timeout. Since it is a JMX test, it might be better to leave the jmx setting as default and change the ipc timeout to, say, 15 seconds.
        Hide
        GergelyNovak Gergely Novák added a comment -

        So Kihwal Lee, are you suggesting to have a wait/assert for checking if the number of connections is equal to 2 (datanode, client)? And to fulfil this to increase the ipc client idle timeout?
        If so, can you (or the original author of the test) explain why we need to test if the client is still connected?

        Show
        GergelyNovak Gergely Novák added a comment - So Kihwal Lee , are you suggesting to have a wait/assert for checking if the number of connections is equal to 2 (datanode, client)? And to fulfil this to increase the ipc client idle timeout? If so, can you (or the original author of the test) explain why we need to test if the client is still connected?
        Hide
        kihwal Kihwal Lee added a comment -

        It is not for testing the client connection being up. It is simply checking one of the metrics values reported in JMX. I don't know the reason why NumOpenConnections was chosen. The test had worked reliably until the jmx caching was fixed. The values used to be available right away, but now it takes about 10 seconds. So when it's working it adds about 10 more seconds of delay.

        But the original author also made a wrong assumption. The assumption was that the reason for the number of connections being 2 is due to having two datanodes. As you have correctly analyzed, this is not true in a MiniDFSCluster. Since the two datanodes are sharing the same JVM, a single connection was shared for the DatanodeProtocol. An additional connection was made for the client. In a real distributed cluster, it would have been 3 connections.

        I lean toward fixing the existing check than removing it. First it shouldn't check against the number of datanods, but simply 2. Regarding increasing ipc client idle timeout, it will make test run time longer, which is against what we have been trying to do. An alternative is to add a test resource to reduce the jmx update interval. We could add a hadoop-hdfs-project/hadoop-hdfs/src/test/resources/hadoop-metrics2.properties file with one line containing *.period=1. This will also reduce the run time of a number of test cases that query jmx to verify the result.

        What do you think?

        Show
        kihwal Kihwal Lee added a comment - It is not for testing the client connection being up. It is simply checking one of the metrics values reported in JMX. I don't know the reason why NumOpenConnections was chosen. The test had worked reliably until the jmx caching was fixed. The values used to be available right away, but now it takes about 10 seconds. So when it's working it adds about 10 more seconds of delay. But the original author also made a wrong assumption. The assumption was that the reason for the number of connections being 2 is due to having two datanodes. As you have correctly analyzed, this is not true in a MiniDFSCluster. Since the two datanodes are sharing the same JVM, a single connection was shared for the DatanodeProtocol . An additional connection was made for the client. In a real distributed cluster, it would have been 3 connections. I lean toward fixing the existing check than removing it. First it shouldn't check against the number of datanods, but simply 2. Regarding increasing ipc client idle timeout, it will make test run time longer, which is against what we have been trying to do. An alternative is to add a test resource to reduce the jmx update interval. We could add a hadoop-hdfs-project/hadoop-hdfs/src/test/resources/hadoop-metrics2.properties file with one line containing *.period=1 . This will also reduce the run time of a number of test cases that query jmx to verify the result. What do you think?
        Hide
        kihwal Kihwal Lee added a comment -

        If adding a new test resource, don't forget to add it to the exclude list for apache-rat-plugin.

        Show
        kihwal Kihwal Lee added a comment - If adding a new test resource, don't forget to add it to the exclude list for apache-rat-plugin.
        Hide
        boky01 Andras Bokor added a comment -

        Hi Kihwal Lee,

        We do not see the value to check NumOpenConnections since it can cause intermittent failures and with the other checks the JMX metrics are tested enough. Our suggestion is to remove NumOpenConnections check. Do you see any disadvantages removing NumOpenConnections check?

        Show
        boky01 Andras Bokor added a comment - Hi Kihwal Lee , We do not see the value to check NumOpenConnections since it can cause intermittent failures and with the other checks the JMX metrics are tested enough. Our suggestion is to remove NumOpenConnections check. Do you see any disadvantages removing NumOpenConnections check?
        Hide
        boky01 Andras Bokor added a comment -

        In addition playing with timeouts makes a JUnit test more complicated than a JUnit test should be. What is you opinion about these?

        Show
        boky01 Andras Bokor added a comment - In addition playing with timeouts makes a JUnit test more complicated than a JUnit test should be. What is you opinion about these?
        Hide
        kihwal Kihwal Lee added a comment -

        In addition playing with timeouts makes a JUnit test more complicated than a JUnit test should be. What is you opinion about these?

        Totally agree, but in reality hadoop unit tests are riddled with unreliable timeouts. Many of them are not really unit tests. They exist because of many reasons including Inherent difficulties in unit testing distributed functionalities, test unfriendly designs and simply bad test designs. Many are aware of the issue.

        Show
        kihwal Kihwal Lee added a comment - In addition playing with timeouts makes a JUnit test more complicated than a JUnit test should be. What is you opinion about these? Totally agree, but in reality hadoop unit tests are riddled with unreliable timeouts. Many of them are not really unit tests. They exist because of many reasons including Inherent difficulties in unit testing distributed functionalities, test unfriendly designs and simply bad test designs. Many are aware of the issue.
        Hide
        kihwal Kihwal Lee added a comment -

        Do you see any disadvantages removing NumOpenConnections check?

        I was trying to keep whatever the original author did, but in term of the value of the check, I think it is fine to remove. So +1 for the patch.

        Show
        kihwal Kihwal Lee added a comment - Do you see any disadvantages removing NumOpenConnections check? I was trying to keep whatever the original author did, but in term of the value of the check, I think it is fine to remove. So +1 for the patch.
        Hide
        kihwal Kihwal Lee added a comment -

        Committed it to trunk, branch-2 and branch-2.8. Thanks for fixing this, Gergely.

        Show
        kihwal Kihwal Lee added a comment - Committed it to trunk, branch-2 and branch-2.8. Thanks for fixing this, Gergely.
        Hide
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-trunk-Commit #9603 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9603/)
        HDFS-10270. TestJMXGet:testNameNode() fails. Contributed by Gergely (kihwal: rev d2f3bbc29046435904ad9418073795439c71b441)

        • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/tools/TestJMXGet.java
        Show
        hudson Hudson added a comment - FAILURE: Integrated in Hadoop-trunk-Commit #9603 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9603/ ) HDFS-10270 . TestJMXGet:testNameNode() fails. Contributed by Gergely (kihwal: rev d2f3bbc29046435904ad9418073795439c71b441) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/tools/TestJMXGet.java
        Hide
        zhz Zhe Zhang added a comment -

        I just committed the patch to branch-2.7 as part of an effort of backporting HDFS-7933. The patch could only possibly affect TestJMXGet and I verified locally that it passes fine with the patch; the patch actually only makes the test easier to pass.

        Show
        zhz Zhe Zhang added a comment - I just committed the patch to branch-2.7 as part of an effort of backporting HDFS-7933 . The patch could only possibly affect TestJMXGet and I verified locally that it passes fine with the patch; the patch actually only makes the test easier to pass.

          People

          • Assignee:
            GergelyNovak Gergely Novák
            Reporter:
            boky01 Andras Bokor
          • Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development