Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-797

TestHDFSCLI much slower after HDFS-265 merge

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.21.0, 0.22.0
    • Fix Version/s: 0.21.0
    • Component/s: hdfs-client
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      After the merge of HDFS-265 into trunk, TestHDFSCLI takes much longer to run (>10min, got tired of waiting). Changing dataQueue.wait(1000) to dataQueue.wait(10) on line 2708 speeds up the test back to its original performance. This isn't a fix, but displays the issue - somehow the thread isn't getting properly interrupted. In general this probably represents a performance regression for the exit of all HDFS writers (most noticeable in CLI tools)

      1. hdfs-797.txt
        0.7 kB
        Todd Lipcon

        Activity

        Hide
        Hudson added a comment -

        Integrated in Hdfs-Patch-h5.grid.sp2.yahoo.net #148 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/148/)

        Show
        Hudson added a comment - Integrated in Hdfs-Patch-h5.grid.sp2.yahoo.net #148 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/148/ )
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #170 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/170/)

        Show
        Hudson added a comment - Integrated in Hadoop-Hdfs-trunk #170 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/170/ )
        Hide
        Hudson added a comment -

        Integrated in Hdfs-Patch-h2.grid.sp2.yahoo.net #83 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/83/)

        Show
        Hudson added a comment - Integrated in Hdfs-Patch-h2.grid.sp2.yahoo.net #83 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/83/ )
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk-Commit #139 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/139/)
        . TestHDFSCLI much slower after HDFS-265 merge. Contributed by Todd Lipcon.

        Show
        Hudson added a comment - Integrated in Hadoop-Hdfs-trunk-Commit #139 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/139/ ) . TestHDFSCLI much slower after HDFS-265 merge. Contributed by Todd Lipcon.
        Hide
        Konstantin Boudnik added a comment -

        I've just committed this. Thank you Todd.

        Show
        Konstantin Boudnik added a comment - I've just committed this. Thank you Todd.
        Hide
        Hairong Kuang added a comment -

        +1. The patch looks good. Thanks for fixing this.

        Show
        Hairong Kuang added a comment - +1. The patch looks good. Thanks for fixing this.
        Hide
        Todd Lipcon added a comment -

        Anyone want to commit this? Running the HDFS tests is painstakingly slow right now because of this bug.

        Show
        Todd Lipcon added a comment - Anyone want to commit this? Running the HDFS tests is painstakingly slow right now because of this bug.
        Hide
        Eli Collins added a comment -

        Thanks Hairong. Maybe a good comment to add to that code.

        Show
        Eli Collins added a comment - Thanks Hairong. Maybe a good comment to add to that code.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12426805/hdfs-797.txt
        against trunk revision 886322.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/79/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/79/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/79/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/79/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12426805/hdfs-797.txt against trunk revision 886322. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/79/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/79/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/79/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/79/console This message is automatically generated.
        Hide
        Hairong Kuang added a comment -

        No need to call closeInternal(). DataStream calls closeInternal() before it exits.

        Show
        Hairong Kuang added a comment - No need to call closeInternal(). DataStream calls closeInternal() before it exits.
        Hide
        Eli Collins added a comment -

        Patch looks good, however it looks like close should call closeInternal since if we interrupt run the streams won't be closed. Note that the comment indicates the DFSOutputStream is closed but that's not the case. Hairong or Hong, what do you guys think?

        Show
        Eli Collins added a comment - Patch looks good, however it looks like close should call closeInternal since if we interrupt run the streams won't be closed. Note that the comment indicates the DFSOutputStream is closed but that's not the case. Hairong or Hong, what do you guys think?
        Hide
        Todd Lipcon added a comment -

        When closing DFSClient without force, it wasn't notifying dataQueue and also wasn't interrupting. This patch adds a dataqueue.notifyAll() even in the non-force case.

        No test included since this is timing related and only occurs on some systems. To test it manually I changed the dataQueue.wait to wait for 10 seconds and ran TestHDFSCLI with test output showing. Before the patch, it was pausing 10 seconds in between each test case. With the patch, it did not noticeably pause.

        This should be committed to 0.21 and 0.22

        Show
        Todd Lipcon added a comment - When closing DFSClient without force, it wasn't notifying dataQueue and also wasn't interrupting. This patch adds a dataqueue.notifyAll() even in the non-force case. No test included since this is timing related and only occurs on some systems. To test it manually I changed the dataQueue.wait to wait for 10 seconds and ran TestHDFSCLI with test output showing. Before the patch, it was pausing 10 seconds in between each test case. With the patch, it did not noticeably pause. This should be committed to 0.21 and 0.22
        Hide
        Konstantin Boudnik added a comment -

        In my case both machines have just 2 cores

        Show
        Konstantin Boudnik added a comment - In my case both machines have just 2 cores
        Hide
        Todd Lipcon added a comment -

        One thought: could be a race that is triggered more often on machines with fewer cores. The build machine I was testing on has 4 cores, whereas my laptop has 2.

        Show
        Todd Lipcon added a comment - One thought: could be a race that is triggered more often on machines with fewer cores. The build machine I was testing on has 4 cores, whereas my laptop has 2.
        Hide
        Konstantin Boudnik added a comment -

        Well, I've ran this test a few time in different configurations: Mac OS laptop (0.21 and 0.22) and my linux box (0.22 only). The results are strange indeed:

        • BSD laptop: the full run takes about 6 (sic!) minutes
        • linux box: 75 to 80 seconds
        Show
        Konstantin Boudnik added a comment - Well, I've ran this test a few time in different configurations: Mac OS laptop (0.21 and 0.22) and my linux box (0.22 only). The results are strange indeed: BSD laptop: the full run takes about 6 (sic!) minutes linux box: 75 to 80 seconds
        Hide
        Todd Lipcon added a comment -

        Strange. It reproducibly takes 12-15 minutes on my laptop (Linux, Sun JDK 1.6.0_16) with a clean checkout. On one of our build boxes (Linux, JDK 1.6.0_14) it takes around 2 minutes like it's supposed to. "Poor man's profiling" of jstacking every few seconds shows most of the traces like this:

        "main" prio=10 tid=0x0000000040d83800 nid=0x147b in Object.wait() [0x00007f738eeea000]
        java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)

        • waiting on <0x00007f7381623b30> (a org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer)
          at java.lang.Thread.join(Thread.java:1143)
        • locked <0x00007f7381623b30> (a org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer)
          at java.lang.Thread.join(Thread.java:1196)
          at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeThreads(DFSClient.java:3639)
          at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3683)
        • locked <0x00007f7381622e58> (a org.apache.hadoop.hdfs.DFSClient$DFSOutputStream)
          at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
          at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
          at org.apache.hadoop.fs.FsShell.touchz(FsShell.java:818)

        Can anyone else reproduce?

        Show
        Todd Lipcon added a comment - Strange. It reproducibly takes 12-15 minutes on my laptop (Linux, Sun JDK 1.6.0_16) with a clean checkout. On one of our build boxes (Linux, JDK 1.6.0_14) it takes around 2 minutes like it's supposed to. "Poor man's profiling" of jstacking every few seconds shows most of the traces like this: "main" prio=10 tid=0x0000000040d83800 nid=0x147b in Object.wait() [0x00007f738eeea000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) waiting on <0x00007f7381623b30> (a org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer) at java.lang.Thread.join(Thread.java:1143) locked <0x00007f7381623b30> (a org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer) at java.lang.Thread.join(Thread.java:1196) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeThreads(DFSClient.java:3639) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3683) locked <0x00007f7381622e58> (a org.apache.hadoop.hdfs.DFSClient$DFSOutputStream) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61) at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86) at org.apache.hadoop.fs.FsShell.touchz(FsShell.java:818) Can anyone else reproduce?
        Hide
        Konstantin Boudnik added a comment -

        Running the same test locally (BSD, Linux) and looking into latest successful Hudson build for the trunk shows that TestHDFSCLI normally completes in between 1 min 40 seconds to 2 min sharp.

        The reported behavior hasn't been observed before.

        Show
        Konstantin Boudnik added a comment - Running the same test locally (BSD, Linux) and looking into latest successful Hudson build for the trunk shows that TestHDFSCLI normally completes in between 1 min 40 seconds to 2 min sharp. The reported behavior hasn't been observed before.

          People

          • Assignee:
            Todd Lipcon
            Reporter:
            Todd Lipcon
          • Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development