Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.20.2
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Resolves the test failure by basing the test's checks on spill counters rather than on bytes read/written. It also introduces a new configuration parameter, "mapred.job.shuffle.input.buffer.percent", which gives finer-grained control over the memory limit used during shuffle.

      Description

      Testcase: testReduceFromMem took 23.625 sec
      	FAILED
      Non-zero read from local: 83
      junit.framework.AssertionFailedError: Non-zero read from local: 83
      	at org.apache.hadoop.mapred.TestReduceFetch.testReduceFromMem(TestReduceFetch.java:289)
      	at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
      	at junit.extensions.TestSetup$1.protect(TestSetup.java:23)
      	at junit.extensions.TestSetup.run(TestSetup.java:27)
      

      Ran TestReduceFetch a few times on a clean trunk. It failed consistently.

      1. testReduceFromMem.txt
        287 kB
        Tsz Wo Nicholas Sze
      2. TEST-org.apache.hadoop.mapred.TestReduceFetch.txt
        302 kB
        Devaraj Das
      3. TEST-org.apache.hadoop.mapred.TestReduceFetch.txt
        288 kB
        Tsz Wo Nicholas Sze
      4. M433-v20.another.patch
        5 kB
        Devaraj Das
      5. M433-1y20.patch
        5 kB
        Chris Douglas
      6. M433-1v20.patch
        5 kB
        Chris Douglas
      7. M433-0v20.patch
        5 kB
        Chris Douglas
      8. FAILING-PARTIALMEM-TEST-org.apache.hadoop.mapred.TestReduceFetch.txt
        288 kB
        Devaraj Das
      9. 6029-2.patch
        6 kB
        Chris Douglas
      10. 6029-1.patch
        5 kB
        Chris Douglas
      11. 6029-0.patch
        5 kB
        Chris Douglas

          Activity

          Devaraj Das added a comment -

          Jothi and I came across another TestReduceFetch failure.

          Testcase: testReduceFromDisk took 78.436 sec
          Testcase: testReduceFromPartialMem took 60.701 sec
                  FAILED
          Expected at least 1MB fewer bytes read from local (21159650) than written to HDFS (21036680)
          junit.framework.AssertionFailedError: Expected at least 1MB fewer bytes read from local (21159650) than written to HDFS (21036680)
                  at org.apache.hadoop.mapred.TestReduceFetch.testReduceFromPartialMem(TestReduceFetch.java:276)
                  at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
                  at junit.extensions.TestSetup$1.protect(TestSetup.java:23)
                  at junit.extensions.TestSetup.run(TestSetup.java:27)
          
          Testcase: testReduceFromMem took 52.097 sec
          

          The failure above looks like a memory issue. ReduceTask.ReduceCopier.ShuffleRamManager reserves memory for the in-memory shuffle based on Runtime.getRuntime().maxMemory(), and the value that call returns appears to be machine-dependent. In the run that failed with the trace above, Runtime.maxMemory() returned a smaller value than in the run where the test passes. With the smaller value, shuffled files start hitting the disk, and the testcase fails because it doesn't expect that many files to hit the disk. I am attaching two logs: one from a successful run (all tests passed) and one from the failed testReduceFromPartialMem run. In both logs, job_0002 is the job for the testReduceFromPartialMem test.

          Nicholas, could you please upload the logs of the test failure you saw. Thanks!
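To illustrate the reasoning above, here is a minimal sketch of how a limit derived from Runtime.maxMemory() decides whether a map output stays in memory. It is not the ShuffleRamManager code: the class name, method names, and the 0.70 fraction printed in main are assumptions made for illustration.

```java
public class ShuffleLimitSketch {

    /** Bytes usable for in-memory shuffle: a fraction of the heap size the JVM reports. */
    static long inMemoryLimit(long maxMemoryBytes, float bufferPercent) {
        // Capped at Integer.MAX_VALUE, as a byte[]-backed buffer would require.
        return (long) Math.min(maxMemoryBytes * (double) bufferPercent, Integer.MAX_VALUE);
    }

    /** True if a further map output of requestBytes can still be held in memory. */
    static boolean fitsInMemory(long usedBytes, long requestBytes, long limitBytes) {
        return usedBytes + requestBytes <= limitBytes;
    }

    public static void main(String[] args) {
        // The reported value may differ from -Xmx, which is what makes the test fragile.
        long reported = Runtime.getRuntime().maxMemory();
        System.out.println("maxMemory()          = " + reported);
        System.out.println("limit at 0.70 (assumed fraction) = " + inMemoryLimit(reported, 0.70f));
    }
}
```

With the same fraction, a JVM that reports a smaller maximum yields a smaller limit, so a map output that fits on one machine is written to disk on another.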

          Chris Douglas added a comment -

          > In ReduceTask.ReduceCopier.ShuffleRamManager, a memory reservation is done for in-memory shuffle, and that uses Runtime.getRuntime().maxMemory(). The return value of this seems to be machine-dependent.

          This should be rigged by TestReduceFetch in mapred.child.java.opts, and match -Xmx128m.

          testReduceFromPartialMem is an awkward test to write. Its intent is to configure the reduce so that, presented with a set of crafted map outputs, it will make a particular guess about how to optimize its I/O. If we can't rig the total memory because -Xmx has a machine-dependent interpretation, then writing such a test will be a real pain with our current set of configuration options. We could add a parameter for the memory reservation that defaults to querying the runtime; that would let us be certain of our memory limit, but not burden the user with setting it. I'm really surprised that setting -Xmx doesn't work, though...

          The failure for testReduceFromMem suggests this should use the counters added in HADOOP-2774 rather than the FileSystem counters to validate the result.
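The parameter idea floated above (an explicit memory reservation that falls back to the runtime when unset) could look roughly like this. This is a sketch only: java.util.Properties stands in for the job configuration, and the key "example.shuffle.memory.limit.bytes" is hypothetical, not a Hadoop property.

```java
import java.util.Properties;

public class MemoryLimitConfigSketch {

    // Hypothetical key, invented for this example.
    static final String LIMIT_KEY = "example.shuffle.memory.limit.bytes";

    /** Returns the configured limit in bytes, or the JVM's reported maximum if none is set. */
    static long memoryLimit(Properties conf) {
        String value = conf.getProperty(LIMIT_KEY);
        if (value == null) {
            // Default: users who never set the key see the runtime-derived behaviour.
            return Runtime.getRuntime().maxMemory();
        }
        return Long.parseLong(value.trim());
    }

    public static void main(String[] args) {
        Properties testConf = new Properties();
        // A test can pin the limit and stop depending on the machine.
        testConf.setProperty(LIMIT_KEY, String.valueOf(128L << 20));
        System.out.println("pinned limit:  " + memoryLimit(testConf));
        System.out.println("default limit: " + memoryLimit(new Properties()));
    }
}
```

A test would set the key to a known value; production jobs would leave it unset.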

          Tsz Wo Nicholas Sze added a comment -

          > Nicholas, could you please upload the logs of the test failure you saw. Thanks!

          Here you go: TEST-org.apache.hadoop.mapred.TestReduceFetch.txt

          Devaraj Das added a comment -

          Nicholas, can you please upload the logs of the test where you see testReduceFromMem failing (as in the jira description)?
          Chris, we did read somewhere that maxMemory() is not necessarily a function of -Xmx, and there are some quirks there. Let me try to get that link.
          The counters added by HADOOP-2774 look like a better candidate for the TestReduceFetch use case.

          Jothi Padmanabhan added a comment -

          Here are a couple of links that explain why the -Xmx value and maxMemory() can differ.

          http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4391499
          http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4686462

          A comment from the second link:

          " ... freeMemory() and totalMemory()
          report the amount of memory inside the jvm while
          maxMemory() reports on the amount of memory outside the
          jvm, i.e. the amount the whole jvm uses as seen from the OS."
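One way to observe the discrepancy on a given machine is to print the -Xmx value the JVM was started with next to what Runtime reports. A small sketch follows; the class name and the parseXmx helper are made up for this example, and only the k/m/g suffixes are handled.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class MaxMemoryCheck {

    /** Parses a flag such as "-Xmx128m" into bytes; returns -1 if arg is not an -Xmx flag. */
    static long parseXmx(String arg) {
        if (!arg.startsWith("-Xmx") || arg.length() <= 4) {
            return -1;
        }
        String value = arg.substring(4).toLowerCase();
        long multiplier = 1;
        switch (value.charAt(value.length() - 1)) {
            case 'k': multiplier = 1L << 10; break;
            case 'm': multiplier = 1L << 20; break;
            case 'g': multiplier = 1L << 30; break;
            default: break; // plain byte count, no suffix
        }
        String digits = multiplier == 1 ? value : value.substring(0, value.length() - 1);
        return Long.parseLong(digits) * multiplier;
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        long requested = -1;
        for (String arg : jvmArgs) {
            long parsed = parseXmx(arg);
            if (parsed >= 0) {
                requested = parsed; // the last -Xmx on the command line wins
            }
        }
        Runtime rt = Runtime.getRuntime();
        System.out.println("-Xmx (bytes, -1 if unset): " + requested);
        System.out.println("maxMemory():   " + rt.maxMemory());
        System.out.println("totalMemory(): " + rt.totalMemory());
        System.out.println("freeMemory():  " + rt.freeMemory());
    }
}
```

Running it with, for example, java -Xmx128m MaxMemoryCheck on two different machines shows whether maxMemory() tracks the flag there.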

          Jothi Padmanabhan added a comment -

          > Nicholas, can you please upload the logs of the test where you see testReduceFromMem failing (as on the jira description).

          Just to clarify: could you upload the logs where testReduceFromMem failed? We are able to get testReduceFromPartialMem to fail consistently on some machines, but testReduceFromMem is passing.

          Tsz Wo Nicholas Sze added a comment -

          > Just to clarify - could you upload the logs where testReduceFromMem failed. We are able to get testReduceFromPartialMem fail consistently on some machines, but testReduceFromMem is passing.
          Oops, I missed this. Here is the log: testReduceFromMem.txt

          Devaraj Das added a comment -

          Although the fix looks right from the testcase point of view, IMO we should still investigate why and where these 83 bytes are read from local disk (as logged in Nicholas's testReduceFromMem failure).

          Chris Douglas added a comment -

          The 83 byte read was introduced by HADOOP-4374. It added a rename in writeToIndexFile, but File::renameTo fails on Windows, so it's effected using IOUtils::copyBytes.
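The mechanism described above, a rename that falls back to a byte copy and therefore reads the file from local disk, can be sketched as follows. This is an illustration using plain java.io, not the writeToIndexFile or IOUtils code; the class and method names are invented.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class RenameFallbackSketch {

    /**
     * Moves src to dst. Returns the number of bytes read from src:
     * 0 when the rename succeeds, the file length when a copy was needed.
     */
    static long moveCountingReads(File src, File dst) throws IOException {
        if (src.renameTo(dst)) {
            return 0L; // pure metadata operation, no local read
        }
        long copied = 0;
        try (InputStream in = new FileInputStream(src);
             OutputStream out = new FileOutputStream(dst)) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                copied += n; // these bytes would show up in a local-read counter
            }
        }
        if (!src.delete()) {
            throw new IOException("could not delete " + src);
        }
        return copied;
    }
}
```

A test that asserts zero bytes read from local disk passes when the rename works and fails by exactly the file's size when the fallback copy runs.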

          Chris Douglas added a comment -

          Update paths for project split

          Devaraj Das added a comment -

          Looks good. One comment:
          Since we have seen issues with the Runtime APIs for getting max memory, etc. (see Jothi's earlier comment), I am wondering whether we should just set explicit values anyway, based on which test we are running (a smaller value for the partial-mem case and a bigger value for the in-mem case). We should investigate (in a separate jira) whether the anomalies in the Runtime APIs are something we should worry about and fix.

          Chris Douglas added a comment -

          Changed patch to compare spilled records against map output records, rather than calculating the latter in the assert. Added comments.

               [exec] +1 overall.  
               [exec] 
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec] 
               [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
               [exec] 
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec] 
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec] 
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
               [exec] 
               [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
          

          All unit tests passed on my machine, save TestJobInProgressListener and TestJobTrackerRestart, which also failed when I tried them on trunk.
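For readers unfamiliar with the approach, comparing spilled records against map output records might be checked along these lines. The thresholds below are assumptions for illustration, not the assertions in the patch; they assume each map output record is spilled exactly once on the map side, so any excess is attributable to the reduce.

```java
public class SpillCounterChecks {

    /** Reduce ran entirely from memory: nothing was spilled beyond the map-side spill. */
    static boolean reducedFromMemory(long mapOutputRecords, long spilledRecords) {
        return spilledRecords == mapOutputRecords;
    }

    /** Reduce kept some segments in memory: fewer than all records were spilled again. */
    static boolean reducedFromPartialMemory(long mapOutputRecords, long spilledRecords) {
        return spilledRecords > mapOutputRecords
            && spilledRecords < 2 * mapOutputRecords;
    }

    /** Reduce ran from disk: every record was spilled at least once more. */
    static boolean reducedFromDisk(long mapOutputRecords, long spilledRecords) {
        return spilledRecords >= 2 * mapOutputRecords;
    }

    public static void main(String[] args) {
        long out = 1000; // made-up counter values
        System.out.println(reducedFromMemory(out, 1000));
        System.out.println(reducedFromPartialMemory(out, 1500));
        System.out.println(reducedFromDisk(out, 2000));
    }
}
```

Unlike byte counters, record counts are not disturbed by small incidental file reads such as the 83-byte one discussed earlier.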

          Chris Douglas added a comment -

          (the preceding addresses Devaraj's comments)

          Chris Douglas added a comment -

          I committed this.

          (tried TestJobInProgressListener again, and it passed)

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #15 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/15/)

          Chris Douglas added a comment -

          Per MAPREDUCE-1172, this needs to be ported to 0.20

          Chris Douglas added a comment -

          Patch for 0.20

          Chris Douglas added a comment -
               [exec] +1 overall.  
               [exec] 
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec] 
               [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
               [exec] 
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec] 
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec] 
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
               [exec] 
               [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
          
          Devaraj Das added a comment -

          Patch for Yahoo 20 branch (not to be committed)

          Chris Douglas added a comment -

          Committed to 0.20

          Chris Douglas added a comment -

          Bad merge. Fixed.

          Chris Douglas added a comment -

          I committed this.

          Chris Douglas added a comment -

          Patch for Y! 0.20


            People

            • Assignee:
              Chris Douglas
              Reporter:
              Tsz Wo Nicholas Sze
            • Votes:
              0
              Watchers:
              4
