Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.2, 0.20.205.0
    • Fix Version/s: 0.20.205.0
    • Component/s: None
    • Labels:
      None
    • Release Note:
      I just committed this. Thanks Anupam!

      Description

      During the execution of a large job, the map tasks timeout:

      INFO mapred.JobClient: Task Id : attempt_201010290414_60974_m_000057_1, Status : FAILED
      Task attempt_201010290414_60974_m_000057_1 failed to report status for 609 seconds. Killing!
      

      The bug is in the fact that the mapper has already finished, and, according to the logs, the timeout occurs during the merge sort phase.
      The intermediate data generated by the map task is quite large. So I think this is the problem.

      The logs show that the merge-sort was running for 10 minutes when the task was killed.
      I think the mapred.Merger should call Reporter.progress() somewhere.

      1. MAPREDUCE-2187-trunk-v3.patch
        8 kB
        Anupam Seth
      2. MAPREDUCE-2187-trunk-v2.patch
        7 kB
        Anupam Seth
      3. MAPREDUCE-2187-trunk.patch
        6 kB
        Anupam Seth
      4. MAPREDUCE-2187-MR-279-v2.patch
        8 kB
        Anupam Seth
      5. MAPREDUCE-2187-branch-MR-279.patch
        7 kB
        Anupam Seth
      6. MAPREDUCE-2187-22.patch
        6 kB
        Anupam Seth
      7. MAPREDUCE-2187-20-security-v2.patch
        7 kB
        Anupam Seth
      8. MAPREDUCE-2187-20-security.patch
        6 kB
        Anupam Seth

        Issue Links

          Activity

          Hide
          Matt Foley added a comment -

          Closed upon release of 0.20.205.0

          Show
          Matt Foley added a comment - Closed upon release of 0.20.205.0
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #753 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/753/)
          MAPREDUCE-2187 - Missed adding the file

          acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1156962
          Files :

          • /hadoop/common/trunk/mapreduce/src/test/mapred/org/apache/hadoop/mapred/TestCombineOutputCollector.java
          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk #753 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/753/ ) MAPREDUCE-2187 - Missed adding the file acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1156962 Files : /hadoop/common/trunk/mapreduce/src/test/mapred/org/apache/hadoop/mapred/TestCombineOutputCollector.java
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #736 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/736/)
          MAPREDUCE-2187 - Missed adding the file

          acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1156962
          Files :

          • /hadoop/common/trunk/mapreduce/src/test/mapred/org/apache/hadoop/mapred/TestCombineOutputCollector.java
          Show
          Hudson added a comment - Integrated in Hadoop-Common-trunk-Commit #736 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/736/ ) MAPREDUCE-2187 - Missed adding the file acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1156962 Files : /hadoop/common/trunk/mapreduce/src/test/mapred/org/apache/hadoop/mapred/TestCombineOutputCollector.java
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #762 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/762/)
          MAPREDUCE-2187 - Missed adding the file

          acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1156962
          Files :

          • /hadoop/common/trunk/mapreduce/src/test/mapred/org/apache/hadoop/mapred/TestCombineOutputCollector.java
          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk-Commit #762 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/762/ ) MAPREDUCE-2187 - Missed adding the file acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1156962 Files : /hadoop/common/trunk/mapreduce/src/test/mapred/org/apache/hadoop/mapred/TestCombineOutputCollector.java
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #751 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/751/)
          MAPREDUCE-2187. Reporter sends progress during sort/merge. Contributed by Anupam Seth.

          acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1152964
          Files :

          • /hadoop/common/trunk/mapreduce/src/java/org/apache/hadoop/mapred/ReduceTask.java
          • /hadoop/common/trunk/mapreduce/src/java/mapred-default.xml
          • /hadoop/common/trunk/mapreduce/CHANGES.txt
          • /hadoop/common/trunk/mapreduce/src/java/org/apache/hadoop/mapred/MapTask.java
          • /hadoop/common/trunk/mapreduce/src/java/org/apache/hadoop/mapred/Task.java
          • /hadoop/common/trunk/mapreduce/src/java/org/apache/hadoop/mapreduce/MRJobConfig.java
          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk #751 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/751/ ) MAPREDUCE-2187 . Reporter sends progress during sort/merge. Contributed by Anupam Seth. acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1152964 Files : /hadoop/common/trunk/mapreduce/src/java/org/apache/hadoop/mapred/ReduceTask.java /hadoop/common/trunk/mapreduce/src/java/mapred-default.xml /hadoop/common/trunk/mapreduce/CHANGES.txt /hadoop/common/trunk/mapreduce/src/java/org/apache/hadoop/mapred/MapTask.java /hadoop/common/trunk/mapreduce/src/java/org/apache/hadoop/mapred/Task.java /hadoop/common/trunk/mapreduce/src/java/org/apache/hadoop/mapreduce/MRJobConfig.java
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #760 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/760/)
          MAPREDUCE-2187. Reporter sends progress during sort/merge. Contributed by Anupam Seth.

          acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1152964
          Files :

          • /hadoop/common/trunk/mapreduce/src/java/org/apache/hadoop/mapred/ReduceTask.java
          • /hadoop/common/trunk/mapreduce/src/java/mapred-default.xml
          • /hadoop/common/trunk/mapreduce/CHANGES.txt
          • /hadoop/common/trunk/mapreduce/src/java/org/apache/hadoop/mapred/MapTask.java
          • /hadoop/common/trunk/mapreduce/src/java/org/apache/hadoop/mapred/Task.java
          • /hadoop/common/trunk/mapreduce/src/java/org/apache/hadoop/mapreduce/MRJobConfig.java
          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk-Commit #760 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/760/ ) MAPREDUCE-2187 . Reporter sends progress during sort/merge. Contributed by Anupam Seth. acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1152964 Files : /hadoop/common/trunk/mapreduce/src/java/org/apache/hadoop/mapred/ReduceTask.java /hadoop/common/trunk/mapreduce/src/java/mapred-default.xml /hadoop/common/trunk/mapreduce/CHANGES.txt /hadoop/common/trunk/mapreduce/src/java/org/apache/hadoop/mapred/MapTask.java /hadoop/common/trunk/mapreduce/src/java/org/apache/hadoop/mapred/Task.java /hadoop/common/trunk/mapreduce/src/java/org/apache/hadoop/mapreduce/MRJobConfig.java
          Hide
          Anupam Seth added a comment -

          Thanks a lot Arun!

          I am attaching a revised patch for trunk.

          Show
          Anupam Seth added a comment - Thanks a lot Arun! I am attaching a revised patch for trunk.
          Hide
          Arun C Murthy added a comment -

          Anupam - please define the config key in MRJobConfig for trunk.

          I've committed the patch to 0.20.205. Thanks.

          Show
          Arun C Murthy added a comment - Anupam - please define the config key in MRJobConfig for trunk. I've committed the patch to 0.20.205. Thanks.
          Hide
          Anupam Seth added a comment -

          Thanks Arun, I am uploading the patches with constants replacing the direct values. I chose 10000 because it is the default value for mapreduce.task.merge.progress.records in Merger#writeFile which has similar logic. Please let me know if there is a better default value I should be using.

          Show
          Anupam Seth added a comment - Thanks Arun, I am uploading the patches with constants replacing the direct values. I chose 10000 because it is the default value for mapreduce.task.merge.progress.records in Merger#writeFile which has similar logic. Please let me know if there is a better default value I should be using.
          Hide
          Arun C Murthy added a comment -

          The patch looks straight-forward.

          +      progressBar = conf.getLong("mapred.combine.recordsBeforeProgress", 10000);
          

          I'd declare both the key and value as appropriate constants.

          Also, why did you pick 10000 as the default?

          Show
          Arun C Murthy added a comment - The patch looks straight-forward. + progressBar = conf.getLong( "mapred.combine.recordsBeforeProgress" , 10000); I'd declare both the key and value as appropriate constants. Also, why did you pick 10000 as the default?
          Hide
          Robert Joseph Evans added a comment -

          +1 on all of the patches. The changes look good to me. I am not wild about the name progressBar for how often to make progress, but it is not something to keep the patch out.

          Show
          Robert Joseph Evans added a comment - +1 on all of the patches. The changes look good to me. I am not wild about the name progressBar for how often to make progress, but it is not something to keep the patch out.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12483443/MAPREDUCE-2187-branch-MR-279.patch
          against trunk revision 1145889.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/465//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12483443/MAPREDUCE-2187-branch-MR-279.patch against trunk revision 1145889. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/465//console This message is automatically generated.
          Hide
          Anupam Seth added a comment -

          Running ant test-patch produces:

          [exec] -1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 3 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.
          [exec]
          [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
          [exec]
          [exec] +1 system test framework. The patch passed system test framework compile.
          [exec]

          The findbugs warning existed in trunk before my changes as well as evident in Jenkins (https://builds.apache.org/view/G-L/view/Hadoop/job/Hadoop-Mapreduce-trunk/703/artifact/trunk/findbugs/hadoop-findbugs-report.html).

          An identical patch also applies to branch-0.22, however, pre-existing compilation failures on it prevent me from validating it. Uploading it nevertheless for completeness.

          Show
          Anupam Seth added a comment - Running ant test-patch produces: [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. [exec] [exec] +1 system test framework. The patch passed system test framework compile. [exec] The findbugs warning existed in trunk before my changes as well as evident in Jenkins ( https://builds.apache.org/view/G-L/view/Hadoop/job/Hadoop-Mapreduce-trunk/703/artifact/trunk/findbugs/hadoop-findbugs-report.html ). An identical patch also applies to branch-0.22, however, pre-existing compilation failures on it prevent me from validating it. Uploading it nevertheless for completeness.
          Hide
          Anupam Seth added a comment -

          Thanks Todd. I am uploading a patch for trunk as well.

          Show
          Anupam Seth added a comment - Thanks Todd. I am uploading a patch for trunk as well.
          Hide
          Todd Lipcon added a comment -

          does this not affect trunk/22?

          Show
          Todd Lipcon added a comment - does this not affect trunk/22?
          Hide
          Anupam Seth added a comment -

          ant test-patch gives the following result:

          [exec] +1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 3 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          Show
          Anupam Seth added a comment - ant test-patch gives the following result: [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
          Hide
          Anupam Seth added a comment -

          As commented by Luke in MAPREDUCE-2177, the combiner code path doesn't report progress at some configured interval, so it can timeout when map output is large. A proposed fix is in CombineOutputCollector#collect to report progress a la Merger#writeFile.

          Show
          Anupam Seth added a comment - As commented by Luke in MAPREDUCE-2177 , the combiner code path doesn't report progress at some configured interval, so it can timeout when map output is large. A proposed fix is in CombineOutputCollector#collect to report progress a la Merger#writeFile.
          Hide
          Gianmarco De Francisci Morales added a comment -

          Well, I worked around the issue by setting mapred.task.timeout

          In any case I think it is an issue, as my tasks finish within the original timeout, so I should not be forced to modify it.
          What times out is the sort/shuffle phase. So some internal class is the culprit.

          Show
          Gianmarco De Francisci Morales added a comment - Well, I worked around the issue by setting mapred.task.timeout In any case I think it is an issue, as my tasks finish within the original timeout, so I should not be forced to modify it. What times out is the sort/shuffle phase. So some internal class is the culprit.
          Hide
          Lianhui Wang added a comment -

          i think that may be tt donot report anything to JT in the merge sort phase.SO JT think the tt failed and then kill the task.
          maybe you can increase tasktrackerExpiryInterval in the JobTracker class.

          Show
          Lianhui Wang added a comment - i think that may be tt donot report anything to JT in the merge sort phase.SO JT think the tt failed and then kill the task. maybe you can increase tasktrackerExpiryInterval in the JobTracker class.
          Hide
          Gianmarco De Francisci Morales added a comment -

          I forgot to mention. I am using Hadoop 0.20.201 from Yahoo's distribution.
          I don't know whether there are differences related to this behavior.

          Show
          Gianmarco De Francisci Morales added a comment - I forgot to mention. I am using Hadoop 0.20.201 from Yahoo's distribution. I don't know whether there are differences related to this behavior.
          Hide
          Gianmarco De Francisci Morales added a comment -

          I forgot I actually do have a combiner.
          In the combiner, I call reporter.progress() every 1000 values.

          The numbers are a bit high for the number of mappers/reducers I am using, but it was an exploratory job.

          Map input records 454,219
          Map output records 29,528,547,433
          Map output bytes 503,179,031,513
          Combine output records map=56,287,259,615 red=13,553,888,779
          Reduce input records 15,567,573,707
          Reduce output records 2,509,983
          Reduce shuffle bytes 337,876,374,027

          The first time I tried with 400 mappers and 100 reducers, and I had the timeouts.
          The job manages to end with 2000 mappers and 200 reducers.
          I tried with a larger input and I had the same timeouts.

          The size of the record shouldn't be that large.
          The key is always an int pair.
          The value is either an int-float pair (most of them, 29,528,011,793) or an array of long-double pairs (535,640 records, for a total size of 649,693,592 bytes). I am using MultipleInputs and GenericWritable to shuffle them together, can this be the culprit?

          Show
          Gianmarco De Francisci Morales added a comment - I forgot I actually do have a combiner. In the combiner, I call reporter.progress() every 1000 values. The numbers are a bit high for the number of mappers/reducers I am using, but it was an exploratory job. Map input records 454,219 Map output records 29,528,547,433 Map output bytes 503,179,031,513 Combine output records map=56,287,259,615 red=13,553,888,779 Reduce input records 15,567,573,707 Reduce output records 2,509,983 Reduce shuffle bytes 337,876,374,027 The first time I tried with 400 mappers and 100 reducers, and I had the timeouts. The job manages to end with 2000 mappers and 200 reducers. I tried with a larger input and I had the same timeouts. The size of the record shouldn't be that large. The key is always an int pair. The value is either an int-float pair (most of them, 29,528,011,793) or an array of long-double pairs (535,640 records, for a total size of 649,693,592 bytes). I am using MultipleInputs and GenericWritable to shuffle them together, can this be the culprit?
          Hide
          Luke Lu added a comment -

          Interesting, number of records, size of the output would be helpful. If the record size is large enough, the default 10000 records report interval could be too large.

          Show
          Luke Lu added a comment - Interesting, number of records, size of the output would be helpful. If the record size is large enough, the default 10000 records report interval could be too large.
          Hide
          Gianmarco De Francisci Morales added a comment -

          No combiner and 200 reducers when I tried.
          Raising the number of mappers and reducers makes the issue disappear.
          But it is not a solution, I think.

          I can give more details on the job if needed (number of records, size of the output, etc...)

          Show
          Gianmarco De Francisci Morales added a comment - No combiner and 200 reducers when I tried. Raising the number of mappers and reducers makes the issue disappear. But it is not a solution, I think. I can give more details on the job if needed (number of records, size of the output, etc...)
          Hide
          Luke Lu added a comment -

          Do you have combiner configured? How many reduces?

          The reason I'm asking is that the if you don't have combiner set, the progress is reported at least once per partition and per 10000 (mapred.merge.recordsBeforeProgress) records.

          Show
          Luke Lu added a comment - Do you have combiner configured? How many reduces? The reason I'm asking is that the if you don't have combiner set, the progress is reported at least once per partition and per 10000 (mapred.merge.recordsBeforeProgress) records.

            People

            • Assignee:
              Anupam Seth
              Reporter:
              Gianmarco De Francisci Morales
            • Votes:
              1 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development