  Hadoop Map/Reduce
  MAPREDUCE-6633

AM should retry map attempts if the reduce task encounters compression-related errors.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.7.2
    • Fix Version/s: 2.8.0, 2.7.3, 3.0.0-alpha1
    • Component/s: None
    • Labels: None
    • Target Version/s:
    • Hadoop Flags: Reviewed

      Description

      When a reduce task encounters compression-related errors, the AM doesn't retry the corresponding map task.
      In one case we encountered, here is the stack trace:

      2016-01-27 13:44:28,915 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#29
      	at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
      	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
      	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:422)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
      	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
      Caused by: java.lang.ArrayIndexOutOfBoundsException
      	at com.hadoop.compression.lzo.LzoDecompressor.setInput(LzoDecompressor.java:196)
      	at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:104)
      	at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
      	at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
      	at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.shuffle(InMemoryMapOutput.java:97)
      	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:537)
      	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:336)
      	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
      

      In this case, the node on which the map task ran had a bad drive.
      If the AM had rerun that map attempt on another node, the job would have succeeded.
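
      As a rough sketch of the desired handling (hypothetical names only, not the actual Fetcher/shuffle-scheduler code): a decompression failure while materializing a fetched map output, like the ArrayIndexOutOfBoundsException above, should be reported as a copy failure attributed to the source map, so the scheduler can eventually re-run that map on another node, instead of escaping as a fatal ShuffleError:

      import java.io.IOException;

      // Illustrative sketch only; none of these types are Hadoop's. It shows the
      // pattern this issue asks for: treat a decompression error during shuffle
      // as a fetch failure charged to the source map attempt.
      public class FetchRetrySketch {

          /** Stand-in for the component the fetcher reports failures to. */
          interface FailureReporter {
              // After enough failures for one map, the AM would re-run that map.
              void copyFailed(String mapId, String host);
          }

          /** Stand-in for the step that decompresses a fetched map output segment. */
          interface MapOutput {
              void shuffle(byte[] compressed) throws IOException;
          }

          static void copyMapOutput(String mapId, String host, byte[] data,
                                    MapOutput output, FailureReporter reporter) {
              try {
                  output.shuffle(data);
              } catch (IOException | RuntimeException e) {
                  // A corrupt segment (e.g. ArrayIndexOutOfBoundsException from a
                  // decompressor fed data off a bad drive) lands here. Reporting it
                  // as a copy failure lets the scheduler eventually re-run the map
                  // elsewhere, rather than failing the whole reduce attempt.
                  reporter.copyFailed(mapId, host);
              }
          }
      }

      With handling along these lines, the failure above would count against the map attempt on the bad-drive node rather than killing the reducer.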


              People

              • Assignee: Rushabh S Shah (shahrs87)
              • Reporter: Rushabh S Shah (shahrs87)
              • Votes: 0
              • Watchers: 8
