  1. Hadoop Map/Reduce
  2. MAPREDUCE-5119

Splitting issue when using NLineInputFormat with compression


Details

    • Type: Bug
    • Status: Reopened
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.1.2
    • None
    • None
    • None
    • Environment: Tried on Apache Hadoop 1.1.1, CDH4, and Amazon EMR; same result.

    Description

      # Make one long line of text. Only a long line seems to trigger the issue.
      $ cat abook.txt | base64 -w 0 >onelinetext.b64 # 200 KB+ long
      $ hadoop fs -put onelinetext.b64 /input/onelinetext.b64
      $ hadoop jar hadoop-streaming.jar \
      -input /input/onelinetext.b64 \
      -output /output \
      -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
      -mapper wc
      Number of tasks: 1, and the output has one line:
      Line 1: 1 2 202699
      This makes sense, because one line per mapper is intended.

      Then, use compression with NLineInputFormat:
      $ gzip onelinetext.b64
      $ hadoop fs -put onelinetext.b64.gz /input/onelinetext.b64.gz
      $ hadoop jar hadoop-streaming.jar \
      -Dmapred.input.compress=true \
      -Dmapred.input.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
      -input /input/onelinetext.b64.gz \
      -output /output \
      -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
      -mapper wc
      I expected the same result as above, because decompression should occur before the one-line text is processed (i.e. by wc). However, I am getting:

      Number of tasks: 397 (or another large number, depending on the environment), and the output has 397 lines:
      Lines 1-396: 0 0 0
      Line 397: 1 2 202699

      Any idea why there are so many map tasks (mapred.map.tasks >> 1)? Is the splitting incorrect? I purposely chose gzip because I believe it is NOT splittable. I got similar results when using the bzip2 and lzop codecs.
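
One possible explanation, which this report does not confirm, is that the splits are computed by counting line breaks in the raw compressed bytes rather than in the decompressed text. The sketch below only illustrates that idea locally, without Hadoop: it counts newline bytes in a one-line base64 file before and after gzip. It assumes GNU coreutils (base64 -w, as used above) and uses random data in place of abook.txt; the variable names are made up for illustration.

```shell
#!/bin/sh
# Local illustration only: this is not Hadoop's code path, just a byte count.
tmp=$(mktemp -d)

# Build a single long line of base64 text (random data stands in for abook.txt).
head -c 150000 /dev/urandom | base64 -w 0 > "$tmp/one.b64"
printf '\n' >> "$tmp/one.b64"

# wc -l counts newline bytes, so the plain file should report exactly 1.
plain_lines=$(wc -l < "$tmp/one.b64")

# The gzip stream is binary; byte value 0x0A can occur anywhere in it by chance.
gzip_newlines=$(gzip -c "$tmp/one.b64" | wc -l)

echo "newline bytes in plain file: $plain_lines"
echo "newline bytes in gzip file:  $gzip_newlines"

rm -rf "$tmp"
```

If the second count is in the hundreds, a reader that treats the compressed bytes as text would see hundreds of "lines". The same comparison could be made against the uploaded file with:
$ hadoop fs -cat /input/onelinetext.b64.gz | wc -l
and the result compared with the number of map tasks reported above.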

          People

            Assignee: Unassigned
            Reporter: Qiming He (openresearch)
            Votes: 0
            Watchers: 3