Details
Type: Bug
Status: Reopened
Priority: Minor
Resolution: Unresolved
Affects Version: 1.1.2
Environment: Tried in Apache Hadoop 1.1.1, CDH4, and Amazon EMR; same result.
Description
# Make a long one-line text file; the issue seems to show up only with long lines.
$ cat abook.txt | base64 -w 0 > onelinetext.b64   # 200KB+ long
$ hadoop fs -put onelinetext.b64 /input/onelinetext.b64
$ hadoop jar hadoop-streaming.jar \
-input /input/onelinetext.b64 \
-output /output \
-inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
-mapper wc
Number of map tasks: 1, and the output has one line:
Line 1: 1 2 202699
which makes sense because one line per mapper is intended.
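For reference, the one-line-per-split behavior can be simulated locally without a cluster (a rough sketch using `split` and `wc`, not the actual Hadoop machinery):

```shell
# Rough local simulation of NLineInputFormat with one line per split:
# each line of input becomes its own "split", and the mapper (wc)
# runs once per split.
printf 'first line\nsecond line\n' > input.txt
split -l 1 input.txt split_            # one file per line, like one split each
for f in split_*; do wc < "$f"; done   # one wc invocation per "mapper"
```

With the uncompressed one-line input, this simulation and the streaming job agree: one split, one mapper, one output line.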
Then, using compression with NLineInputFormat:
$ gzip onelinetext.b64
$ hadoop fs -put onelinetext.b64.gz /input/onelinetext.b64.gz
$ hadoop jar hadoop-streaming.jar \
-Dmapred.input.compress=true \
-Dmapred.input.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-input /input/onelinetext.b64.gz \
-output /output \
-inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
-mapper wc
I expected the same result as above, since decompression should happen before the one-line text is processed (i.e. before wc); instead, I am getting:
Number of map tasks: 397 (or some other large number, depending on the environment), and the output has 397 lines:
Lines 1-396: 0 0 0
Line 397: 1 2 202699
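One possible explanation (an assumption on my part, not verified against the NLineInputFormat source) is that the split computation counts raw newline bytes in the compressed file rather than logical lines in the decompressed text. A compressed stream contains many incidental 0x0a bytes, which would yield many splits. A local sketch of that hypothesis:

```shell
# Hypothesis sketch (assumption, not verified against the Hadoop source):
# if splits were computed from raw newline bytes in the .gz stream, the
# map task count would track 0x0a bytes in the compressed data, not
# logical lines in the decompressed text.
head -c 200000 /dev/urandom | base64 -w 0 > onelinetext.b64   # one long line
gzip -c onelinetext.b64 > onelinetext.b64.gz
echo "logical lines: $(wc -l < onelinetext.b64)"
echo "raw newline bytes in gzip stream: $(tr -cd '\n' < onelinetext.b64.gz | wc -c)"
```

The second count is large and varies with the data, which would match the map task count varying across environments.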
Any idea why mapred.map.tasks >> 1? Is the splitting incorrect? I purposely chose gzip because I believe it is NOT splittable. I got similar results with the bzip2 and lzop codecs.
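To illustrate why gzip is treated as non-splittable (a minimal local sketch): a gzip stream can only be decoded from its start, so a reader dropped into the middle of the file cannot decompress anything, and a correct input format must hand the whole .gz file to a single mapper.

```shell
# A gzip stream must be decoded from the beginning; decompression from
# an arbitrary offset fails, so the entire .gz file has to go to one
# mapper rather than being split.
printf 'hello world\n' | gzip -c > whole.gz
tail -c +11 whole.gz > middle.gz     # drop the 10-byte gzip header
if gzip -dc middle.gz > /dev/null 2>&1; then
  echo "decoded from the middle (splittable?)"
else
  echo "cannot decode from the middle (not splittable)"
fi
```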