Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 0.14.0
- Fix Version/s: None
- Component/s: None
- Labels: None
Description
When running the wordcount example on a mix of plain-text, gzip-compressed, and lzo-compressed input files, the lzo-compressed files are not recognized as compressed and are read as plain text.
With an input dir of
/user/hadoopqa/input/part-001.txt
/user/hadoopqa/input/part-002.txt.gz
/user/hadoopqa/input/part-003.txt.lzo
and running this command
bin/hadoopqa jar hadoop-examples.jar wordcount /user/hadoopqa/input /user/hadoopqa/output
I get output that looks like
row 4
royal 4
rt$3-ex?ÔøΩ?÷µIStÔøΩ"4D%ÔøΩ9$UÔøΩÔøΩ"ÔøΩ, 1
ru$ÔøΩÔøΩ#~t"@ÔøΩm*d#\/$ÔøΩÔøΩl.t"XÔøΩÔøΩDi" 1
rubbÔøΩdÔøΩ&@bT 1
rubbed 2
To lzo compress the file I used lzop:
http://www.lzop.org/download/lzop-1.01-linux_i386.tar.gz
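The symptom above would be consistent with a reader that chooses a decompressor from the file name suffix and falls back to plain text when the suffix is unknown. The following is only a minimal sketch of that idea, not Hadoop's actual lookup code: the table `KNOWN_SUFFIXES` and the function `pick_codec` are hypothetical names introduced for illustration.

```python
import os

# Hypothetical suffix table (an assumption for illustration, not Hadoop's
# real codec registry). Only ".gz" is registered here.
KNOWN_SUFFIXES = {".gz": "gzip"}


def pick_codec(path, table=KNOWN_SUFFIXES):
    """Return the codec name registered for the file's suffix, or None.

    None means "no codec matched", so the caller would read the file
    as plain text.
    """
    _, ext = os.path.splitext(path)
    return table.get(ext.lower())


if __name__ == "__main__":
    inputs = [
        "/user/hadoopqa/input/part-001.txt",
        "/user/hadoopqa/input/part-002.txt.gz",
        "/user/hadoopqa/input/part-003.txt.lzo",
    ]
    for path in inputs:
        # Files with no matching codec fall through to plain text.
        print(path, "->", pick_codec(path) or "plain text")
```

Under this assumption, the `.lzo` file falls through to the plain-text path unless a codec is registered for that suffix, which would produce binary garbage among the word counts.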
Attachments
Issue Links
- is part of HADOOP-2664 lzop-compatible CompresionCodec (Closed)
The quoted text doesn't look like a direct dump of an LZO header. Could you please post the first couple of lines of "hexdump -C" output for the lzo-compressed file? The file header may be corrupted. Meanwhile, I'll check whether something is wrong with the file recognition code.
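As a quick alternative to reading the hexdump by eye, the header can be checked against the known magic bytes. An lzop file starts with the nine bytes `89 4c 5a 4f 00 0d 0a 1a 0a`, and a gzip file starts with `1f 8b`. The sketch below is not part of Hadoop: the helper names `sniff_format` and `sniff_file` are hypothetical, and it only inspects the leading bytes rather than validating the rest of the header.

```python
# Leading magic bytes of the two compressed formats in the input directory.
LZOP_MAGIC = bytes([0x89, 0x4C, 0x5A, 0x4F, 0x00, 0x0D, 0x0A, 0x1A, 0x0A])
GZIP_MAGIC = bytes([0x1F, 0x8B])


def sniff_format(header: bytes) -> str:
    """Classify a file by its first bytes: 'lzop', 'gzip', or 'unknown'."""
    if header.startswith(LZOP_MAGIC):
        return "lzop"
    if header.startswith(GZIP_MAGIC):
        return "gzip"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read just enough of a local file to classify it (hypothetical helper)."""
    with open(path, "rb") as f:
        return sniff_format(f.read(len(LZOP_MAGIC)))
```

If `sniff_file` on a local copy of part-003.txt.lzo returns "unknown", the header is probably damaged. If it returns "lzop", the file itself looks fine and the recognition code is the more likely culprit.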