Hadoop Common / HADOOP-1694

lzo compressed input files not properly recognized

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.14.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      When running the wordcount example with text, gzip and lzo compressed input files, the lzo compressed input files are not properly recognized and are treated as text files.

      With an input dir of

      /user/hadoopqa/input/part-001.txt
      /user/hadoopqa/input/part-002.txt.gz
      /user/hadoopqa/input/part-003.txt.lzo

      and running this command

      bin/hadoop jar hadoop-examples.jar wordcount /user/hadoopqa/input /user/hadoopqa/output

      I get output that looks like

      row 4
      royal 4
      rt$3-ex?�?÷µISt�"4D%�9$U��"�, 1
      ru$��#~t"@�m*d#\/$��l.t"X��Di" 1
      rubb�d�&@bT 1
      rubbed 2

      To lzo compress the file I used lzop:
      http://www.lzop.org/download/lzop-1.01-linux_i386.tar.gz
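
For context, Hadoop chooses a decompression codec purely by file-name suffix; a suffix with no registered codec falls through and the file is read as plain text, which is exactly the garbage-words symptom above. Below is a minimal, self-contained sketch of that lookup logic — a simplification of Hadoop's CompressionCodecFactory, with a hypothetical suffix table standing in for the configured codec list (class and method names here are illustrative, not Hadoop's API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified illustration of suffix-based codec selection. The table below
// is hypothetical; it only mirrors the situation in this report, where the
// default config registers gzip/deflate suffixes but not ".lzo".
public class CodecLookup {
    private static final Map<String, String> CODEC_BY_SUFFIX = new LinkedHashMap<>();
    static {
        CODEC_BY_SUFFIX.put(".deflate", "DefaultCodec");
        CODEC_BY_SUFFIX.put(".gz", "GzipCodec");
        // ".lzo" intentionally absent, as suspected in the comments below.
    }

    // Returns the codec name for a path, or null meaning "treat as plain text".
    public static String codecFor(String path) {
        for (Map.Entry<String, String> e : CODEC_BY_SUFFIX.entrySet()) {
            if (path.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(codecFor("/user/hadoopqa/input/part-002.txt.gz"));  // GzipCodec
        System.out.println(codecFor("/user/hadoopqa/input/part-003.txt.lzo")); // null: read as text
    }
}
```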

      Attachments

        1. part-201.txt.lzo
          80 kB
          Nigel Daley

        Issue Links

          Activity

            tahir Tahir Hashmi added a comment -

            The quoted text doesn't look like a direct dump of LZO header. Could you please give the first couple of lines of "hexdump -C" on the lzo compressed file? Maybe the file header is corrupted. Meanwhile I'll check if there's something wrong with the file recognition code.

            omalley Owen O'Malley added a comment -

            No, Nigel sent the output of word count, so that is just part of the compressed file that got interpreted as "words". The problem, I suspect, is that the lzo file extension is not in the default config.

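
Acting on Owen's suspicion would mean registering the codec in the site or job configuration. A hedged sketch of a hadoop-site.xml fragment for the 0.14/0.15 era — the property name and codec class names are taken from Nigel's command line below; the exact default value may differ by release:

```xml
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.LzoCodec</value>
</property>
```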
            nidaley Nigel Daley added a comment -

            I'm running the job as follows with lzo library installed on the cluster:

            hadoop --config ~/c jar $HADOOP_HOME/hadoop-0.15-examples.jar wordcount \
            -Dio.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.LzoCodec \
            /user/hadoopqa/validation/data/wordCountInput \
            /user/hadoopqa/validation/data/mapredWordCountOutput

            The map task that gets the one .lzo file throws this exception:

            ...
            07/12/07 09:38:20 INFO mapred.JobClient: map 57% reduce 0%
            07/12/07 09:38:20 INFO mapred.JobClient: Task Id : task_200712070937_0001_m_000010_0, Status : FAILED
            java.io.EOFException
            at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:106)
            at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:82)
            at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
            at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
            at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
            at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
            at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:128)
            at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:117)
            at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:39)
            at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:174)
            at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
            at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
            at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)

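
A likely explanation for the EOFException, consistent with the eventual resolution via HADOOP-2664 (which added an lzop-compatible codec): lzop writes its own container format, which the LzoCodec of this era did not parse. One way to check whether a file is an lzop container is to look for lzop's nine-byte magic. A self-contained sketch — the magic bytes come from the lzop file format; the class and method names are illustrative:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Detects the nine-byte magic that the lzop tool writes at the start of its
// container format. A Hadoop LzoCodec expecting its own block stream would
// misparse such a file and could fail mid-read, as in the trace above.
public class LzopMagic {
    static final byte[] MAGIC = {
        (byte) 0x89, 'L', 'Z', 'O', 0x00, 0x0D, 0x0A, 0x1A, 0x0A
    };

    public static boolean hasLzopMagic(byte[] header) {
        return header.length >= MAGIC.length
            && Arrays.equals(Arrays.copyOf(header, MAGIC.length), MAGIC);
    }

    public static boolean isLzopFile(String path) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            byte[] header = new byte[MAGIC.length];
            int n = in.readNBytes(header, 0, header.length);
            return n == header.length && hasLzopMagic(header);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(isLzopFile(args[0]) ? "lzop container" : "not lzop");
    }
}
```

The same bytes are what Tahir's requested `hexdump -C` would show at offset zero of a file produced by lzop.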
            nidaley Nigel Daley added a comment -

            Attaching my .lzo file. It was created with lzop 1.01.

            cdouglas Christopher Douglas added a comment -

            Fixed in HADOOP-2664

            People

              Assignee: Arun Murthy (acmurthy)
              Reporter: Nigel Daley (nidaley)
