Hadoop Common / HADOOP-1694

lzo compressed input files not properly recognized

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.14.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      When running the wordcount example with text, gzip and lzo compressed input files, the lzo compressed input files are not properly recognized and are treated as text files.

      With an input dir of

      /user/hadoopqa/input/part-001.txt
      /user/hadoopqa/input/part-002.txt.gz
      /user/hadoopqa/input/part-003.txt.lzo

      and running this command

      bin/hadoop jar hadoop-examples.jar wordcount /user/hadoopqa/input /user/hadoopqa/output

      I get output that looks like

      row 4
      royal 4
      rt$3-ex?�?÷µISt�"4D%�9$U��"�, 1
      ru$��#~t"@�m*d#\/$��l.t"X��Di" 1
      rubb�d�&@bT 1
      rubbed 2

      To lzo compress the file I used lzop:
      http://www.lzop.org/download/lzop-1.01-linux_i386.tar.gz
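
For context, Hadoop chooses a decompression codec purely by file-name suffix; a suffix with no registered codec falls through and the file is read as plain text, which is exactly the garbage-words symptom above. Below is a minimal, self-contained sketch of that lookup logic — a simplification of Hadoop's CompressionCodecFactory, with a hypothetical suffix table standing in for the configured codec list (class and method names here are illustrative, not Hadoop's API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified illustration of suffix-based codec selection. The table below
// is hypothetical; it only mirrors the situation in this report, where the
// default config registers gzip/deflate suffixes but not ".lzo".
public class CodecLookup {
    private static final Map<String, String> CODEC_BY_SUFFIX = new LinkedHashMap<>();
    static {
        CODEC_BY_SUFFIX.put(".deflate", "DefaultCodec");
        CODEC_BY_SUFFIX.put(".gz", "GzipCodec");
        // ".lzo" intentionally absent, as suspected in the comments below.
    }

    // Returns the codec name for a path, or null meaning "treat as plain text".
    public static String codecFor(String path) {
        for (Map.Entry<String, String> e : CODEC_BY_SUFFIX.entrySet()) {
            if (path.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(codecFor("/user/hadoopqa/input/part-002.txt.gz"));  // GzipCodec
        System.out.println(codecFor("/user/hadoopqa/input/part-003.txt.lzo")); // null: read as text
    }
}
```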

      Attachments

        1. part-201.txt.lzo
          80 kB
          Nigel Daley

        Issue Links

          Activity

            tahir Tahir Hashmi added a comment -

            The quoted text doesn't look like a direct dump of LZO header. Could you please give the first couple of lines of "hexdump -C" on the lzo compressed file? Maybe the file header is corrupted. Meanwhile I'll check if there's something wrong with the file recognition code.

            omalley Owen O'Malley added a comment -

            No, Nigel sent the output of word count, so that is just part of the compressed file that got interpreted as "words". The problem, I suspect, is that the lzo file extension is not in the default config.

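
Acting on Owen's suspicion would mean registering the codec in the site or job configuration. A hedged sketch of a hadoop-site.xml fragment for the 0.14/0.15 era — the property name and codec class names are taken from Nigel's command line below; the exact default value may differ by release:

```xml
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.LzoCodec</value>
</property>
```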
            nidaley Nigel Daley added a comment -

            I'm running the job as follows with lzo library installed on the cluster:

            hadoop --config ~/c jar $HADOOP_HOME/hadoop-0.15-examples.jar wordcount \
            -Dio.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.LzoCodec \
            /user/hadoopqa/validation/data/wordCountInput \
            /user/hadoopqa/validation/data/mapredWordCountOutput

            The map task that gets the one .lzo file throws this exception:

            ...
            07/12/07 09:38:20 INFO mapred.JobClient: map 57% reduce 0%
            07/12/07 09:38:20 INFO mapred.JobClient: Task Id : task_200712070937_0001_m_000010_0, Status : FAILED
            java.io.EOFException
            at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:106)
            at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:82)
            at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
            at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
            at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
            at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
            at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:128)
            at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:117)
            at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:39)
            at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:174)
            at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
            at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
            at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)

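
A likely explanation for the EOFException, consistent with the eventual resolution via HADOOP-2664 (which added an lzop-compatible codec): lzop writes its own container format, which the LzoCodec of this era did not parse. One way to check whether a file is an lzop container is to look for lzop's nine-byte magic. A self-contained sketch — the magic bytes come from the lzop file format; the class and method names are illustrative:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Detects the nine-byte magic that the lzop tool writes at the start of its
// container format. A Hadoop LzoCodec expecting its own block stream would
// misparse such a file and could fail mid-read, as in the trace above.
public class LzopMagic {
    static final byte[] MAGIC = {
        (byte) 0x89, 'L', 'Z', 'O', 0x00, 0x0D, 0x0A, 0x1A, 0x0A
    };

    public static boolean hasLzopMagic(byte[] header) {
        return header.length >= MAGIC.length
            && Arrays.equals(Arrays.copyOf(header, MAGIC.length), MAGIC);
    }

    public static boolean isLzopFile(String path) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            byte[] header = new byte[MAGIC.length];
            int n = in.readNBytes(header, 0, header.length);
            return n == header.length && hasLzopMagic(header);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(isLzopFile(args[0]) ? "lzop container" : "not lzop");
    }
}
```

The same bytes are what Tahir's requested `hexdump -C` would show at offset zero of a file produced by lzop.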
            nidaley Nigel Daley added a comment -

            Attaching my .lzo file. It was created with lzop 1.01.

            cdouglas Christopher Douglas added a comment -

            Fixed in HADOOP-2664

            People

              Assignee: Arun Murthy (acmurthy)
              Reporter: Nigel Daley (nidaley)
