  1. Hadoop Common
  2. HADOOP-6817

SequenceFile.Reader can't read a gzip-compressed sequence file produced by a MapReduce job when the native compression library is not available


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.20.2
    • Fix Version/s: None
    • Component/s: io
    • Labels: None
    • Environment: Cluster: CentOS 5, jdk1.6.0_20
      Client: Mac Snow Leopard, jdk1.6.0_20

    • SequenceFile.Reader,Gzip

    Description

      A Hadoop job outputs a gzip-compressed sequence file (either record-compressed or block-compressed). A client program then uses SequenceFile.Reader to read this sequence file. While opening the file, the client program fails with the following exception:

      2090 [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      2091 [main] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
      Exception in thread "main" java.io.EOFException
      at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:207)
      at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:197)
      at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:136)
      at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
      at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68)
      at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream$ResetableGZIPInputStream.<init>(GzipCodec.java:92)
      at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.<init>(GzipCodec.java:101)
      at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:170)
      at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:180)
      at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1520)
      at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
      at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
      at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
      at com.shiningware.intelligenceonline.taobao.mapreduce.HtmlContentSeqOutputView.main(HtmlContentSeqOutputView.java:28)

      I studied the org.apache.hadoop.io.SequenceFile.Reader.init method and found the following code:
      // Initialize... not if this we are constructing a temporary Reader
      if (!tempReader) {
        valBuffer = new DataInputBuffer();
        if (decompress) {
          valDecompressor = CodecPool.getDecompressor(codec);
          valInFilter = codec.createInputStream(valBuffer, valDecompressor);
          valIn = new DataInputStream(valInFilter);
        } else {
          valIn = valBuffer;
        }
        ...

      The problem seems to be caused by "valBuffer = new DataInputBuffer();". GzipCodec.createInputStream creates an instance of GzipInputStream, whose constructor creates an instance of the ResetableGZIPInputStream class. When ResetableGZIPInputStream's constructor calls the constructor of its base class, java.util.zip.GZIPInputStream, that constructor immediately tries to read the gzip header from the still-empty valBuffer. It gets no content, so it throws an EOFException.
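
      The behaviour described above can be checked without Hadoop. The following is a minimal sketch that uses only the JDK; the class name EmptyGzipHeaderDemo and its helper methods are hypothetical and are not part of Hadoop. It only illustrates that java.util.zip.GZIPInputStream reads the gzip header in its constructor, so constructing it over an empty buffer fails with EOFException, while constructing it over real gzip data succeeds.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of Hadoop.
public class EmptyGzipHeaderDemo {

    // Returns true if constructing a GZIPInputStream over `data` fails with
    // EOFException, i.e. the constructor could not read a complete gzip header.
    static boolean constructorHitsEof(byte[] data) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return false; // header was read successfully
        } catch (EOFException e) {
            return true;  // stream ended while the constructor read the header
        }
    }

    // Helper: gzip a byte array in memory.
    static byte[] gzip(byte[] plain) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(plain);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Empty input, analogous to the freshly created, empty valBuffer.
        System.out.println("empty buffer: " + constructorHitsEof(new byte[0]));
        // Real gzip data: the constructor can read the header.
        System.out.println("gzip payload: " + constructorHitsEof(gzip("hello".getBytes("UTF-8"))));
    }
}
```

      Running main prints "empty buffer: true" followed by "gzip payload: false", which is consistent with the top frames of the stack trace (GZIPInputStream.<init> calling readHeader).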

            People

              Assignee: Unassigned
              Reporter: Wenjun Ruan
              Votes: 5
              Watchers: 7
