Hadoop Common: HADOOP-6817

SequenceFile.Reader can't read a gzip-compressed sequence file produced by a MapReduce job without the native compression library

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.20.2
    • Fix Version/s: None
    • Component/s: io
    • Labels:
      None
    • Environment:

      Cluster: CentOS 5, JDK 1.6.0_20
      Client: Mac OS X Snow Leopard, JDK 1.6.0_20

    • Tags:
      SequenceFile.Reader, Gzip

      Description

      A Hadoop job outputs a gzip-compressed sequence file (whether record-compressed or block-compressed). When a client program uses SequenceFile.Reader to read this sequence file, it fails with the following exceptions:

      2090 [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      2091 [main] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
      Exception in thread "main" java.io.EOFException
      at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:207)
      at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:197)
      at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:136)
      at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
      at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68)
      at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream$ResetableGZIPInputStream.<init>(GzipCodec.java:92)
      at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.<init>(GzipCodec.java:101)
      at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:170)
      at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:180)
      at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1520)
      at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
      at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
      at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
      at com.shiningware.intelligenceonline.taobao.mapreduce.HtmlContentSeqOutputView.main(HtmlContentSeqOutputView.java:28)

      I studied the code in the org.apache.hadoop.io.SequenceFile.Reader.init method and found:
      // Initialize... not if this we are constructing a temporary Reader
      if (!tempReader) {
        valBuffer = new DataInputBuffer();
        if (decompress) {
          valDecompressor = CodecPool.getDecompressor(codec);
          valInFilter = codec.createInputStream(valBuffer, valDecompressor);
          valIn = new DataInputStream(valInFilter);
        } else {
          valIn = valBuffer;
        }

      The problem seems to be caused by "valBuffer = new DataInputBuffer();": GzipCodec.createInputStream creates an instance of GzipInputStream, whose constructor creates an instance of the ResetableGZIPInputStream class. When ResetableGZIPInputStream's constructor calls its base class java.util.zip.GZIPInputStream's constructor, that constructor tries to read the gzip header from the still-empty valBuffer, finds no content, and throws an EOFException.
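      The failure described above is reproducible with the plain JDK, outside Hadoop entirely: java.util.zip.GZIPInputStream reads the gzip header eagerly in its constructor, so wrapping a stream that has no bytes yet fails immediately. A minimal sketch (class name EmptyGzipDemo is illustrative, not from the original report) that triggers the same EOFException:

      ```java
      import java.io.ByteArrayInputStream;
      import java.io.EOFException;
      import java.io.IOException;
      import java.util.zip.GZIPInputStream;

      public class EmptyGzipDemo {
          public static void main(String[] args) {
              // GZIPInputStream reads the gzip header in its constructor.
              // An empty underlying stream therefore fails at construction
              // time, mirroring what happens when GzipCodec.createInputStream
              // is handed the not-yet-filled valBuffer in SequenceFile.Reader.init.
              try {
                  new GZIPInputStream(new ByteArrayInputStream(new byte[0]));
                  System.out.println("no exception");
              } catch (EOFException e) {
                  System.out.println("EOFException from empty stream");
              } catch (IOException e) {
                  System.out.println("other IOException: " + e);
              }
          }
      }
      // prints "EOFException from empty stream"
      ```

      This is presumably why the reporter only sees the error on the non-native path: with the native library loaded, a zlib-backed decompressor stream is used instead, and it does not consume the header at construction time.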

        Issue Links

          Activity

          Harsh J added a comment -

          Also Niels, if you notice the stack trace, the user ran into this because his jobs did create it with the right codec, but his reader failed to enforce that the right codec is needed, which is what HADOOP-8582 wishes to fix. Or did I miss something?

          Harsh J added a comment -

          Hi Niels,

          I am happy to reopen this, but the reason this does not work is explained at HADOOP-538.

          I will add that as a link as well.

          Do you still wish to reopen?

          Niels Basjes added a comment -

          To me this seems NOT to be a duplicate of HADOOP-8582.
          To me this issue is essentially: a problem with Gzip in a specific situation.
          HADOOP-8582 effectively says "Let's make the error message clear until we fix the real problem".

          So I propose we keep this open as an unsolved 'non-duplicate' bug.

          Harsh J added a comment -

          This is being addressed via HADOOP-8582.


            People

            • Assignee: Unassigned
            • Reporter: Wenjun Huang
            • Votes: 5
            • Watchers: 8

              Dates

              • Created:
                Updated:
                Resolved:
