Hadoop Map/Reduce / MAPREDUCE-255

avoid bzip2 decompressor throwing exception on corrupted (prematurely truncated) file

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete

    Description

      Running a map-reduce streaming job over bzip2-compressed input fails with one of the two Java exceptions shown below.

      This seems to happen when one of the bz2 input files is corrupted, probably because the file was prematurely truncated. An example command follows the stack traces.

      Can we fix the bzip2 decompressor so that it does not throw these two exceptions?

      2008-07-16 07:23:39,605 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
      java.io.IOException: mark/reset not supported
          at java.io.InputStream.reset(InputStream.java:334)
          at org.apache.hadoop.mapred.Bzip2TextInputFormat$BZip2LineRecordReader.readLine(Bzip2TextInputFormat.java:117)
          at org.apache.hadoop.mapred.Bzip2TextInputFormat$BZip2LineRecordReader.next(Bzip2TextInputFormat.java:140)
          at org.apache.hadoop.mapred.Bzip2TextInputFormat$BZip2LineRecordReader.next(Bzip2TextInputFormat.java:34)
          at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:158)
          at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:45)
          at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
          at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

      or

      2008-07-16 20:49:28,020 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
      java.io.IOException: CRC error
          at org.apache.tools.bzip2r.CBZip2InputStream.cadvise(CBZip2InputStream.java:74)
          at org.apache.tools.bzip2r.CBZip2InputStream.crcError(CBZip2InputStream.java:378)
          at org.apache.tools.bzip2r.CBZip2InputStream.endBlock(CBZip2InputStream.java:351)
          at org.apache.tools.bzip2r.CBZip2InputStream.setupNoRandPartA(CBZip2InputStream.java:851)
          at org.apache.tools.bzip2r.CBZip2InputStream.setupNoRandPartB(CBZip2InputStream.java:903)
          at org.apache.tools.bzip2r.CBZip2InputStream.read(CBZip2InputStream.java:240)
          at org.apache.hadoop.mapred.Bzip2TextInputFormat$BZip2LineRecordReader.readLine(Bzip2TextInputFormat.java:102)
          at org.apache.hadoop.mapred.Bzip2TextInputFormat$BZip2LineRecordReader.next(Bzip2TextInputFormat.java:140)
          at org.apache.hadoop.mapred.Bzip2TextInputFormat$BZip2LineRecordReader.next(Bzip2TextInputFormat.java:34)
          at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:158)
          at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:45)
          at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
          at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
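
      One way to read the request is that a reader should hand back every record that decodes cleanly and then stop quietly at the point of damage, instead of failing the task. The following is a minimal Python sketch of that behaviour using the standard-library bz2 module. It is not the Hadoop Bzip2TextInputFormat code; salvage_lines and chunk_size are hypothetical names chosen for illustration, and the sketch assumes a single-stream file with newline-delimited records.

```python
import bz2


def salvage_lines(path, chunk_size=64 * 1024):
    """Yield complete lines (bytes, no trailing newline) from a bz2 file.

    Stops quietly at truncation or at the first decompression error
    instead of raising. Hypothetical helper, for illustration only.
    """
    decomp = bz2.BZ2Decompressor()
    pending = b""  # bytes decoded so far that do not yet end in a newline
    with open(path, "rb") as raw:
        while not decomp.eof:
            chunk = raw.read(chunk_size)
            if not chunk:
                # File ended before the end-of-stream marker: truncated input.
                break
            try:
                data = decomp.decompress(chunk)
            except OSError:
                # Invalid data (for example a CRC mismatch): stop reading.
                break
            pending += data
            # Everything before the last newline is a complete record.
            *lines, pending = pending.split(b"\n")
            yield from lines
    # Only trust a final unterminated line if the stream ended cleanly.
    if decomp.eof and pending:
        yield pending
```

      bzip2 data is decoded block by block, so records inside the damaged or incomplete block are lost along with everything after it. Dropping data silently is a trade-off; a real job may prefer to log or count the event so the loss is visible.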

      Example:
      $HADOOP_HOME/bin/hadoop jar -libjars $<path>/jars/bzip2.jar \
      $HADOOP_HOME/hadoop-streaming.jar \
      -inputformat org.apache.hadoop.mapred.Bzip2TextInputFormat \
      -mapper "cat" \
      -reducer "cat" \
      -numReduceTasks 20 \
      -input '<path>/corrupt-data.bz2' \
      -output bzip2_bug_example \
      -jobconf stream.num.map.output.key.fields=1 \
      -jobconf stream.num.reduce.output.fields=1 \
      -jobconf num.key.fields.for.partition=1
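
      As a workaround until the reader tolerates damaged input, the input files could be checked before the job is submitted. The sketch below uses Python's standard-library bz2 module and assumes the files can be read from a local path; is_intact_bz2 and chunk_size are hypothetical names, not part of Hadoop.

```python
import bz2


def is_intact_bz2(path, chunk_size=1 << 20):
    """Return True if the whole file decompresses cleanly, False otherwise.

    Hypothetical pre-flight check, for illustration only.
    """
    # Open the raw file outside the try block so that a missing or
    # unreadable path still raises instead of being reported as corrupt.
    with open(path, "rb") as raw:
        try:
            with bz2.open(raw, "rb") as stream:
                # Decompress everything and discard it; only errors matter.
                while stream.read(chunk_size):
                    pass
        except EOFError:
            # File ended before the end-of-stream marker: truncated.
            return False
        except OSError:
            # Invalid data stream, for example a CRC mismatch or not bzip2.
            return False
    return True
```

      The bzip2 command-line tool offers the same kind of integrity test from the shell via bzip2 -t.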

          People

            Assignee: Unassigned
            Reporter: Suhas (vgogate)