[HADOOP-6852] apparent bug in concatenated-bzip2 support (decoding) - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.22.0
Fix Version/s: 3.1.0
Component/s: io
Labels: None
Environment: Linux x86_64 running 32-bit Hadoop, JDK 1.6.0_15

Description

The following simplified code (manually picked out of testMoreBzip2() in https://issues.apache.org/jira/secure/attachment/12448272/HADOOP-6835.v4.trunk-hadoop-mapreduce.patch) triggers a "java.io.IOException: bad block header" in org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:527):

    JobConf jobConf = new JobConf(defaultConf);

    // configure the bzip2 codec and start from a clean working directory
    CompressionCodec bzip2 = new BZip2Codec();
    ReflectionUtils.setConf(bzip2, jobConf);
    localFs.delete(workDir, true);

    // copy multiple-member test file to HDFS
    String fn2 = "testCompressThenConcat.txt" + bzip2.getDefaultExtension();
    Path fnLocal2 = new Path(System.getProperty("test.concat.data","/tmp"),fn2);
    Path fnHDFS2  = new Path(workDir, fn2);
    localFs.copyFromLocalFile(fnLocal2, fnHDFS2);

    FileInputFormat.setInputPaths(jobConf, workDir);

    // decompress the local copy directly through the codec's stream interface
    final FileInputStream in2 = new FileInputStream(fnLocal2.toString());
    CompressionInputStream cin2 = bzip2.createInputStream(in2);
    LineReader in = new LineReader(cin2);
    Text out = new Text();

    // count lines and bytes; the read fails at the start of the second member
    int numBytes, totalBytes=0, lineNum=0;
    while ((numBytes = in.readLine(out)) > 0) {
      ++lineNum;
      totalBytes += numBytes;
    }
    in.close();

The specified file is also included in the HADOOP-6835 patch linked above, and some additional debug output is included in the commented-out test loop above. (That debug output exists only in the linked "v4" version of the patch, however; I'm about to remove it for checkin.)

It's possible I've done something completely boneheaded here, but the file, at least, checks out in a subsequent set of subtests and with stock bzip2 itself. Only the code above is problematic; it reads through the first concatenated chunk (17 lines of text) just fine but chokes on the header of the second one. Altogether, the test file contains 84 lines of text and 4 concatenated bzip2 files.
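To make "chokes on the header of the second one" concrete, here is a minimal sketch of the byte patterns involved in the bzip2 format. A stream starts with the four bytes 'B', 'Z', 'h' and a level digit '1' to '9'; each compressed block starts with the 48-bit magic 0x314159265359, and a stream ends with the 48-bit magic 0x177245385090. The class `Bzip2Magic` and its methods are hypothetical helpers written for illustration and are not part of Hadoop. One plausible reading of the failure, offered as an assumption rather than a diagnosis, is that a decoder expecting a block header after the first member instead meets the stream header of the next member.

```java
/** Hypothetical helper (not part of Hadoop): classifies bytes at a byte-aligned offset. */
public class Bzip2Magic {
    // 48-bit magic that starts every compressed block.
    static final long BLOCK_MAGIC = 0x314159265359L;
    // 48-bit magic that marks the end of a stream.
    static final long END_OF_STREAM_MAGIC = 0x177245385090L;

    /** True if buf[off..off+3] is 'B', 'Z', 'h' followed by a level digit '1'..'9'. */
    static boolean isStreamHeader(byte[] buf, int off) {
        return buf.length - off >= 4
            && buf[off] == 'B' && buf[off + 1] == 'Z' && buf[off + 2] == 'h'
            && buf[off + 3] >= '1' && buf[off + 3] <= '9';
    }

    /** Reads six bytes starting at off as a big-endian 48-bit value. */
    static long read48(byte[] buf, int off) {
        long v = 0;
        for (int i = 0; i < 6; i++) {
            v = (v << 8) | (buf[off + i] & 0xFF);
        }
        return v;
    }

    /**
     * Classifies the bytes at off. Only meaningful at byte-aligned positions,
     * such as the start of a member; later blocks inside a stream are bit-aligned.
     */
    static String classify(byte[] buf, int off) {
        if (isStreamHeader(buf, off)) {
            return "stream header";
        }
        if (buf.length - off >= 6) {
            long magic = read48(buf, off);
            if (magic == BLOCK_MAGIC) return "block header";
            if (magic == END_OF_STREAM_MAGIC) return "end of stream";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        // Start of a member: "BZh9" followed by the first block's magic.
        byte[] start = {'B', 'Z', 'h', '9', 0x31, 0x41, 0x59, 0x26, 0x53, 0x59};
        System.out.println(classify(start, 0));
        System.out.println(classify(start, 4));
    }
}
```

A reader that supports concatenated members has to accept a new stream header wherever the previous member's end-of-stream marker and CRC have been consumed, rather than treating anything other than a block magic as an error.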

(It's possible this is a mapreduce issue rather than common, but note that the identical gzip test works fine. Possibly it's related to the stream-vs-decompressor dichotomy, though; intentionally not supported?)
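For comparison, the multi-member read pattern that the report says works for gzip can be sketched with the JDK alone. This is not the Hadoop test and does not use Hadoop's codecs; it assumes a JDK whose `java.util.zip.GZIPInputStream` continues into the next member after a trailer (current JDKs do). The class name `ConcatGzipDemo` and its helpers are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical demo: compress pieces separately, concatenate, read back as one stream. */
public class ConcatGzipDemo {
    /** Compresses text into one complete gzip member. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    /** Byte-wise concatenation, equivalent to `cat a.gz b.gz > both.gz`. */
    static byte[] concat(byte[]... parts) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        for (byte[] p : parts) {
            bos.write(p, 0, p.length);
        }
        return bos.toByteArray();
    }

    /** Decompresses the whole input and counts the lines across all members. */
    static int countLines(byte[] data) throws IOException {
        int lines = 0;
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(data)),
                StandardCharsets.UTF_8))) {
            while (r.readLine() != null) {
                lines++;
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        byte[] first = gzip("first member, line one\nfirst member, line two\n");
        byte[] second = gzip("second member, line one\nsecond member, line two\nsecond member, line three\n");
        // A reader with concatenation support sees the lines of both members.
        System.out.println(countLines(concat(first, second)));
    }
}
```

The bzip2 case in this report is the same shape: separately compressed files joined byte for byte, with the reader expected to carry on past each member boundary.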

Attachments

HADOOP-6852.01.patch, 05/Feb/18 13:32, 27 kB, Zsolt Venczel
HADOOP-6852.02.patch, 06/Feb/18 09:29, 27 kB, Zsolt Venczel
HADOOP-6852.03.patch, 06/Feb/18 14:15, 29 kB, Zsolt Venczel
HADOOP-6852.04.patch, 21/Feb/18 11:12, 29 kB, Zsolt Venczel

Issue Links

is related to: HADOOP-6335 Support reading of concatenated gzip and bzip2 files (Resolved)

relates to: HADOOP-6925 BZip2Codec incorrectly implements read() (Closed)

links to: CDH-8864; How to merge 2 bzip2'ed files?

People

Assignee: Zsolt Venczel

Reporter: Greg Roelofs

Votes: 0

Watchers: 10

Dates

Created: 07/Jul/10 22:25

Updated: 19/Mar/18 18:32

Resolved: 21/Feb/18 19:59