Hadoop Common / HADOOP-6852

apparent bug in concatenated-bzip2 support (decoding)

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.22.0
    • Fix Version/s: None
    • Component/s: io
    • Labels:
      None
    • Environment:

      Linux x86_64 running 32-bit Hadoop, JDK 1.6.0_15

      Description

      The following simplified code (manually picked out of testMoreBzip2() in https://issues.apache.org/jira/secure/attachment/12448272/HADOOP-6835.v4.trunk-hadoop-mapreduce.patch) triggers a "java.io.IOException: bad block header" in org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:527):

          JobConf jobConf = new JobConf(defaultConf);
      
          CompressionCodec bzip2 = new BZip2Codec();
          ReflectionUtils.setConf(bzip2, jobConf);
          localFs.delete(workDir, true);
      
          // copy multiple-member test file to HDFS
          String fn2 = "testCompressThenConcat.txt" + bzip2.getDefaultExtension();
          Path fnLocal2 = new Path(System.getProperty("test.concat.data","/tmp"),fn2);
          Path fnHDFS2  = new Path(workDir, fn2);
          localFs.copyFromLocalFile(fnLocal2, fnHDFS2);
      
          FileInputFormat.setInputPaths(jobConf, workDir);
      
          final FileInputStream in2 = new FileInputStream(fnLocal2.toString());
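          // read the local copy directly through the codec (no MapReduce machinery involved)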
          CompressionInputStream cin2 = bzip2.createInputStream(in2);
          LineReader in = new LineReader(cin2);
          Text out = new Text();
      
          int numBytes, totalBytes=0, lineNum=0;
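          // read the whole concatenated stream line by line; with this bug, the loop
          // dies with "bad block header" at the start of the second bzip2 member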
          while ((numBytes = in.readLine(out)) > 0) {
            ++lineNum;
            totalBytes += numBytes;
          }
          in.close();
      

      The specified file is also included in the H-6835 patch linked above, and some additional debug output is included in the commented-out test loop above. (Only in the linked, "v4" version of the patch, however--I'm about to remove the debug stuff for checkin.)

      It's possible I've done something completely boneheaded here, but the file, at least, checks out in a subsequent set of subtests and with stock bzip2 itself. Only the code above is problematic; it reads through the first concatenated chunk (17 lines of text) just fine but chokes on the header of the second one. Altogether, the test file contains 84 lines of text and 4 concatenated bzip2 files.

      (It's possible this is a mapreduce issue rather than common, but note that the identical gzip test works fine. Possibly it's related to the stream-vs-decompressor dichotomy, though; intentionally not supported?)
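      For reference, here is a minimal sketch (not part of the patch; the chunk contents are made up and the file name is only illustrative) of how a multi-member file like testCompressThenConcat.txt.bz2 can be produced: each call to BZip2Codec.createOutputStream() writes one complete bzip2 stream, and the members are simply appended back to back, the same as `cat a.bz2 b.bz2 > out.bz2`. Stock bzip2 decodes the result fine; it's the CONTINUOUS-mode CBZip2InputStream above that chokes on the second member.

          import java.io.ByteArrayOutputStream;
          import java.io.FileOutputStream;
          import java.io.IOException;

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.io.compress.BZip2Codec;
          import org.apache.hadoop.io.compress.CompressionOutputStream;
          import org.apache.hadoop.util.ReflectionUtils;

          public class ConcatBzip2Writer {
            public static void main(String[] args) throws IOException {
              Configuration conf = new Configuration();
              BZip2Codec bzip2 = new BZip2Codec();
              ReflectionUtils.setConf(bzip2, conf);   // mirrors the ReflectionUtils.setConf() call in the test above

              FileOutputStream out = new FileOutputStream("testCompressThenConcat.txt.bz2");
              try {
                for (String chunk : new String[] { "first member\n", "second member\n" }) {
                  // compress each chunk as its own complete, self-contained bzip2 stream...
                  ByteArrayOutputStream buf = new ByteArrayOutputStream();
                  CompressionOutputStream cout = bzip2.createOutputStream(buf);
                  cout.write(chunk.getBytes("UTF-8"));
                  cout.close();
                  // ...then append the finished members back to back in one file
                  out.write(buf.toByteArray());
                }
              } finally {
                out.close();
              }
            }
          }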


          Activity

          Todd Lipcon added a comment -

          Greg theorized that HADOOP-6925 may fix this.

          Greg Roelofs added a comment -

          Alas, it doesn't.

          Wouter de Bie added a comment -

          I'm having the same problem. We write files from our logging application using the BZip2Codec, but the application rotates files on an hourly basis. On top of that, we compress whenever we flush log lines, which in turn produces bzip2 blocks of varying sizes.
          Anyway, we're also getting the 'bad block header' exception.
          Looking through some code and trying different things, we got things to work by changing the following in org.apache.hadoop.io.compress.bzip2.CBZip2InputStream:

            public CBZip2InputStream(final InputStream in) throws IOException {
              this(in, READ_MODE.CONTINUOUS);
            }
          

          to

            public CBZip2InputStream(final InputStream in) throws IOException {
              this(in, READ_MODE.BYBLOCK);
            }
          

          This causes CBZip2InputStream to use the block size in initBlock() instead of looking for the magic numbers.

          I'll be digging into the code more tomorrow, but to move forward quickly on this issue, I would like to know why CONTINUOUS is the default mode and whether there is a place where the read mode is determined. BZip2 is block based, so why not default to that?

          I'll also try the above piece of code and see if that works with BYBLOCK.
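          A minimal sketch of that experiment (the path is illustrative, and this assumes the READ_MODE enum from SplittableCompressionCodec and the public two-arg CBZip2InputStream constructor): the same line-reading loop as in the description, but with the read mode forced to BYBLOCK at the call site instead of patching the one-arg constructor. Whether BYBLOCK actually decodes all four members is exactly what remains to be tested; treat this as an experiment, not a fix.

            import java.io.FileInputStream;

            import org.apache.hadoop.io.Text;
            import org.apache.hadoop.io.compress.SplittableCompressionCodec.READ_MODE;
            import org.apache.hadoop.io.compress.bzip2.CBZip2InputStream;
            import org.apache.hadoop.util.LineReader;

            public class ByBlockExperiment {
              public static void main(String[] args) throws Exception {
                FileInputStream raw =
                    new FileInputStream("testCompressThenConcat.txt.bz2");  // illustrative path
                // force BYBLOCK instead of the CONTINUOUS default
                CBZip2InputStream cin = new CBZip2InputStream(raw, READ_MODE.BYBLOCK);
                LineReader in = new LineReader(cin);
                Text line = new Text();
                int numBytes, lineNum = 0;
                while ((numBytes = in.readLine(line)) > 0) {
                  ++lineNum;
                }
                in.close();
                System.out.println("read " + lineNum + " lines");  // 84 expected if all 4 members decode
              }
            }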

          Todd Lipcon added a comment -

          Wouter: do you have HADOOP-6925 in your build? Worth trying that to see if your problem is the same as Greg's or the one we fixed in the other JIRA.

          Len Trigg added a comment -

          We have been using the Ant-based bzip2 library for our project and needed to be able to decompress concatenated bzip2 files. After poking around we came across the Hadoop extensions and immediately found that they did not function correctly because of this bug. Essentially, when crossing block boundaries the skipToNextMarker method leaves the stream position at the end of the block delimiter, but initBlock expects to be at the beginning of the block delimiter. Given the poor structure of the initBlock method, and the thread-unsafety that the numberOfBytesTillNextMarker() method has introduced into this class, we decided to avoid the Hadoop version of this class altogether.
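          To make the positioning mismatch concrete, here is a small illustration (not Hadoop code, and simplified to whole bytes; the real reader works at the bit level and block delimiters are generally not byte-aligned): a bzip2 block begins with the 48-bit magic 0x314159265359, and a check like the one below reports "bad block header" whenever the caller has already consumed that delimiter.

            import java.io.IOException;
            import java.io.InputStream;

            class BlockHeaderCheck {
              // 48-bit start-of-block magic ("pi" digits); the end-of-stream magic is 0x177245385090
              static final long BLOCK_HEADER_MAGIC = 0x314159265359L;

              // Byte-aligned simplification of what a block-header check does: read the next
              // six bytes and insist they are the block delimiter. If the caller (here,
              // skipToNextMarker) has already moved the stream past the delimiter, this sees
              // the block's payload instead and fails, which is the symptom in this issue.
              static void expectBlockHeader(InputStream in) throws IOException {
                long magic = 0;
                for (int i = 0; i < 6; i++) {
                  int b = in.read();
                  if (b < 0) {
                    throw new IOException("unexpected end of stream");
                  }
                  magic = (magic << 8) | b;
                }
                if (magic != BLOCK_HEADER_MAGIC) {
                  throw new IOException("bad block header");
                }
              }
            }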

          Iker Jimenez added a comment -

          This is happening to us too when we compress files with pbzip2.
          Any ETA for a fix?

          Bob Tiernay added a comment -

          Just hit this bug. Are there any plans to fix it, or do we as a community need to add something to a project like Elephant Bird to work around the issue? Thanks.


    People

    • Assignee: Unassigned
    • Reporter: Greg Roelofs
    • Votes: 0
    • Watchers: 8