Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.19.0
    • Fix Version/s: 0.19.0
    • Component/s: conf, io
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Introduced support for bzip2 compressed files.

      Description

      Hadoop recognizes gzip-compressed input and automatically decompresses the data before providing it to the mapper. However, Hadoop cannot split a gzip stream because of the nature of gzip compression. Consequently, one gzip stream (e.g., a whole file) can go to only one mapper. In contrast, a bzip2-compressed stream can be split at its block delimiters.
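
To illustrate why bzip2 block delimiters make splitting feasible, here is a minimal sketch (not the code from this patch) that scans a byte array for the 48-bit block-header magic 0x314159265359 defined by the bzip2 format. Blocks are not byte-aligned, so the scan moves one bit at a time. The class and method names are hypothetical, and a real splitter would also have to deal with the same bit pattern occurring by chance inside compressed data.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: locate candidate bzip2 block starts by scanning for the block magic. */
class BZip2BlockScanner {
    // 48-bit block-header magic from the bzip2 format (BCD digits of pi).
    static final long BLOCK_MAGIC = 0x314159265359L;
    static final long MASK_48 = 0xFFFFFFFFFFFFL;

    /** Returns the bit offsets at which the 48-bit block magic begins. */
    static List<Long> findBlockStarts(byte[] data) {
        List<Long> starts = new ArrayList<>();
        long window = 0;   // the most recent 48 bits seen
        long bitsRead = 0; // total number of bits consumed so far
        for (byte b : data) {
            for (int bit = 7; bit >= 0; bit--) {
                // Shift in one bit, most significant bit of each byte first.
                window = ((window << 1) | ((b >> bit) & 1)) & MASK_48;
                bitsRead++;
                if (bitsRead >= 48 && window == BLOCK_MAGIC) {
                    starts.add(bitsRead - 48);
                }
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        // Synthetic bytes, not a real bzip2 stream: "BZh9" followed by the block magic.
        byte[] data = {'B', 'Z', 'h', '9', 0x31, 0x41, 0x59, 0x26, 0x53, 0x59, 0x00};
        System.out.println(findBlockStarts(data)); // prints [32]
    }
}
```

gzip has no comparable per-block marker that a reader can resynchronize on, which is why a gzip stream has to be consumed from its beginning by a single mapper.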

      We are interested in extending Hadoop to support splittable bzip2 with a codec. (https://issues.apache.org/jira/browse/HADOOP-1823 splits bzip2 files with an input reader, which the user must provide and which can handle FileInputFormat. A user who wants to use some other input format or do custom record handling must write a new input reader!)

      We have a patch now that provides a basic bzip2 codec equivalent to the current gzip codec. We are in the process of extending that to support splitting.

      1. HADOOP-3646-version5.patch
        108 kB
        Abdul Qadeer
      2. HADOOP-3646-version4.patch
        109 kB
        Abdul Qadeer
      3. HADOOP-3646version3.patch
        107 kB
        Abdul Qadeer
      4. HADOOP-3646.patch
        134 kB
        Abdul Qadeer
      5. HADOOP-3646.patch
        130 kB
        Abdul Qadeer

          Activity

          Abdul Qadeer added a comment -

          Updated patch file.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12385179/HADOOP-3646.patch
          against trunk revision 673517.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 6 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2790/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2790/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2790/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2790/console

          This message is automatically generated.

          Chris Douglas added a comment -

          A few nits:

          • Please remove the //............... and //*****// comments completely or replace them with meaningful javadoc
          • FeatureNotImplemented Exception can be replaced with UnsupportedOperationException
          • WrongBZip2Header can be replaced by an IOException with a meaningful description
          • Several close() methods contain commented-out code
          • Instead of returning null for methods like create(De|C)ompressor, throwing UnsupportedOperationException is preferred.
          • createInputStream and createOutputStream in Bzip2Codec should not simply ignore the Decompressor/Compressor, but throw an exception. This isn't really a problem, since one can't obtain these from the codec, but it's worth fixing.
          • createInputStream should not swallow the exception; changing WrongBZip2Header to an IOException should remove the need for this, anyway
          • The change to CompressionCodecFactory should be reverted; it's sufficient to add Bzip2Codec to the config

          I tried running wordcount on some sample bzip text and this worked perfectly.
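
Two of the suggestions above can be sketched with plain JDK types. This is a hedged illustration, not the patch's code: SketchBZip2Codec and its methods are hypothetical stand-ins for the codec. It throws UnsupportedOperationException instead of returning null, and it reports a bad stream header as an IOException with a descriptive message rather than through a custom exception type. The only format fact assumed is that a bzip2 stream begins with "BZh" followed by a level digit from '1' to '9'.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical stand-in for a codec; not the class from the patch. */
class SketchBZip2Codec {

    /** No standalone compressor exists here, so refuse instead of returning null. */
    static Object createCompressor() {
        throw new UnsupportedOperationException();
    }

    /** Checks the 4-byte stream header and reports problems as IOException. */
    static InputStream createInputStream(InputStream in) throws IOException {
        byte[] header = new byte[4];
        int read = 0;
        while (read < header.length) {
            int n = in.read(header, read, header.length - read);
            if (n < 0) {
                throw new IOException("Not a bzip2 stream: expected a 4-byte header, found "
                        + read + " byte(s)");
            }
            read += n;
        }
        if (header[0] != 'B' || header[1] != 'Z' || header[2] != 'h'
                || header[3] < '1' || header[3] > '9') {
            throw new IOException("Not a bzip2 stream: header must be \"BZh\" followed by a level 1-9");
        }
        return in; // a real codec would wrap `in` in a decompressing stream here
    }

    public static void main(String[] args) throws IOException {
        createInputStream(new ByteArrayInputStream(new byte[] {'B', 'Z', 'h', '9'})); // accepted
        try {
            createInputStream(new ByteArrayInputStream(new byte[] {'P', 'K', 3, 4}));
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because the failure is already an IOException, a caller that declares throws IOException has no reason to catch and swallow it.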

          Abdul Qadeer added a comment -

          Updated patch to resolve the problems mentioned by Hudson patch verifier and Chris Douglas

          Abdul Qadeer added a comment -

          Tried to resolve the issues raised by Hudson.

          Chris Douglas added a comment -

          Something is preventing this from being marked PA, so I'm going to resolve and reopen it to see if that restores the normal workflow.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12385894/HADOOP-3646.patch
          against trunk revision 677054.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 2 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2864/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2864/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2864/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2864/console

          This message is automatically generated.

          Abdul Qadeer added a comment -

          This patch tries to correct the bugs reported by FindBugs. I have left one warning unresolved: "MS_OOI_PKGPROTECT: Field should be moved out of an interface and made package protected." This warning arises from the Ant BZip2 code. As discussed in https://issues.apache.org/jira/browse/HADOOP-1823, we are using this bzip2 code for the short term. This warning, along with all the splitting-support requirements, will be posted to the Ant JIRA so that we can later remove this short-term copy of bzip2 and import newer bzip2 code as an external jar file.
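
For readers unfamiliar with MS_OOI_PKGPROTECT, the following sketch shows the general pattern that warning describes: an interface field that references a mutable object such as an array. All names here are hypothetical and do not come from the Ant bzip2 sources. It also shows one common remedy, which is to move the field into a final class and make it package-private.

```java
/** Demonstrates that an array "constant" declared in an interface stays writable. */
class MutableConstantDemo {
    public static void main(String[] args) {
        TableInInterface.TABLE[0] = 99; // would compile from any package
        System.out.println(TableInInterface.TABLE[0]); // prints 99
    }
}

/** The flagged pattern: an interface field referencing a mutable array. */
interface TableInInterface {
    int[] TABLE = {1, 2, 3}; // implicitly public static final, yet its elements can change
}

/** One remedy: a final holder class with a package-private field. */
final class TableHolder {
    static final int[] TABLE = {1, 2, 3}; // reachable only from the same package

    private TableHolder() {}
}
```

The final modifier protects only the reference, not the array contents, which is why the warning asks for narrower visibility.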

          Abdul Qadeer added a comment -

          The patch is not appearing in the Hudson running/pending list, so I am resubmitting it.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12386279/HADOOP-3646version3.patch
          against trunk revision 677781.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2895/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2895/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2895/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2895/console

          This message is automatically generated.

          Abdul Qadeer added a comment -

          Test cases included.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12386373/HADOOP-3646-version4.patch
          against trunk revision 677839.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2900/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2900/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2900/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2900/console

          This message is automatically generated.

          Chris Douglas added a comment -

          This is looking very good, particularly as a first pass. There are only a few minor tweaks remaining:

          • The latest patch includes an artifact from the original, which employed a try-catch block:
              public CompressionInputStream createInputStream(InputStream in)
                  throws IOException {
                CompressionInputStream compressionInputStream = null;
                compressionInputStream = new BZip2CompressionInputStream(in);
                return compressionInputStream;
              }
            
          • createInputStream(InputStream, Decompressor) throws UnsupportedOperationException while createOutputStream(OutputStream, Compressor) ignores the second argument. These should be symmetric, particularly since one cannot create a valid Compressor or Decompressor. The latter should also throw.
          • The ERROR_MESSAGE ("Feature currently not supported") message around an UnsupportedOperationException doesn't add any information and may be omitted; similarly, not only is the error message from the BZip2CompressionInputStream and BZip2CompressionOutputStream unconventionally descriptive, the purpose of adding constructors that only throw UnsupportedOperationException from a private, static inner class is not clear to me. Is either necessary? Wouldn't it make more sense to add these when the associated Compressor and Decompressors are included?
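
The shape these review points ask for might look like the sketch below. It is an assumption-laden illustration with plain JDK types: SymmetricCodecSketch and the nested Compressor and Decompressor marker interfaces are hypothetical, and the buffered streams are placeholders for real bzip2 streams. The single-argument factory returns the stream directly, without the leftover null initialization, and both two-argument overloads throw UnsupportedOperationException with no extra message.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical codec shape; buffered streams stand in for real bzip2 streams. */
class SymmetricCodecSketch {
    interface Compressor {}
    interface Decompressor {}

    /** Direct construction: no temporary variable initialized to null. */
    static InputStream createInputStream(InputStream in) {
        return new BufferedInputStream(in);
    }

    /** Unsupported, because no valid Decompressor can be obtained. */
    static InputStream createInputStream(InputStream in, Decompressor decompressor) {
        throw new UnsupportedOperationException();
    }

    static OutputStream createOutputStream(OutputStream out) {
        return new BufferedOutputStream(out);
    }

    /** Mirrors the input side instead of silently ignoring the Compressor. */
    static OutputStream createOutputStream(OutputStream out, Compressor compressor) {
        throw new UnsupportedOperationException();
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(new byte[0]);
        try {
            createInputStream(in, null);
        } catch (UnsupportedOperationException e) {
            System.out.println("two-argument createInputStream is unsupported");
        }
    }
}
```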
          Abdul Qadeer added a comment -

          Resolved the issues highlighted in the review.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12387075/HADOOP-3646-version5.patch
          against trunk revision 680577.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2967/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2967/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2967/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2967/console

          This message is automatically generated.

          Chris Douglas added a comment -

          I just committed this. Thanks, Abdul

          Abdul Qadeer added a comment -

          Hi Chris,

          As I am writing code to support splitting for bzip2, I want to discuss these changes with watchers/developers for feedback. Should I re-open 3646 for it or make a new JIRA entry?

          Thanks,
          Abdul Qadeer

          Chris Douglas added a comment -

          A new JIRA would be best. If you wanted to link the new issue to
          HADOOP-3646, that might be helpful, but it's not necessary. -C

          Hudson added a comment -

          Integrated in Hadoop-trunk #581 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/581/)

            People

            • Assignee:
              Abdul Qadeer
              Reporter:
              Abdul Qadeer
            • Votes:
              0
              Watchers:
              10

                Time Tracking

                Estimated:
                1,008h
                Remaining:
                1,008h
                Logged:
                Not Specified