Hadoop Common
HADOOP-4012

Providing splitting support for bzip2 compressed files

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.21.0
    • Fix Version/s: 0.21.0
    • Component/s: io
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      BZip2 files can now be split.

      Description

      Hadoop assumes that if the input data is compressed, it cannot be split (mainly due to the limitation of many codecs, which need the whole input stream to decompress successfully). In such a case, Hadoop prepares only one split per compressed file, where the lower split limit is at 0 and the upper limit is the end of the file. The consequence of this decision is that one compressed file goes to a single mapper. Although this circumvents the limitation of codecs mentioned above, it substantially reduces the parallelism that splitting would otherwise make possible.

      BZip2 is a compression/decompression algorithm that compresses blocks of data such that the compressed blocks can later be decompressed independently of each other. This is an opportunity: instead of one BZip2 compressed file going to one mapper, we can process chunks of the file in parallel. The correctness criterion for such processing is that, for a bzip2 compressed file, each compressed block should be processed by only one mapper and, ultimately, all the blocks of the file should be processed. (By processing we mean the actual utilization of the uncompressed data, coming out of the codec, in a mapper.)

      We are writing the code to implement this suggested functionality. Although we have used bzip2 as an example, we have tried to extend Hadoop's compression interfaces so that any other codec with the same capability as bzip2 could easily use the splitting support. The details of these changes will be posted when we submit the code.
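As a rough illustration of the correctness criterion described above (this is a sketch, not code from the patch; offsets and names are made up), each compressed block can be assigned to exactly one split by the rule that a block belongs to the split whose byte range contains the block's start offset:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BlockOwnership {
    // A block is owned by the split whose range [start, end) contains the
    // block's starting byte offset; each block therefore gets exactly one
    // owner, which is the criterion stated in the description above.
    static Map<Long, Integer> assignBlocks(long[] blockStarts, long[][] splits) {
        Map<Long, Integer> owner = new LinkedHashMap<>();
        for (long b : blockStarts) {
            for (int s = 0; s < splits.length; s++) {
                if (b >= splits[s][0] && b < splits[s][1]) {
                    owner.put(b, s);
                    break; // first matching split wins: exactly one owner
                }
            }
        }
        return owner;
    }
}
```

Blocks that straddle a split boundary are decompressed by the split that owns their start offset, so no block is processed twice and none is skipped.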

      1. C4012-14.patch
        61 kB
        Chris Douglas
      2. C4012-13.patch
        52 kB
        Abdul Qadeer
      3. C4012-12.patch
        55 kB
        Chris Douglas
      4. Hadoop-4012-version11.patch
        51 kB
        Abdul Qadeer
      5. Hadoop-4012-version10.patch
        51 kB
        Abdul Qadeer
      6. Hadoop-4012-version9.patch
        70 kB
        Abdul Qadeer
      7. Hadoop-4012-version8.patch
        70 kB
        Abdul Qadeer
      8. Hadoop-4012-version7.patch
        67 kB
        Abdul Qadeer
      9. Hadoop-4012-version6.patch
        51 kB
        Abdul Qadeer
      10. Hadoop-4012-version5.patch
        51 kB
        Abdul Qadeer
      11. Hadoop-4012-version4.patch
        52 kB
        Abdul Qadeer
      12. Hadoop-4012-version3.patch
        52 kB
        Abdul Qadeer
      13. Hadoop-4012-version2.patch
        52 kB
        Abdul Qadeer
      14. Hadoop-4012-version1.patch
        51 kB
        Abdul Qadeer

        Issue Links

          Activity

          Abdul Qadeer added a comment -

          Patch to add bzip2 splitting support.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12394477/Hadoop-4012-version1.patch
          against trunk revision 719787.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 2 new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3639/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3639/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3639/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3639/console

          This message is automatically generated.

          Matei Zaharia added a comment -

          I looked at the test failures and they do not seem to be related to this
          patch (they don't even run the fair scheduler).

          Abdul Qadeer added a comment -

          The following change was done in this new patch. Before this change, getPos()
          was returning values one less than what it should be. Similarly available() method
          was returning -1 because the value of count becomes -1 at the end of the chunk.

          I hope it will resolve at least two of the 4 bugs.
          ===================================================================
          --- src/core/org/apache/hadoop/fs/FSInputChecker.java	(revision 720041)
          +++ src/core/org/apache/hadoop/fs/FSInputChecker.java	(working copy)
          @@ -295,12 +295,12 @@
             @Override
             public synchronized long getPos() throws IOException
             {
          -    return chunkPos-(count-pos);
          +    return chunkPos-((count-pos)<0?0:count-pos);
             }

             @Override
             public synchronized int available() throws IOException
             {
          -    return count-pos;
          +    return (count-pos)<0?0:count-pos;
             }
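The intent of the change can be shown in isolation (a sketch mirroring the diff above; the values are illustrative, not taken from a real run):

```java
public class PosClamp {
    // Mirrors the fix above: at the end of a chunk, count becomes -1, so
    // (count - pos) can go negative; the buffered-byte count must be
    // clamped to 0 before it is used.
    static long getPos(long chunkPos, int count, int pos) {
        int buffered = count - pos;
        return chunkPos - (buffered < 0 ? 0 : buffered);
    }

    static int available(int count, int pos) {
        int buffered = count - pos;
        return buffered < 0 ? 0 : buffered;
    }
}
```

With count = -1 at end of chunk, getPos() now reports chunkPos itself and available() reports 0, instead of values off by one and -1 respectively.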
          Abdul Qadeer added a comment -

          Looks like Hudson is stuck on my patch. Should I cancel the patch and re-submit?

          Abdul Qadeer added a comment -

          Since Hudson queue seems stuck, I am canceling this patch.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12394526/Hadoop-4012-version2.patch
          against trunk revision 720162.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3644/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3644/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3644/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3644/console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12394714/Hadoop-4012-version3.patch
          against trunk revision 720930.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 2 new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3663/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3663/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3663/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3663/console

          This message is automatically generated.

          Abdul Qadeer added a comment -

          This patch fixes the warnings found by Findbugs.

          Chris Douglas added a comment -

          The following change was done in this new patch. Before this change, getPos()
          was returning values one less than what it should be. Similarly available() method
          was returning -1 because the value of count becomes -1 at the end of the chunk.

          Should this change be part of a separate issue, then? I'm not sure what you mean by "two of the 4 bugs", but bug fixes shouldn't be part of large, new features if the fix is unaffected by the feature.

          • This modifies TestMultipleCacheFiles to append a newline at the end of the file. Why is this necessary? Is this the same problem as HADOOP-4182?
          • Pushing the READ_MODE abstraction (and the new createInputStream) into the CompressionCodec interface, particularly when only bzip supports it, is inappropriate. If it's applicable to codecs other than bzip, it should be a separate interface (extending CompressionCodec?). This would also let instanceof replace canDecompressSplitInput and move seekBackwards to the new interface. Can you describe what it means for a codec to implement this superset of functions?
          • This patch incorporates HADOOP-4010:
            -    while (pos < end) {
            +    // We always read one extra line, which lies outside the upper
            +    // split limit i.e. (end - 1)
            +    pos = this.getPos();
            +    
            +    while (pos <= end) {
            
            +    // If this is not the first split, we always throw away first record
            +    // because we always (except the last split) read one extra line in
            +    // next() method.
            

            Shouldn't this remain with the original JIRA? Are the issues raised there addressed in this patch?

          • Does this add the Seekable interface to CompressionInputStream only to support getPos() for LineRecordReader?

          This affects too many core components to make the feature freeze for 0.20 (Fri).
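One way to read the separate-interface suggestion above (the names SplittableCompressionCodec and READ_MODE follow the discussion in this issue, but the shape below is a sketch, not the committed API):

```java
import java.io.IOException;
import java.io.InputStream;

public class SplittableCodecSketch {
    interface CompressionCodec {
        InputStream createInputStream(InputStream in) throws IOException;
    }

    // Only codecs that can resume decompression at a compressed-block
    // boundary implement this, so CompressionCodec itself stays untouched.
    interface SplittableCompressionCodec extends CompressionCodec {
        enum READ_MODE { CONTINUOUS, BYBLOCK }
        InputStream createInputStream(InputStream in, long start, long end,
                                      READ_MODE mode) throws IOException;
    }

    // A dummy splittable codec, used only for the demonstration below.
    static class FakeBzip2Codec implements SplittableCompressionCodec {
        public InputStream createInputStream(InputStream in) { return in; }
        public InputStream createInputStream(InputStream in, long start,
                long end, READ_MODE mode) { return in; }
    }

    // An instanceof check then replaces a canDecompressSplitInput() method
    // on every codec.
    static boolean isSplittable(CompressionCodec codec) {
        return codec instanceof SplittableCompressionCodec;
    }
}
```

Under this shape, callers that need splitting test for the narrower interface; ordinary codecs never see the new methods.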

          Abdul Qadeer added a comment -

          The following change was done in this new patch. Before this change, getPos()
          was returning values one less than what it should be. Similarly available() method
          was returning -1 because the value of count becomes -1 at the end of the chunk.

          Should this change be part of a separate issue, then? I'm not sure what you mean by "two of the 4 bugs", but bug fixes shouldn't be part of large, new features if the fix is unaffected by the feature.


          Hudson reported 4 core test case failures when it ran patch version 1. Two of them arose from my patch, while the other two were unrelated and went away on their own later. The new code in LineRecordReader now depends on the getPos() method of FSDataInputStream to find the current stream position, instead of keeping track of this position itself. And getPos() was returning -1 at the end, when in fact it should never go below 0. So I changed the getPos() and available() methods in src/core/org/apache/hadoop/fs/FSInputChecker.java.
          Since these changes were necessary for my code to work correctly, I included them here.

          This modifies TestMultipleCacheFiles to append a newline at the end of the file. Why is this necessary? Is this the same problem as HADOOP-4182?


          Yes, it is the same issue as mentioned in HADOOP-4182.

          Pushing the READ_MODE abstraction (and the new createInputStream) into the CompressionCodec interface, particularly when only bzip supports it, is inappropriate. If it's applicable to codecs other than bzip, it should be a separate interface (extending CompressionCodec?). This would also let instanceof replace canDecompressSplitInput and move seekBackwards to the new interface. Can you describe what it means for a codec to implement this superset of functions?


          Amongst the currently supported codecs, I think only BZip2 can decompress a block of data. But I am not sure that BZip2 is the only one out there (only a compression expert can tell which compression algorithms can decompress blocks of data without needing the whole compressed file). So the READ_MODE abstraction provides two modes of operation. The first one is "continuous", where a codec needs the whole compressed
          file to decompress successfully. The other mode is "By_Block", where a codec can decompress parts of a compressed file and the whole file is not needed. Hence such codecs can be used effectively with Hadoop-style splitting.

          "canDecompressSplitInput" asks a codec whether it can decompress a stream starting at some arbitrary position in a compressed file. Currently Hadoop assumes that compressed files must be handled by one mapper because splitting is not possible. This assumption
          is not true, at least in the case of BZip2.

          "seekBackwards" was needed for the cases when the Hadoop split start lies exactly on a BZip2 marker. These markers tell the algorithm that a new block is starting. For the bzip2 codec to work correctly, the stream needs to move back by about the size of the marker. Ideally this would be done in the codec class, but buffering at different levels made it difficult to move the stream backwards from within
          CBZip2InputStream. So "seekBackwards" means that, in LineRecordReader, Hadoop asks the codec whether it needs to move the stream back.
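For context (a hedged sketch, not code from the patch): every bzip2 compressed block begins with the 48-bit magic number 0x314159265359, and finding the next block from an arbitrary split offset means scanning the bit stream for that marker. The scan below also shows why a split that begins exactly on a marker needs the small backward seek: a scanner that starts even one bit past the marker's first bit will miss it.

```java
public class Bzip2MarkerScan {
    // 48-bit magic that begins every bzip2 compressed block.
    static final long BLOCK_MAGIC = 0x314159265359L;

    // Returns the bit offset at which the marker starts, scanning forward
    // from fromBit, or -1 if the marker is not found.
    static long findMarkerBit(byte[] data, long fromBit) {
        long window = 0;
        long totalBits = (long) data.length * 8;
        for (long bit = fromBit; bit < totalBits; bit++) {
            int b = (data[(int) (bit >> 3)] >> (7 - (int) (bit & 7))) & 1;
            window = ((window << 1) | b) & 0xFFFFFFFFFFFFL; // keep last 48 bits
            if (bit - fromBit >= 47 && window == BLOCK_MAGIC) {
                return bit - 47;
            }
        }
        return -1;
    }
}
```

A split whose start coincides with a marker would scan right past it; seeking back a few bytes before scanning guarantees the marker falls inside the scanned range.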

          Making a separate interface for BZip2-style codecs is another possibility. I had two options: either change the current interfaces or make a new one. I took the first option, leaving the final decision to the Hadoop committers. Changing it as suggested would mean a lot of changes and testing for me, but if the committers think it must be changed and cannot be accepted like this, then I will change it as suggested.

          This patch incorporates HADOOP-4010: Shouldn't this remain with the original JIRA? Are the issues raised there addressed in this patch?


          Yes, the code is the same as submitted in HADOOP-4010. I think the issues mentioned there do not affect the patch, and it looks fine. I posted it there but no one commented on it afterwards, so I had to include it here as well for my patch to work.

          Does this add the Seekable interface to CompressionInputStream only to support getPos() for LineRecordReader?


          Yes that is right.

          This affects too many core components to make the feature freeze for 0.20 (Fri)


          I think if release 0.20 is planned soon, we can change the affected version to the next one.

          Chris Douglas added a comment -

          Hudson reported 4 core test case failures when it ran patch version 1. Two of them arose from my patch, while the other two were unrelated and went away on their own later. [snip]

          Ah, I see. Sorry, I hadn't understood.

          Yes the code is the same as submitted in Hadoop-4010. I think the issues mentioned there does not affect the patch and it looks fine. I posted it there but no one later commented on it. So i had to include it here as well for my patch to work.

          Marking this issue as depending on or blocked by the related issues, and raising them as PA again, is better than creating a composite patch. If you feel you've addressed the points made by the commenter, it's appropriate to resubmit. There's a lot of traffic, and some responses will inevitably be overlooked.

          Amongst the currently supported codecs, I think only BZip2 can decompress block of a data. But I am not sure that BZip2 is the only one out there

          LZO used to be a counterexample, but it has since been removed (HADOOP-4874, also why the current patch no longer applies). It doesn't support seeking from arbitrary offsets, though. In any case: even recognizing the development overhead, adding the new methods to the existing interfaces seems like the more invasive approach, particularly since bzip is the only implementer. Given that LineRecordReader, NLineInputFormat, LineReader, and FSInputChecker are all heavily used, the prospect of accidentally destabilizing them to support an optional piece of functionality is unnerving. Are there any specific reasons why the existing approach is preferred? If one were to compress binary data with the bzip codec, would it be splittable and/or readable?

          Thanks for your patience with this.

          Zheng Shao added a comment -

          I like the idea of avoiding seeking back 1 byte in LineRecordReader. With this change, each mapper will usually open only 2 data blocks, while the old approach makes each mapper open 3 data blocks. Considering all the costs about it (opening TCP connection to the remote data node, disk seek to read 1 byte, etc) this seems to be a reasonable improvement.
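The reading convention under discussion (each split reads one extra line past its upper limit, and every split except the first discards its first line) can be simulated on plain text. This is a simplified model of LineRecordReader's behavior, not the patch code:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitLineReader {
    // Reads the lines belonging to the split [start, end) of data: a
    // non-first split skips everything up to and including the first
    // newline (the previous split already read that line), and a split
    // keeps reading while the next line starts at a position <= end,
    // i.e. it reads one extra line past its upper limit.
    static List<String> readSplit(String data, int start, int end) {
        List<String> out = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            int nl = data.indexOf('\n', pos);
            if (nl < 0) return out;          // no complete line begins here
            pos = nl + 1;
        }
        while (pos <= end && pos < data.length()) {
            int nl = data.indexOf('\n', pos);
            out.add(nl < 0 ? data.substring(pos) : data.substring(pos, nl));
            pos = (nl < 0) ? data.length() : nl + 1;
        }
        return out;
    }
}
```

Over contiguous splits covering the whole file, every line is read exactly once, and no split ever needs the 1-byte backward seek the old approach required.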

          Abdul Qadeer added a comment -

          Chris Douglas:

          I understand that any change to e.g. LineRecordReader (LRR) should be carried out with care, because it is among the most frequently used code. It might not have been impossible to implement the bzip2 codec with the older implementation of LRR, but there would have been a performance penalty for bzip2 (because it would then have to decompress one more block per mapper), and the code in the LRR constructor and the next() method would have been a mess.

          HADOOP-4010 was created separately, before the bzip2 work was finished, but unfortunately it couldn't get in. Owen O'Malley mentioned some concerns, to which I provided some logical hand waving. I guess Owen has been busy or is not convinced by those arguments, but he hasn't said anything. I plan to ask him again whether I should provide some test cases to make sure that his concerns are addressed.

          So, proceeding systematically, I first plan to get HADOOP-4010 approved and then move on to the recommended changes. I have successfully used the bzip2 codec on binary bzip2-compressed files. So I hope that, by going step by step, I will be able to get this work approved.

          Abdul Qadeer added a comment -

          This patch incorporates changes suggested by Chris Douglas.

          Suhas Gogate added a comment -

          Support for concatenated bzip2 compressed files is not yet included; throw an error until this new feature is supported, to let the user know about incompatible input.

          Abdul Qadeer added a comment -

          Now BZip2 can be invoked in either of two modes:

          CONTINUOUS
          BYBLOCK

          BYBLOCK is the mode which makes it possible to work with Hadoop's input splits, and it should work fine for a concatenation of many BZip2 compressed files as well. In the normal old mode (i.e. CONTINUOUS) the behavior is the same as it used to be. Currently the default mode is set to BYBLOCK, so the problem mentioned in HADOOP-5601 should be solved.
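          BYBLOCK-style splitting is feasible because each compressed bzip2 block begins with a fixed 48-bit magic number, 0x314159265359, so a reader dropped at an arbitrary offset can scan forward for the next block start. The sketch below is an illustration only (class and method names are made up): real bzip2 block boundaries are not byte-aligned, so the actual decompressor has to do this scan at bit granularity, whereas this toy version only checks byte-aligned positions.

```java
// Illustrative only: locate a bzip2 block delimiter (48-bit magic
// 0x314159265359) at byte granularity in a buffer. The real codec must
// scan bit by bit, since bzip2 blocks are not byte-aligned.
public class BzBlockMagicScan {
  static final byte[] BLOCK_MAGIC = {0x31, 0x41, 0x59, 0x26, 0x53, 0x59};

  public static int findBlockMagic(byte[] data, int from) {
    outer:
    for (int i = from; i + BLOCK_MAGIC.length <= data.length; i++) {
      for (int j = 0; j < BLOCK_MAGIC.length; j++) {
        if (data[i + j] != BLOCK_MAGIC[j]) continue outer;
      }
      return i; // candidate block start; decompression can begin here
    }
    return -1; // no block boundary in the remaining bytes
  }

  public static void main(String[] args) {
    byte[] fake = new byte[32]; // zeros, with the magic planted at offset 10
    System.arraycopy(BLOCK_MAGIC, 0, fake, 10, BLOCK_MAGIC.length);
    System.out.println(findBlockMagic(fake, 0)); // prints 10
  }
}
```

The same forward-scan idea also explains why concatenated bzip2 files can work in BYBLOCK mode: the scan simply finds the next block of the next stream.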

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12404241/Hadoop-4012-version5.patch
          against trunk revision 761082.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          -1 release audit. The applied patch generated 663 release audit warnings (more than the trunk's current 661 warnings).

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/100/testReport/
          Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/100/artifact/trunk/current/releaseAuditDiffWarnings.txt
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/100/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/100/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/100/console

          This message is automatically generated.

          Abdul Qadeer added a comment -

          Hudson's complaints about findbugs and release audit warnings are taken care of in this new patch.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12404711/Hadoop-4012-version6.patch
          against trunk revision 762216.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          -1 release audit. The applied patch generated 662 release audit warnings (more than the trunk's current 659 warnings).

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/154/testReport/
          Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/154/artifact/trunk/current/releaseAuditDiffWarnings.txt
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/154/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/154/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/154/console

          This message is automatically generated.

          Abdul Qadeer added a comment -

          One of my test cases, which generates data of different lengths and compresses and decompresses it at different split points, depends on HADOOP-5213. HADOOP-5213's patch is currently awaiting a Hudson run. Once HADOOP-5213 is committed, I will upload the new patch for HADOOP-4012.

          Abdul Qadeer added a comment -

          The patch was not appearing in the Hudson queue, so I am re-submitting it.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12408129/Hadoop-4012-version7.patch
          against trunk revision 776148.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 3 new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          -1 release audit. The applied patch generated 494 release audit warnings (more than the trunk's current 486 warnings).

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/356/testReport/
          Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/356/artifact/trunk/current/releaseAuditDiffWarnings.txt
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/356/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/356/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/356/console

          This message is automatically generated.

          Abdul Qadeer added a comment -

          Looks like there is some problem with Hudson, at least as far as the core and contrib tests are concerned. If you look at the Hudson queue (http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/) you will see that many JIRAs are ending up with the same 4 or 5 test case failures, e.g.:

          http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/362/#showFailuresLink

          http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/361/#showFailuresLink

          http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/360/#showFailuresLink

          http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/358/#showFailuresLink

          Similarly, Hudson sometimes complains about test failures but does not list the failed tests (as in this JIRA's case, and there are many others).

          I will test my patch on a local box and will post the results here.

          Abdul Qadeer added a comment -

          This patch fixes the release audit and Findbugs warnings.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12409127/Hadoop-4012-version8.patch
          against trunk revision 779807.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          -1 release audit. The applied patch generated 501 release audit warnings (more than the trunk's current 493 warnings).

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/422/testReport/
          Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/422/artifact/trunk/current/releaseAuditDiffWarnings.txt
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/422/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/422/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/422/console

          This message is automatically generated.

          Yuri Pradkin added a comment -

          OK, it looks like this patch is in reasonable shape now. The failing tests seem to be the ones failing for everybody else.
          We've been using an older version of the patch for some time now with a custom input format and have had consistent results.
          Not to mention the speedup due to parallelism.

          Can one of the committers please have a look?

          Abdul Qadeer added a comment -

          Patch to fix the release audit warnings.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12409554/Hadoop-4012-version9.patch
          against trunk revision 781602.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          -1 release audit. The applied patch generated 500 release audit warnings (more than the trunk's current 492 warnings).

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/459/testReport/
          Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/459/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/459/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/459/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/459/console

          This message is automatically generated.

          Abdul Qadeer added a comment -

          (1) Running the test-patch target on my local box does not produce release audit warnings:

          [exec]
          [exec] There appear to be 503 release audit warnings before the patch and 501 release audit warnings after applying the patch.
          [exec]
          [exec]
          [exec]
          [exec]
          [exec] +1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 3 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
          [exec]
          [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
          [exec]

          (2) The test org.apache.hadoop.hdfs.server.datanode.TestBlockReplacement.testBlockReplacement does not fail on my local box, and org.apache.hadoop.mapred.TestQueueCapacities.testMultipleQueues produces an error (but it does the same even for the committed SVN code).

          These 1 or 2 failures are occurring for other patches as well on Hudson, and they are unrelated to bzip2.

          Can the Hadoop committers please review the patch so that I can complete this work? Thank you.

          Chris Douglas added a comment -
          • If an IOException is ignored:
            +    if (!(in instanceof Seekable) || !(in instanceof PositionedReadable)) {
            +      try {
            +        this.maxAvailableData = in.available();
            +      } catch (IOException e) {
            +      }
            +    }
            

            Please include a comment explaining why it should be ignored. Won't this fail on the first call to getPos() if it doesn't throw in the constructor?

          • This change to TestCodec looks suspect:
            -    SequenceFile.Reader reader = new SequenceFile.Reader(fs, filePath, conf);
            -    
            +    SequenceFile.Reader reader = null;
            +    try{
            +    reader = new SequenceFile.Reader(fs, filePath, conf);
            +    }
            +    catch(Exception exp){}
            
          • Instead of the ternary operator in FSInputChecker, Math.max(0L, count - pos) is more readable.
          • With changes such as the following, will the case resolved in HADOOP-3144 continue to work in LineRecordReader?
            -      int newSize = in.readLine(value, maxLineLength,
            -                                Math.max((int)Math.min(Integer.MAX_VALUE, end-pos),
            -                                         maxLineLength));
            +      int newSize = lineReader.readLine(value, maxLineLength, Integer.MAX_VALUE);
            
          • Specifying the constant as 1L should avoid this:
            -    return (bsBuffShadow >> (bsLiveShadow - n)) & ((1 << n) - 1);
            +    final long one = 1;
            +    return (bsBuffShadow >> (bsLiveShadow - n)) & ((one << n) - 1);
            
          • The InputStreamCreationResultant struct seems unnecessary, but perhaps I'm not following the logic there. Of the three auxiliary params returned, end is unmodified from the actual and areLimitsChanged only avoids a redundant assignment to start. Since CompressionInputStream now implements Seekable, can this be replaced with start = in.getPos()?
          • CBZip2InputStream::read() shouldn't allocate a new byte[] for every read, but reuse an instance var. (0xFF also works, btw)
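          On the 1L point above: the mask breaks because Java masks an int shift distance to its low five bits, so (1 << n) is computed modulo 32. A standalone check (not the patch code) makes the failure mode concrete:

```java
// Why the literal must be 1L: for an int left operand, Java uses only the
// low 5 bits of the shift distance, so for n == 32 the expression (1 << n)
// silently becomes (1 << 0) == 1, and the mask collapses to zero.
public class ShiftMaskDemo {
  public static long intMask(int n)  { return (1 << n) - 1L;  } // buggy for n == 32
  public static long longMask(int n) { return (1L << n) - 1L; } // correct up to n == 63

  public static void main(String[] args) {
    System.out.println(intMask(32));   // prints 0 -- the low 32 bits vanish
    System.out.println(longMask(32));  // prints 4294967295 (0xFFFFFFFF)
  }
}
```

For n < 32 the two agree, which is why a bug like this can survive testing until a bit-reader asks for a full 32-bit field.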
          Abdul Qadeer added a comment -

          > The InputStreamCreationResultant struct seems unnecessary, but perhaps I'm not following the logic there. Of the three auxiliary params returned, end is unmodified from the actual and areLimitsChanged only avoids a redundant assignment to start. Since CompressionInputStream now implements Seekable, can this be replaced with start = in.getPos()?


          (1) I cannot use start = getPos() in LineRecordReader's constructor. The reason is that, for BZip2-compressed data, getPos() does not return the actual stream position; it is manipulated in BZip2Codec so that this hack works with LineRecordReader's start / end / pos logic. On the other hand, I need to do things (e.g. throwing away a line if this is not the first split) that require the accurate value of start. So two pieces of information were required from the method createInputStream(...):

          a) the new value of start
          b) the input stream

          The parameter "end" was also passed just in case some other future codec had the flexibility to change start or end or both.

          So I made that InputStreamCreationResultant because I wanted to return more than one things. To avoid making a new class, I can use some array which has inputstream on its 0th location and then the parameter start. But that will not be type safe.

          Another option might be to add another method in the SplitEnabledCompressionCodec to ask for the changed start value. So I will need to call createInputStream(...) and then call the new method e.g. getStart(). But the semantics of such a call were not neat and clear.

          Do we have any other option to avoid this "InputStreamCreationResultant"?
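To make the "return more than one thing" problem concrete, here is a minimal sketch of the kind of type-safe holder being discussed. The class and field names are illustrative, not the actual InputStreamCreationResultant from the patch:

```java
import java.io.InputStream;

// Hypothetical multi-value result holder: a codec returns the stream it
// created together with the (possibly adjusted) split boundaries, instead
// of an untyped Object[] array.
public class StreamAndStart {
    public final InputStream in;
    public final long start;  // split start, possibly moved by the codec
    public final long end;    // passed through in case a future codec adjusts it

    public StreamAndStart(InputStream in, long start, long end) {
        this.in = in;
        this.start = start;
        this.end = end;
    }
}
```

Compared with an `Object[]` return, the holder costs one small class but keeps the compiler checking that callers read the stream and offsets with the right types.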


          (2) I have fixed the rest of the comments. Once this point is clear, I will prepare the final patch.

          Yuri Pradkin added a comment -

          Abdul, I think at this point it's OK to assume that Chris has accepted your explanation. Can you please post your latest patch with all the fixes? Beware that the repository has changed: you'll need to switch to
          http://svn.apache.org/repos/asf/hadoop/common/trunk/ or check out a fresh copy and reapply your patch.

          Guys, can we please wrap this up? It's taking way too long for this patch to make it. Please. Thanks.

          Owen O'Malley added a comment -

          -1 to changing LineRecordReader. In particular, you've undone changes that were made by other jiras. This is very very touchy code that the current version of this patch breaks.

          I really think that this is better done by using a separate BzipTextInputFormat and BzipLineRecordReader.

          Yuri Pradkin added a comment -

          -1 to changing LineRecordReader. In particular, you've undone changes that were made by other jiras. This is very very touchy code that the current version of this patch breaks.

          Can you please be more specific: what changes were undone? Can you please elaborate on what exactly this patch breaks (perhaps we need more tests?)?

          The code indeed is very heavily used, but I think it's also very well beaten upon by various tests.

          I really think that this is better done by using a separate BzipTextInputFormat and BzipLineRecordReader.

          I think the point of having splitting built in is that all readers/formats can avoid re-implementing common things. We are currently using a binary-format reader that works just fine with this patch with only minor changes. Moreover, the idea is to work out a common framework in which other block-compressed formats could be processed in a similar manner. The alternative you're suggesting is to have BzipTextInputFormat, XXXBlockCompressInputFormat, YYYBlockCompressInputFormat, and so on.

          Owen O'Malley added a comment -

          Can you please be more specific: what changes were undone?

          It was the feature from HADOOP-3144 that was dropped by mistake. But that is just indicative of the problem. LineRecordReader is very delicate code that is very easy to break. Changes to this code are hard to verify and it is used by many many Hadoop applications.

          I think the point of having splitting built in - is that all readers/formats can avoid re-implementing common things.

          This patch only seems to support TextInputFormat and Bzip. I don't see abstractions that would make this portable to other codecs or input formats. Maybe combining your binary format with BzipTextInputFormat makes sense? For binary files, SequenceFile is already splittable with bzip with no changes.

          Abdul Qadeer added a comment -

          I totally agree with Owen that LineRecordReader (LRR) is heavily used code in Hadoop and that changes to such code should be made very carefully. But that doesn't mean LRR is closed to improvements and feature enhancements. LRR deals with text data, and Hadoop uses it as the default because text is probably the most common kind of input. I think the same is true for BZip2-compressed files: if not all, most of them are compressed text. This patch provides the following feature to the end user.

          The user puts his BZip2-compressed text files in an input directory and submits the Hadoop job. Soon he gets the result in the output directory. That is it!

          He didn't have to write or mention:

          any input format
          any record reader

          And all this happens using the full cluster CPU power (due to BZip2 splitting). Also, our algorithm doesn't demand any specific kind of splits from Hadoop; it works with whatever splits are provided to it.
          .............................................................................

          Now comes the correctness of LRR code.

          The testFormat() test case in org.apache.hadoop.mapred.TestTextInputFormat is a very stringent check of LRR's correctness. And since this test case is run whenever someone submits and tests a patch, I don't think any correctness problem in LRR can escape.

          Now come things like HADOOP-3144. That was my mistake, as Chris also mentioned in his comments. I have corrected it and will upload the new patch soon after running the tests locally. I think these are the places where the vision and knowledge of the Hadoop committers come in handy. If you see that I have missed something else besides HADOOP-3144, please tell me and I will fix it. Additionally, Hadoop has a very active user community, and any latent bug in the code (especially code that is used heavily) cannot hide for long. Also, serious businesses using Hadoop usually use only a stable version (e.g. I think Yahoo still uses 0.18, while 0.20 is out).
          So I see a safe transition to the changed LRR code.

          ........................................................
          Now comes the point of making a new BZip2 specific input format / record reader.

          If we make a separate reader for BZip2, e.g. BZip2LineRecordReader, most of its code will be a replica of LRR. By the same logic, GZip support should also come out of LRR into a GZipLineRecordReader, again mostly a replica of LRR. Whenever a new codec comes to Hadoop, we would make a new reader whose code is 99% a replica of LRR. So in my view it makes more sense to handle line-oriented text (plain or compressed) in LRR.

          When we were adding splitting support for BZip2, we felt there might be situations where codecs want to change/manipulate the split start or end, so we added support for that. Similarly, instead of LRR counting read bytes itself, it now asks the stream for its position. This makes it possible for a codec to control indirectly how long a reader should keep reading. We think these features can help other block-based codecs support splitting without further changes to LRR.
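The position-driven read loop described here can be sketched in isolation. `PositionedStream`, `countRecords`, and the toy record format below are hypothetical stand-ins, not the LineRecordReader API: the point is only that the reader stops based on the position the stream reports, so a codec that reports block-aligned positions indirectly controls when reading ends.

```java
public class PosLoopDemo {
    // Minimal stand-in for a stream that reports its own position,
    // as a splittable codec's stream would.
    public interface PositionedStream {
        long getPos();
        String readLine();  // returns null at end of data
    }

    // Reads records while the stream-reported position is below the split end.
    // The reader never counts bytes itself; it trusts getPos().
    static int countRecords(PositionedStream in, long end) {
        int n = 0;
        while (in.getPos() < end && in.readLine() != null) {
            n++;
        }
        return n;
    }
}
```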
          ..............................................................

          So, in summary, this patch might need relatively stringent review by the committers because of how heavily the code it touches is used, but it adds useful functionality in a seamless way for the end user.

          Abdul Qadeer added a comment -

          I am making a new patch to accommodate the concerns raised by Chris and Owen. A few files are in the "common" sub-project while the rest are in "mapreduce". Since the older Hadoop project has been split into 3 sub-projects, how should I make the new patch?

          Should I make two patches or merge them to make one?

          Abdul Qadeer added a comment -

          The older patch has been split into two (due to the new Hadoop sub-projects: common, mapreduce, and HDFS). This patch has the work related to common, while the portion going into MapReduce will be added soon on a separate MapReduce JIRA.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12415479/Hadoop-4012-version10.patch
          against trunk revision 800919.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

          -1 javac. The applied patch generated 118 javac compiler warnings (more than the trunk's current 116 warnings).

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/591/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/591/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/591/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/591/console

          This message is automatically generated.

          Abdul Qadeer added a comment -

          Fixes javadoc and javac warnings.

          Abdul Qadeer added a comment -

          The MapReduce-related stuff from the original HADOOP-4012 has been shifted to MAPREDUCE-830 (https://issues.apache.org/jira/browse/MAPREDUCE-830). It was necessary because of the recent forking of Hadoop into three sub-projects. MAPREDUCE-830 will only compile once this HADOOP-4012 is committed.

          Abdul Qadeer added a comment -

          Re-submitting because the patch is not appearing in the Hudson queue.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12415766/Hadoop-4012-version11.patch
          against trunk revision 802224.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/594/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/594/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/594/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/594/console

          This message is automatically generated.

          Abdul Qadeer added a comment -

          This is a reminder for the Hadoop committers that HADOOP-4012 is ready to be reviewed.

          Yuri Pradkin added a comment -

          Gentle Hadoop folk,

          may I ask what we are waiting for before committing this patch? If somebody is unhappy with this patch, it would be good to know their concern(s); likewise, if there are internal discussions going on, it would be wonderful if somebody occasionally filled us in. This patch has been in the queue for a long time, and now that Hudson is finally happy with it (and I know it is hard to satisfy, for various reasons), can somebody take charge and check it in? If you are unwilling to check it in, please explain!

          I think by delaying it, we're risking running into conflicts yet again and creating unnecessary work for ourselves.

          Thank you!

          Chris Douglas added a comment -

          Fair point.

          I will try to complete a review this weekend.

          John Heidemann added a comment -

          Thanks!

          Chris Douglas added a comment -

          This is what we were looking for. Thank you.

          • CBZip2InputStream::skipToNextMarker should add IOException (for bsR) to its throws list, not Exception, and throw IllegalArgumentException if markerBitLength is larger than 63. This permits the removal of the empty catch clauses in calls to this method. The javadoc should include a description of this behavior, and note that marker is the EOB delimiter (w/ @param javadoc directives). The javadoc may be self-evident, but it's a public API.
          • updateProcessedByteCount and updateReportedByteCount are public methods, but they seem very, very specialized. The comment explains that the client may manipulate the compressed stream, but any client sophisticated enough to do that is likely bound to this class. Would making these methods protected make sense?
          • This method:
            +  private static void reportCRCError() throws IOException {
            +    throw new IOException("crc error");
            +  }
            

            Seems unnecessary. If the intent is to add a hook, then the method should be protected and non-static. Otherwise, there's no reason not to simply throw the IOE at its callers.

          • The API issue with InputStreamCreationResultant is clearer to me, now (I hadn't seen why a reader might need a modified start). Other than synchronizing on the codec before creating new streams (to avoid the race condition), I don't see a better way to do this without pushing other API changes. Unless someone has a better idea, I think documenting this requirement on SplittableCompressionCodec is sufficient for now (and making these methods synchronized in BZip2Codec).
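The validation asked for in the first bullet might look like the following in isolation. `MarkerScanner` and `checkMarkerBitLength` are illustrative names for a sketch, not the actual CBZip2InputStream::skipToNextMarker method:

```java
public class MarkerScanner {
    // Illustrative guard: rejecting markerBitLength > 63 up front with an
    // unchecked IllegalArgumentException means skipToNextMarker can declare
    // only IOException, and callers need no empty catch clauses.
    public static void checkMarkerBitLength(int markerBitLength) {
        if (markerBitLength > 63) {
            throw new IllegalArgumentException(
                "markerBitLength must be at most 63: " + markerBitLength);
        }
    }
}
```

The 63-bit limit follows from the marker being accumulated in a `long`: shifting a 64-bit accumulator by the marker length must not overflow.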
          Abdul Qadeer added a comment -

          The API issue with InputStreamCreationResultant is clearer to me, now (I hadn't seen why a reader might need a modified start). Other than synchronizing on the codec before creating new streams (to avoid the race condition), I don't see a better way to do this without pushing other API changes. Unless someone has a better idea, I think documenting this requirement on SplittableCompressionCodec is sufficient for now (and making these methods synchronized in BZip2Codec).

          From https://issues.apache.org/jira/browse/MAPREDUCE-830
          Though it's not changed in bzip, since getEnd is part of the API, it should be called in LineRecordReader.
          Since the codec has state, the API demands that LineRecordReader synchronize on the codec before creating a splittable stream and calling getStart and getEnd to avoid race conditions (unless a better solution is found in HADOOP-4012)

          Is the code in LineRecordReader executed by multiple threads? I found some discussion about it at http://issues.apache.org/jira/browse/HADOOP-3554 but that discussion probably didn't conclude.

          Even if it is run by multiple threads, I am not able to see the race condition, because start / end are changed only once, in the constructor (I am assuming that the LineRecordReader constructor is not called by multiple threads simultaneously).

          The only problem I see is that LineRecordReader should not forget to call the getStart() method after calling createInputStream(). Put another way, anyone using SplittableCompressionCodec must call getStart() / getEnd() after creating the stream.
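The contract being debated here can be sketched with a toy stateful codec. `StatefulCodec` and its `openSplit` helper are hypothetical stand-ins, not the real SplittableCompressionCodec API; the sketch only shows why the createInputStream()/getStart() pair must stay atomic per codec instance if a codec is ever shared between threads:

```java
public class StatefulCodec {
    // State written by createInputStream() and read later by getStart();
    // this is what makes an unsynchronized shared codec racy.
    private long adjustedStart;

    public void createInputStream(long requestedStart) {
        // A splittable codec may move the split start to the next
        // compressed-block boundary; +7 is an illustrative adjustment only.
        this.adjustedStart = requestedStart + 7;
    }

    public long getStart() {
        return adjustedStart;
    }

    // Callers synchronize on the codec so no other thread can interleave its
    // own createInputStream() between this pair of calls.
    public long openSplit(long requestedStart) {
        synchronized (this) {
            createInputStream(requestedStart);
            return getStart();
        }
    }
}
```

With one reader per codec instance (as in LineRecordReader) the lock is uncontended and effectively free, which is why the reviewers treat the synchronization as cheap insurance.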

          So I am a little confused about the race condition comments. Can you please help me understand?

          Thanks.

          Chris Douglas added a comment -

          True, there isn't a threading issue with LineRecordReader. However, this is the only codec with state; SplittableCompressionCodec won't be used only in LineRecordReader, and BZip2Codec may not be the last splittable codec. Though I don't think most codecs are accessed by multiple threads, synchronization is cheap, and the first user to implement a splittable reader with a shared codec will almost certainly cargo-cult this implementation and miss the race condition. If you'd rather leave it this way, then OK.
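          The stateful-codec contract and the race being discussed can be sketched as follows. This is a minimal illustration, not the Hadoop API: StatefulCodec, CodecStateDemo, and the 4-byte start adjustment are all hypothetical stand-ins for a codec that stores the adjusted split start as internal state.

          ```java
          // Hypothetical sketch: a codec whose createInputStream() mutates internal
          // state (the adjusted split start), so a reader must call getStart() after
          // createInputStream() -- and, if the codec instance is shared between
          // threads, both calls must happen inside one synchronized block.
          class StatefulCodec {
            private long start;                 // state mutated by createInputStream()

            long createInputStream(long requestedStart) {
              // Pretend the codec moved the split start forward to the next block
              // marker; the +4 adjustment is purely illustrative.
              this.start = requestedStart + 4;
              return this.start;
            }

            long getStart() { return start; }
          }

          public class CodecStateDemo {
            public static void main(String[] args) {
              StatefulCodec codec = new StatefulCodec();
              long adjusted;
              // The contract under discussion: createInputStream, *then* getStart,
              // atomically with respect to any other thread sharing this codec.
              // Without the lock, another thread's createInputStream() could run
              // between the two calls and getStart() would return its start instead.
              synchronized (codec) {
                codec.createInputStream(100L);
                adjusted = codec.getStart();
              }
              System.out.println(adjusted);     // 104
            }
          }
          ```

          Making these methods synchronized inside the codec, as suggested for BZip2Codec, would move this locking out of the callers.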

          There was another part of the patch, in BZip2Codec::createInputStream:

          +    if (in.getPos() <= start) {
          +      ((Seekable) seekableIn).seek(start);
          +      in = this.createInputStream(seekableIn, readMode);
          +    }
          

          This drops the stream it created above, which wrapped the stream passed in with a CBZip2InputStream and BufferedInputStream. It's not clear why the stream is being re-created, either... particularly since the start stored in the codec is left alone. What case is being handled here?

          Chris Douglas added a comment -

          I tried a version of this using a supertype of CompressionInputStream instead of the semantics tried so far (voiding the synchronization discussion). It doesn't incorporate the other changes discussed.

          Abdul Qadeer added a comment -
          +    if (in.getPos() <= start) {
          +      ((Seekable) seekableIn).seek(start);
          +      in = this.createInputStream(seekableIn, readMode);
          +    }
          

          This drops the stream it created above, which wrapped the stream passed in with a CBZip2InputStream and BufferedInputStream. It's not clear why the stream is being re-created, either... particularly since the start stored in the codec is left alone. What case is being handled here?

          The reason to re-create the stream when in.getPos() <= start is to handle cases like the following:

          Assume [BBBBBB] represents a BZip2 marker and d is a single compressed data element (this can happen
          e.g. due to BZip2 concatenation).

          There is some extra information at the start of the stream, i.e. the BZh0 header.

          ^ indicates where the stream currently is:

          [BZh0BBBBBB]d[BBBBBB]d[BBBBBB]d[BBBBBB]
          _______________________________ ^
          
          I go back 10 bytes in the stream before searching for a marker.  The reason
          is that the first 'marker' is 10 bytes long, while all the others are 6 bytes long.
          
          So after going backwards the stream position is as follows:
          
          [BZh0BBBBBB]d[BBBBBB]d[BBBBBB]d[BBBBBB]
          __________________ ^
          
          Now finding the next marker might align us with the wrong marker, as follows:
          
          [BZh0BBBBBB]d[BBBBBB]d[BBBBBB]d[BBBBBB]
          ______________________ ^
          

          So the code mentioned above handles such cases. But you rightly mention that I should have done this.start = start at the end of that code as well.
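          The alignment problem described above can be illustrated with a bit-level scan for the 48-bit bzip2 block delimiter (0x314159265359). This is a standalone sketch, not the patch's CBZip2InputStream code; MarkerScanDemo and findMarker are hypothetical names. A scan that begins even one bit past a marker's start can only find the *next* marker:

          ```java
          // Hypothetical illustration: scanning forward for the bzip2 block delimiter
          // only finds markers that begin at or after the current bit position, which
          // is why a reader already positioned past `start` must seek back and
          // re-create the stream rather than scan forward from where it is.
          public class MarkerScanDemo {
            static final long MARKER = 0x314159265359L;   // bzip2 block-start magic
            static final int MARKER_BITS = 48;

            // Return the bit offset of the first marker at or after fromBit, or -1.
            static long findMarker(byte[] data, long fromBit) {
              long window = 0;
              int have = 0;
              long totalBits = (long) data.length * 8;
              for (long bit = fromBit; bit < totalBits; bit++) {
                int b = (data[(int) (bit / 8)] >> (7 - (int) (bit % 8))) & 1;
                window = ((window << 1) | b) & ((1L << MARKER_BITS) - 1);
                have = Math.min(have + 1, MARKER_BITS);
                if (have == MARKER_BITS && window == MARKER) {
                  return bit - MARKER_BITS + 1;           // offset where marker begins
                }
              }
              return -1;
            }

            public static void main(String[] args) {
              // Two markers back to back, beginning at bit 0 and at bit 48.
              byte[] data = {0x31, 0x41, 0x59, 0x26, 0x53, 0x59,
                             0x31, 0x41, 0x59, 0x26, 0x53, 0x59};
              System.out.println(findMarker(data, 0));    // 0  : aligned scan finds it
              System.out.println(findMarker(data, 1));    // 48 : one bit late, we skip
                                                          //      to the next marker
            }
          }
          ```

          This is why, once in.getPos() has advanced past start, the code seeks back to start and re-creates the stream instead of continuing to scan forward.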

          I tried a version of this using a supertype of CompressionInputStream instead of the semantics tried so far (voiding the synchronization discussion). It doesn't incorporate the other changes discussed.

          The new version looks fine to me. Let me incorporate the other changes you mentioned into it and put the new patch on the JIRA.

          John Heidemann added a comment -

          Abdul, a comment explaining that case (or summarizing it and pointing at the JIRA) seems important, rather than just re-creating the stream with no motivation.

          Abdul Qadeer added a comment -

          All the issues discussed have been resolved and merged into Chris' changes in C4012-12.patch, except making the
          CBZip2InputStream::updateReportedByteCount(...) method protected. This method is called from

          private static class BZip2CompressionInputStream extends SplitCompressionInputStream

          So I have left it public.

          The corresponding patch in MAPREDUCE-830 doesn't need any changes and is current.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12418490/C4012-13.patch
          against trunk revision 811572.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/18/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/18/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/18/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/18/console

          This message is automatically generated.

          Chris Douglas added a comment -

          Added a unit test for LineReader/SplittableCompressionCodec to TestCodec.

          Chris Douglas added a comment -

          It should be noted that "SplittableCompressionCodec" (with a double "t") chooses a different spelling than FileInputFormat::isSplitable (single "t"). Popular search engines return more results for the former than the latter, but if anyone cares about consistency on this point, now's the time to object.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12418879/C4012-14.patch
          against trunk revision 812317.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/23/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/23/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/23/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/23/console

          This message is automatically generated.

          Chris Douglas added a comment -

          +1

          I committed this. Thanks, Abdul!

          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #25 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk-Commit/25/)
          HADOOP-4012. Provide splitting support for bzip2 compressed files. Contributed by Abdul Qadeer

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #29 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/29/)
          HADOOP-4012. Provide splitting support for bzip2 compressed files. Contributed by Abdul Qadeer

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #26 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/26/)
          HADOOP-4012. Provide splitting support for bzip2 compressed files. Contributed by Abdul Qadeer

          Hudson added a comment -

          Integrated in Hadoop-Common-trunk #92 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/92/)
          HADOOP-4012. Provide splitting support for bzip2 compressed files. Contributed by Abdul Qadeer

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #80 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/80/)
          HADOOP-4012. Provide splitting support for bzip2 compressed files. Contributed by Abdul Qadeer

          Hudson added a comment -

          Integrated in Hdfs-Patch-h5.grid.sp2.yahoo.net #26 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/26/)

          Hudson added a comment -

          Integrated in Hdfs-Patch-h2.grid.sp2.yahoo.net #6 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/6/)
          HADOOP-4012. Provide splitting support for bzip2 compressed files. Contributed by Abdul Qadeer

          Robert Chansler added a comment -

          Editorial pass over all release notes prior to publication of 0.21.

          Tim Broberg added a comment -

          Is it possible to get this committed to a stable release such as 0.20.206?

          I will see if I can get commit rights to submit a patch.

          - Tim.
          Matt Foley added a comment -

          per request by Tim Broberg.

          Matt Foley added a comment -

          Since this bug has been closed and cannot be reopened, I've created a new bug for the port to branch-0.20-security, HADOOP-7823. Please submit a patch to that Jira and we'll get it reviewed. Thanks.


            People

            • Assignee: Abdul Qadeer
            • Reporter: Abdul Qadeer
            • Votes: 2
            • Watchers: 25