Issue Details (XML | Word | Printable)

Key: HADOOP-4012
Type: New Feature New Feature
Status: Resolved Resolved
Resolution: Fixed
Priority: Major Major
Assignee: Abdul Qadeer
Reporter: Abdul Qadeer
Votes: 2
Watchers: 24
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

Providing splitting support for bzip2 compressed files

Created: 23/Aug/08 03:39 AM   Updated: 08/Oct/09 04:04 AM
Return to search
Component/s: io
Affects Version/s: 0.21.0
Fix Version/s: 0.21.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works C4012-12.patch 2009-09-02 12:06 AM Chris Douglas 55 kB
Text File Licensed for inclusion in ASF works C4012-13.patch 2009-09-03 09:04 AM Abdul Qadeer 52 kB
Text File Licensed for inclusion in ASF works C4012-14.patch 2009-09-08 05:13 AM Chris Douglas 61 kB
Text File Licensed for inclusion in ASF works Hadoop-4012-version1.patch 2008-11-22 07:58 AM Abdul Qadeer 51 kB
Text File Licensed for inclusion in ASF works Hadoop-4012-version10.patch 2009-08-04 10:23 AM Abdul Qadeer 51 kB
Text File Licensed for inclusion in ASF works Hadoop-4012-version11.patch 2009-08-06 05:49 PM Abdul Qadeer 51 kB
Text File Licensed for inclusion in ASF works Hadoop-4012-version2.patch 2008-11-24 02:16 AM Abdul Qadeer 52 kB
Text File Licensed for inclusion in ASF works Hadoop-4012-version3.patch 2008-11-26 02:29 AM Abdul Qadeer 52 kB
Text File Licensed for inclusion in ASF works Hadoop-4012-version4.patch 2008-11-28 05:45 AM Abdul Qadeer 52 kB
Text File Licensed for inclusion in ASF works Hadoop-4012-version5.patch 2009-03-31 02:39 PM Abdul Qadeer 51 kB
Text File Licensed for inclusion in ASF works Hadoop-4012-version6.patch 2009-04-06 11:09 AM Abdul Qadeer 51 kB
Text File Licensed for inclusion in ASF works Hadoop-4012-version7.patch 2009-05-14 12:27 PM Abdul Qadeer 67 kB
Text File Licensed for inclusion in ASF works Hadoop-4012-version8.patch 2009-05-27 06:20 AM Abdul Qadeer 70 kB
Text File Licensed for inclusion in ASF works Hadoop-4012-version9.patch 2009-06-01 11:08 AM Abdul Qadeer 70 kB
Issue Links:
Blocker
 
Dependants
 
Reference
 
dependent
 

Hadoop Flags: Reviewed
Release Note: BZip2 files can now be split.
Resolution Date: 10/Sep/09 08:52 PM


 Description  « Hide
Hadoop assumes that if the input data is compressed, it can not be split (mainly due to the limitation of many codecs that they need the whole input stream to decompress successfully). So in such a case, Hadoop prepares only one split per compressed file, where the lower split limit is at 0 while the upper limit is the end of the file. The consequence of this decision is that, one compress file goes to a single mapper. Although it circumvents the limitation of codecs (as mentioned above) but reduces the parallelism substantially, as it was possible otherwise in case of splitting.

BZip2 is a compression / De-Compression algorithm which does compression on blocks of data and later these compressed blocks can be decompressed independent of each other. This is indeed an opportunity that instead of one BZip2 compressed file going to one mapper, we can process chunks of file in parallel. The correctness criteria of such a processing is that for a bzip2 compressed file, each compressed block should be processed by only one mapper and ultimately all the blocks of the file should be processed. (By processing we mean the actual utilization of that un-compressed data (coming out of the codecs) in a mapper).

We are writing the code to implement this suggested functionality. Although we have used bzip2 as an example, but we have tried to extend Hadoop's compression interfaces so that any other codecs with the same capability as that of bzip2, could easily use the splitting support. The details of these changes will be posted when we submit the code.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Abdul Qadeer made changes - 23/Aug/08 03:40 AM
Field Original Value New Value
Description Hadoop assumes that if the input data is compressed, it can not be split (mainly due to the limitation of many codecs that they need the whole input stream to decompress successfully). So in such a case, Hadoop prepares only one split per compressed file, where the lower split limit is at 0 while the upper limit is the end of the file. The consequence of this decision is that, one compress file goes to a single mapper. Although it circumvents the limitation of codecs (as mentioned above) but reduces the parallelism substantially, as it was possible otherwise in case of splitting.

BZip2 is a compression / De-Compression algorithm which does compression on blocks of data and later these compressed blocks can be decompressed independent of each other. This is indeed an opportunity that instead of one BZip2 compressed file going to one mapper, we can process chunks of file in parallel. The correctness criteria of such a processing is that for a bzip2 compressed file, each compressed block should be processed
by only one mapper and ultimately all the blocks of the file should be processed. (By processing we mean the actual utilization of that un-compressed data (coming out of the codecs) in a mapper).

We are writing the code to implement this suggested functionality. Although we have used bzip2 as an example, but we have tried to extend Hadoop's
compression interfaces so that any other codecs with the same capability as that of bzip2, could easily use the splitting support. The details of these changes will be posted when we submit the code.
Hadoop assumes that if the input data is compressed, it can not be split (mainly due to the limitation of many codecs that they need the whole input stream to decompress successfully). So in such a case, Hadoop prepares only one split per compressed file, where the lower split limit is at 0 while the upper limit is the end of the file. The consequence of this decision is that, one compress file goes to a single mapper. Although it circumvents the limitation of codecs (as mentioned above) but reduces the parallelism substantially, as it was possible otherwise in case of splitting.

BZip2 is a compression / De-Compression algorithm which does compression on blocks of data and later these compressed blocks can be decompressed independent of each other. This is indeed an opportunity that instead of one BZip2 compressed file going to one mapper, we can process chunks of file in parallel. The correctness criteria of such a processing is that for a bzip2 compressed file, each compressed block should be processed by only one mapper and ultimately all the blocks of the file should be processed. (By processing we mean the actual utilization of that un-compressed data (coming out of the codecs) in a mapper).

We are writing the code to implement this suggested functionality. Although we have used bzip2 as an example, but we have tried to extend Hadoop's compression interfaces so that any other codecs with the same capability as that of bzip2, could easily use the splitting support. The details of these changes will be posted when we submit the code.
Abdul Qadeer made changes - 23/Aug/08 03:46 AM
Link This issue relates to HADOOP-3646 [ HADOOP-3646 ]
Abdul Qadeer made changes - 23/Aug/08 03:49 AM
Link This issue depends on HADOOP-4010 [ HADOOP-4010 ]
Abdul Qadeer made changes - 23/Aug/08 03:50 AM
Link This issue depends on HADOOP-4010 [ HADOOP-4010 ]
Abdul Qadeer made changes - 23/Aug/08 03:51 AM
Link This issue depends upon HADOOP-4010 [ HADOOP-4010 ]
Abdul Qadeer added a comment - 22/Nov/08 07:58 AM
Patch to add bzip2 splitting support.

Abdul Qadeer made changes - 22/Nov/08 07:58 AM
Attachment Hadoop-4012-version1.patch [ 12394477 ]
Abdul Qadeer made changes - 22/Nov/08 07:59 AM
Fix Version/s 0.19.1 [ 12313473 ]
Status Open [ 1 ] Patch Available [ 10002 ]
Hadoop QA added a comment - 23/Nov/08 07:21 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12394477/Hadoop-4012-version1.patch
against trunk revision 719787.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 9 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

-1 findbugs. The patch appears to introduce 2 new Findbugs warnings.

+1 Eclipse classpath. The patch retains Eclipse classpath integrity.

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3639/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3639/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3639/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3639/console

This message is automatically generated.


Matei Zaharia added a comment - 23/Nov/08 09:32 PM
I looked at the test failures and they do not seem to be related to this
patch (they don't even run the fair scheduler).

Abdul Qadeer made changes - 24/Nov/08 02:09 AM
Status Patch Available [ 10002 ] In Progress [ 3 ]
Abdul Qadeer added a comment - 24/Nov/08 02:16 AM
The following change was done in this new patch. Before this change, getPos()
was returning values one less than what it should be. Similarly available() method
was returning -1 because the value of count becomes -1 at the end of the chunk.

I hope it will resolve at least two of the 4 bugs.
===================================================================
— src/core/org/apache/hadoop/fs/FSInputChecker.java (revision 720041)
+++ src/core/org/apache/hadoop/fs/FSInputChecker.java (working copy)
@@ -295,12 +295,12 @@

@Override
public synchronized long getPos() throws IOException { - return chunkPos-(count-pos); + return chunkPos-((count-pos)<0?0:count-pos); }

@Override
public synchronized int available() throws IOException { - return count-pos; + return (count-pos)<0?0:count-pos; }


Abdul Qadeer made changes - 24/Nov/08 02:16 AM
Attachment Hadoop-4012-version2.patch [ 12394526 ]
Abdul Qadeer made changes - 24/Nov/08 02:16 AM
Status In Progress [ 3 ] Patch Available [ 10002 ]
Abdul Qadeer added a comment - 25/Nov/08 04:21 AM
Looks like Hudson is stuck on my patch. Should I cancel patch and re-submit?

Abdul Qadeer added a comment - 25/Nov/08 08:28 AM
Since Hudson queue seems stuck, I am canceling this patch.

Abdul Qadeer made changes - 25/Nov/08 08:28 AM
Status Patch Available [ 10002 ] Open [ 1 ]
Hadoop QA added a comment - 26/Nov/08 12:50 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12394526/Hadoop-4012-version2.patch
against trunk revision 720162.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 9 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 Eclipse classpath. The patch retains Eclipse classpath integrity.

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3644/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3644/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3644/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3644/console

This message is automatically generated.


Abdul Qadeer made changes - 26/Nov/08 02:29 AM
Attachment Hadoop-4012-version3.patch [ 12394714 ]
Abdul Qadeer made changes - 26/Nov/08 02:30 AM
Status Open [ 1 ] Patch Available [ 10002 ]
Hadoop QA added a comment - 28/Nov/08 04:36 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12394714/Hadoop-4012-version3.patch
against trunk revision 720930.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 9 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

-1 findbugs. The patch appears to introduce 2 new Findbugs warnings.

+1 Eclipse classpath. The patch retains Eclipse classpath integrity.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3663/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3663/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3663/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3663/console

This message is automatically generated.


Abdul Qadeer added a comment - 28/Nov/08 05:45 AM
This patch fixes the warning found by findbug

Abdul Qadeer made changes - 28/Nov/08 05:45 AM
Attachment Hadoop-4012-version4.patch [ 12394873 ]
Abdul Qadeer made changes - 28/Nov/08 05:45 AM
Status Patch Available [ 10002 ] In Progress [ 3 ]
Abdul Qadeer made changes - 28/Nov/08 05:46 AM
Status In Progress [ 3 ] Patch Available [ 10002 ]
Chris Douglas made changes - 30/Nov/08 02:46 AM
Affects Version/s 0.19.0 [ 12313211 ]
Fix Version/s 0.19.1 [ 12313473 ]
Fix Version/s 0.20.0 [ 12313438 ]
Chris Douglas added a comment - 04/Dec/08 04:22 AM

The following change was done in this new patch. Before this change, getPos()
was returning values one less than what it should be. Similarly available() method
was returning -1 because the value of count becomes -1 at the end of the chunk.

Should this change be part of a separate issue, then? I'm not sure what you mean by "two of the 4 bugs", but bug fixes shouldn't be part of large, new features if the fix is unaffected by the feature.

  • This modifies TestMultipleCacheFiles to append a newline at the end of the file. Why is this necessary? Is this the same problem as HADOOP-4182?
  • Pushing the READ_MODE abstraction (and the new createInputStream) into the CompressionCodec interface, particularly when only bzip supports it, is inappropriate. If it's applicable to codecs other than bzip, it should be a separate interface (extending CompressionCodec?). This would also let instanceof replace canDecompressSplitInput and move seekBackwards to the new interface. Can you describe what it means for a codec to implement this superset of functions?
  • This patch incorporates HADOOP-4010:
    -    while (pos < end) {
    +    // We always read one extra line, which lies outside the upper
    +    // split limit i.e. (end - 1)
    +    pos = this.getPos();
    +    
    +    while (pos <= end) {
    
    +    // If this is not the first split, we always throw away first record
    +    // because we always (except the last split) read one extra line in
    +    // next() method.
    

    Shouldn't this remain with the original JIRA? Are the issues raised there addressed in this patch?

  • Does this add the Seekable interface to CompressionInputStream only to support getPos() for LineRecordReader?

This affects too many core components to make the feature freeze for 0.20 (Fri).


Chris Douglas made changes - 04/Dec/08 04:22 AM
Fix Version/s 0.20.0 [ 12313438 ]
Status Patch Available [ 10002 ] Open [ 1 ]
Abdul Qadeer added a comment - 04/Dec/08 06:04 AM

The following change was done in this new patch. Before this change, getPos()
was returning values one less than what it should be. Similarly available() method
was returning -1 because the value of count becomes -1 at the end of the chunk.

Should this change be part of a separate issue, then? I'm not sure what you mean by "two of the 4 bugs", but bug fixes shouldn't be part of large, new features if the fix is unaffected by the feature.


Hudson informed about 4 core test case failures when it ran patch version 1. Two of them were arising due to my patch, while the other two were something else which automatically got away later. The new code in LineRecordReader now depends on getPos() method of FSDataInputStream to find the current stream position instead of keeping this position record itself. And getPos() was returning -1 value at the end, when in fact it should not have gone below 0. So I changed getPos() and available() methods in src/core/org/apache/hadoop/fs/FSInputChecker.java
Since these changes were necessary for my code to work correctly, I included it here.

This modifies TestMultipleCacheFiles to append a newline at the end of the file. Why is this necessary? Is this the same problem as HADOOP-4182?


Yes it is the same issue as mentioned in Hadoop-4182.

Pushing the READ_MODE abstraction (and the new createInputStream) into the CompressionCodec interface, particularly when only bzip supports it, is inappropriate. If it's applicable to codecs other than bzip, it should be a separate interface (extending CompressionCodec?). This would also let instanceof replace canDecompressSplitInput and move seekBackwards to the new interface. Can you describe what it means for a codec to implement this superset of functions?


Amongst the currently supported codecs, I think only BZip2 can decompress block of a data. But I am not sure that BZip2 is the only one out there (ONly a compression expert can tell that which compression algorithms can decompress blocks of data without needing the whole compressed file) So READ_MODE abstration provides two modes of operation. The first one is "continous", where a codec must need the whole compressed
file to decompress successfully. The other mode is "By_Block", where codecs can decompress parts of a compressed file and whole file is not needed. Hence such codecs can be used effectively with Hadoop kind of splitting.

"canDecompressSplitInput" asks the codecs that can it decompress given a stream starting at some random position in a compressed file. Currently Hadoop assumes that compressed files must be handeled by one mapper because splitting is not possible. This assumption
is not ture, atleast in the case of BZip2.

"seekBackwards" was needed for the cases when Hadoop split start lies exactly on a BZip2 marker. These markers are identifications for the algorithm that a new block is starting. To work the bzip2 codec correctly, the stream needs to move back about as much as the size of the Marker. Ideally this should be done in codec class but buffering at different level, made it difficult to move the stream backwards sitting in the
CBZip2InputStream. So "seekBackwards" means that in LineRecordReader, Hadoop asks the codecs that does it need to move back the stream.

Making a separate interface for BZip2 style of codec is another possibility. I had two options, that either to change the current interfaces or make a new one. I took the first option leaving the final decision to the Hadoop committers. But changing it as suggested would mean a lot of changes and testing for me. But if committers think that, it must be changed and can not be accepted like this then I will change it as suggested.

This patch incorporates HADOOP-4010: Shouldn't this remain with the original JIRA? Are the issues raised there addressed in this patch?


Yes the code is the same as submitted in Hadoop-4010. I think the issues mentioned there does not affect the patch and it looks fine. I posted it there but no one later commented on it. So i had to include it here as well for my patch to work.

Does this add the Seekable interface to CompressionInputStream only to support getPos() for LineRecordReader?


Yes that is right.

This affects too many core components to make the feature freeze for 0.20 (Fri)


I think if release 0.20 is planned soon, we can change its effected version to next one.


Chris Douglas added a comment - 09/Jan/09 02:14 AM

Hudson informed about 4 core test case failures when it ran patch version 1. Two of them were arising due to my patch, while the other two were something else which automatically got away later. [snip]

Ah, I see. Sorry, I hadn't understood.

Yes the code is the same as submitted in Hadoop-4010. I think the issues mentioned there does not affect the patch and it looks fine. I posted it there but no one later commented on it. So i had to include it here as well for my patch to work.

Marking this issue as depending on or blocked by the related issues, and raising them as PA again, is better than creating a composite patch. If you feel you've addressed the points made by the commenter, it's appropriate to resubmit. There's a lot of traffic, and some responses will inevitably be overlooked.

Amongst the currently supported codecs, I think only BZip2 can decompress block of a data. But I am not sure that BZip2 is the only one out there

LZO used to be a counterexample, but it has since been removed (HADOOP-4874, also why the current patch no longer applies). It doesn't support seeking from arbitrary offsets, though. In any case: even recognizing the development overhead, adding the new methods to the existing interfaces seems like the more invasive approach, particularly since bzip is the only implementer. Given that LineRecordReader, NLineInputFormat, LineReader, and FSInputChecker are all heavily used, the prospect of accidentally destabilizing them to support an optional piece of functionality is unnerving. Are there any specific reasons why the existing approach is preferred? If one were to compress binary data with the bzip codec, would it be splittable and/or readable?

Thanks for your patience with this.


Zheng Shao added a comment - 09/Jan/09 04:05 AM
I like the idea of avoiding seeking back 1 byte in LineRecordReader. With this change, each mapper will usually open only 2 data blocks, while the old approach makes each mapper open 3 data blocks. Considering all the costs about it (opening TCP connection to the remote data node, disk seek to read 1 byte, etc) this seems to be a reasonable improvement.

Abdul Qadeer added a comment - 09/Jan/09 03:39 PM
Chris Douglas:

I understand that any change to e.g. LineRecordReader (LRR) should be carried out with care because it the most frequently used code. It might not have been impossible to implement bzip2 codecs with the older implementation of LRR but with performance penalty for bzip2 (because it then had to decompress one more block per mapper) and the code in the LRR constructor and the next() method would have been a mess.

HADOOP-4010 was separately created before finishing bzip2 work but unfortunately that couldn't get in. Owen O'Malley mentioned some concerns , to which I provided some logical hand waiving. I guess Owen has been busy or is not convinced with those arguments but he didn't said anything. I plan to ask him again that should I provide some kind of test cases just to make sure that his concerns are addressed.

So going systematically I first plan to get 4010 approved and then move on by doing recommended changes. I have successfully used the bzip2 codecs for a binary compressed bzip2 files. So I hope by going step by step, I will be able to get this work approved.


Abdul Qadeer added a comment - 31/Mar/09 02:39 PM
This patch incorporates changes suggested by Chris Douglas.

Abdul Qadeer made changes - 31/Mar/09 02:39 PM
Attachment Hadoop-4012-version5.patch [ 12404241 ]
Suhas Gogate made changes - 31/Mar/09 07:35 PM
Link This issue relates to HADOOP-5601 [ HADOOP-5601 ]
Suhas Gogate added a comment - 31/Mar/09 07:35 PM
support for concatenated bzip2 compressed file. Throw error until this new feature supported to let user know about incompatible input.

Suhas Gogate made changes - 31/Mar/09 07:35 PM
Link This issue relates to HADOOP-5602 [ HADOOP-5602 ]
Abdul Qadeer added a comment - 02/Apr/09 01:09 PM
Now BZip2 can be invoked in either of the two modes;

CONTINOUS
BYBLOCK

BYBLOCK is the mode which makes it possible to work with Hadoop Split input. And it should work fine for concatenation of many BZip2 compressed files as well. Otherwise in normal old mode (i.e CONTINOUS) its behavior is the same as it used to be. Currently the default mode is set to BYBLOCK. So problem mentioned in HADOOP-5601 should be solved.


Abdul Qadeer made changes - 02/Apr/09 01:28 PM
Fix Version/s 0.19.2 [ 12313650 ]
Affects Version/s 0.19.2 [ 12313650 ]
Release Note Support to process Hadoop split BZip2 files.
Status Open [ 1 ] Patch Available [ 10002 ]
Hadoop QA added a comment - 02/Apr/09 02:09 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12404241/Hadoop-4012-version5.patch
against trunk revision 761082.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 3 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

-1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

+1 Eclipse classpath. The patch retains Eclipse classpath integrity.

-1 release audit. The applied patch generated 663 release audit warnings (more than the trunk's current 661 warnings).

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/100/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/100/artifact/trunk/current/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/100/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/100/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/100/console

This message is automatically generated.


Abdul Qadeer added a comment - 06/Apr/09 11:09 AM
Hudson's complaints about findbugs and release audit warnings are taken care of in this new patch.

Abdul Qadeer made changes - 06/Apr/09 11:09 AM
Attachment Hadoop-4012-version6.patch [ 12404711 ]
Abdul Qadeer made changes - 06/Apr/09 11:15 AM
Status Patch Available [ 10002 ] In Progress [ 3 ]
Abdul Qadeer made changes - 06/Apr/09 11:16 AM
Status In Progress [ 3 ] Patch Available [ 10002 ]
Hadoop QA added a comment - 06/Apr/09 07:00 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12404711/Hadoop-4012-version6.patch
against trunk revision 762216.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 3 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 Eclipse classpath. The patch retains Eclipse classpath integrity.

-1 release audit. The applied patch generated 662 release audit warnings (more than the trunk's current 659 warnings).

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/154/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/154/artifact/trunk/current/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/154/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/154/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/154/console

This message is automatically generated.


Abdul Qadeer made changes - 16/Apr/09 11:31 AM
Link This issue depends on HADOOP-5213 [ HADOOP-5213 ]
Abdul Qadeer added a comment - 16/Apr/09 11:31 AM
One of my test case, which makes different length data, compresses and decompresses at different split points depends on HADOOP-5213. HADOOP-5213's patch is currently waiting Hudson run. As HADOOP-5213 is committed, I will upload the new patch for HADOOP-4012

Abdul Qadeer made changes - 14/May/09 12:27 PM
Attachment Hadoop-4012-version7.patch [ 12408129 ]
Abdul Qadeer made changes - 15/May/09 06:22 AM
Status Patch Available [ 10002 ] In Progress [ 3 ]
Abdul Qadeer added a comment - 15/May/09 06:23 AM
Patch was no appearing in Hudson queue so I am trying to re-submit the patch.

Abdul Qadeer made changes - 15/May/09 06:23 AM
Status In Progress [ 3 ] Patch Available [ 10002 ]
Hadoop QA added a comment - 19/May/09 11:46 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12408129/Hadoop-4012-version7.patch
against trunk revision 776148.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 3 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

-1 findbugs. The patch appears to introduce 3 new Findbugs warnings.

+1 Eclipse classpath. The patch retains Eclipse classpath integrity.

-1 release audit. The applied patch generated 494 release audit warnings (more than the trunk's current 486 warnings).

-1 core tests. The patch failed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/356/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/356/artifact/trunk/current/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/356/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/356/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/356/console

This message is automatically generated.


Abdul Qadeer added a comment - 20/May/09 03:43 PM
Looks like there is some problem with Hudson, atleast as far as core and contrib test are concerned. If you see Hudson queue (http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/) you will see that many JIRAs are ending up with the same 4 or 5 test case failures. e.g.

http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/362/#showFailuresLink

http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/361/#showFailuresLink

http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/360/#showFailuresLink

http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/358/#showFailuresLink

Similarly though Hudson complains about test failures but does not list the failed tests (like in this JIRA's case and there are many others)

I will test my patch on a local box and will post the results here.


Johan Oskarsson made changes - 22/May/09 01:49 PM
Fix Version/s 0.19.2 [ 12313650 ]
Fix Version/s 0.21.0 [ 12313563 ]
Abdul Qadeer made changes - 27/May/09 06:19 AM
Affects Version/s 0.19.2 [ 12313650 ]
Affects Version/s 0.21.0 [ 12313563 ]
Status Patch Available [ 10002 ] In Progress [ 3 ]
Abdul Qadeer added a comment - 27/May/09 06:20 AM
This patch fixes audit release and find bug warnings.

Abdul Qadeer made changes - 27/May/09 06:20 AM
Attachment Hadoop-4012-version8.patch [ 12409127 ]
Abdul Qadeer made changes - 27/May/09 06:21 AM
Status In Progress [ 3 ] Patch Available [ 10002 ]
Hadoop QA added a comment - 29/May/09 06:21 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12409127/Hadoop-4012-version8.patch
against trunk revision 779807.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 3 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 Eclipse classpath. The patch retains Eclipse classpath integrity.

-1 release audit. The applied patch generated 501 release audit warnings (more than the trunk's current 493 warnings).

-1 core tests. The patch failed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/422/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/422/artifact/trunk/current/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/422/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/422/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/422/console

This message is automatically generated.


Yuri Pradkin added a comment - 29/May/09 04:47 PM
OK, it looks like this patch is in reasonable shape now. The failing tests seem to be the ones failing for everybody else.
We've been using an older version of the patch for some time now with a custom input format and have had consistent results.
Not to mention the speedup due to parallelism.

Can one of the committers please have a look?


Abdul Qadeer added a comment - 01/Jun/09 11:08 AM
1) Patch to fix audit release warnings.

Abdul Qadeer made changes - 01/Jun/09 11:08 AM
Attachment Hadoop-4012-version9.patch [ 12409554 ]
Abdul Qadeer made changes - 02/Jun/09 08:55 AM
Status Patch Available [ 10002 ] In Progress [ 3 ]
Abdul Qadeer made changes - 02/Jun/09 08:55 AM
Status In Progress [ 3 ] Patch Available [ 10002 ]
Hadoop QA added a comment - 04/Jun/09 08:43 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12409554/Hadoop-4012-version9.patch
against trunk revision 781602.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 3 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 Eclipse classpath. The patch retains Eclipse classpath integrity.

-1 release audit. The applied patch generated 500 release audit warnings (more than the trunk's current 492 warnings).

-1 core tests. The patch failed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/459/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/459/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/459/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/459/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/459/console

This message is automatically generated.


Abdul Qadeer added a comment - 04/Jun/09 10:28 AM
(1) Running test-patch target on my local box does not produce release audit warnings:

[exec]
[exec] There appear to be 503 release audit warnings before the patch and 501 release audit warnings after applying the patch.
[exec]
[exec]
[exec]
[exec]
[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 3 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
[exec]
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
[exec]

(2) Test org.apache.hadoop.hdfs.server.datanode.TestBlockReplacement.testBlockReplacement does not fail on my local box and
org.apache.hadoop.mapred.TestQueueCapacities.testMultipleQueues produces an error (But this is the same even for the committed SVN Code).

These 1 or 2 errors are occurring for other patches as well on Hudson. Additionally they are un-related as far as bzip2 is concerned.

Can the Hadoop committers please review the patch so that I can complete this work. Thank you.


Chris Douglas added a comment - 17/Jun/09 12:23 AM
  • If an IOException is ignored:
    +    if (!(in instanceof Seekable) || !(in instanceof PositionedReadable)) {
    +      try {
    +        this.maxAvailableData = in.available();
    +      } catch (IOException e) {
    +      }
    +    }
    

    Please include a comment explaining why it should be ignored. Won't this fail in the first call to getPos() if it doesn't throw in the cstr?

  • This change to TestCodec looks suspect:
    -    SequenceFile.Reader reader = new SequenceFile.Reader(fs, filePath, conf);
    -    
    +    SequenceFile.Reader reader = null;
    +    try{
    +    reader = new SequenceFile.Reader(fs, filePath, conf);
    +    }
    +    catch(Exception exp){}
    
  • Instead of the tenary operatior in FSInputChecker, Math.max(0L, count - pos) is more readable.
  • With changes such as the following, will the case resolved in HADOOP-3144 continue to work in LineRecordReader?
    -      int newSize = in.readLine(value, maxLineLength,
    -                                Math.max((int)Math.min(Integer.MAX_VALUE, end-pos),
    -                                         maxLineLength));
    +      int newSize = lineReader.readLine(value, maxLineLength, Integer.MAX_VALUE);
    
  • Specifying the constant as 1L should avoid this:
    -    return (bsBuffShadow >> (bsLiveShadow - n)) & ((1 << n) - 1);
    +    final long one = 1;
    +    return (bsBuffShadow >> (bsLiveShadow - n)) & ((one << n) - 1);
    
  • The InputStreamCreationResultant struct seems unnecessary, but perhaps I'm not following the logic there. Of the three auxiliary params returned, end is unmodified from the actual and areLimitsChanged only avoids a redundant assignment to start. Since CompressionInputStream now implements Seekable, can this be replaced with start = in.getPos()?
  • CBZip2InputStream::read() shouldn't allocate a new byte[] for every read, but reuse an instance var. (0xFF also works, btw)

Chris Douglas made changes - 17/Jun/09 09:48 PM
Status Patch Available [ 10002 ] Open [ 1 ]
Abdul Qadeer added a comment - 23/Jun/09 09:40 AM

The InputStreamCreationResultant struct seems unnecessary, but perhaps I'm not following the logic there. Of the three auxiliary params returned, end is unmodified from the actual and areLimitsChanged only avoids a redundant assignment to start. Since CompressionInputStream now implements Seekable, can this be replaced with start = in.getPos()?


(1) I can not use start = getPos() in the LineRecordReader's constructor. The reason is that, for the BZip2 compressed data getPos() does not return the actual stream value. It is manipulated such a way in BZip2Codecs that, this hack works with LineRecordReader's start / end / pos stuff. On the other hand I need to do stuff (e.g. throwing away a line if it is not the first split) which required accurate value of start. So two pieces of information from the method createInputStream(...) were required:

a) the new value of start
b) the input stream

The parameter "end" was also passed just in case some other future codec had the flexibility to change start or end or both.

So I made that InputStreamCreationResultant because I wanted to return more than one things. To avoid making a new class, I can use some array which has inputstream on its 0th location and then the parameter start. But that will not be type safe.

Another option might be to add another method in the SplitEnabledCompressionCodec to ask for the changed start value. So I will need to call createInputStream(...) and then call the new method e.g. getStart(). But the semantics of such a call were not neat and clear.

Do we have any other option to avoid this "InputStreamCreationResultant"?


(2) I have fixed the rest of the comments. Once this point is clear, I will prepare the final patch.


Yuri Pradkin added a comment - 10/Jul/09 06:43 PM
Abdul, I think at this point it's OK to assume that Chris has accepted your explanation. Can you please post your latest patch with all the fixes. Beware that repository has changed: you'll need to do switch to
http://svn.apache.org/repos/asf/hadoop/common/trunk/ or checkout a fresh copy and reapply your patch.

Guys, can we please wrap this up? It's taking way too long for this patch to make it. Please. Thanks.


Owen O'Malley added a comment - 10/Jul/09 08:27 PM
-1 to changing LineRecordReader. In particular, you've undone changes that were made by other jiras. This is very very touchy code that the current version of this patch breaks.

I really think that this is better done by using a separate BzipTextInputFormat and BzipLineRecordReader.


Yuri Pradkin added a comment - 10/Jul/09 08:56 PM

-1 to changing LineRecordReader. In particular, you've undone changes that were made by other jiras. This is very very touchy code that the current version of this patch breaks.

Can you please be more specific: what changes were undone? Can you please elaborate on what exactly this patch breaks (perhaps we need more tests?)?

The code indeed is very heavily used, but I think it's also very well beaten upon by various tests.

I really think that this is better done by using a separate BzipTextInputFormat and BzipLineRecordReader.

I think the point of having splitting built in - is that all readers/formats can avoid re-implementing common things. We are currently using a binary format reader that works just fine with this patch with only minor changes. Moreover, the idea is to work out a common framework where other block-compressed formats could be processed in a similar manner. The alternative that you're suggesting is to have BzipTextInputFormat, XXXBlockCompressInputFormat, YYYBlockCompressInputFormat, and so on.


Owen O'Malley added a comment - 10/Jul/09 09:26 PM

Can you please be more specific: what changes were undone?

It was the feature from HADOOP-3144 that was dropped by mistake. But that is just indicative of the problem. LineRecordReader is very delicate code that is very easy to break. Changes to this code are hard to verify and it is used by many many Hadoop applications.

I think the point of having splitting built in - is that all readers/formats can avoid re-implementing common things.

This patch only seems to support TextInputFormat and Bzip. I don't see abstractions that would make this portable to other codecs or input formats. Maybe combining your binary format with BzipTextInputFormat makes sense? For binary files, SequenceFiles is already splittable with bzip and no changes.


Abdul Qadeer added a comment - 11/Jul/09 10:17 AM
I totally agree with Owen that LineRecordReader (LRR) is a heavily used code in Hadoop and changes to such a code should be very carefully done. But that doesn't mean that, LRR code is closed for any improvements and feature enhancements. This LRR deals with Text data and Hadoop uses it as a default because probably Text is the most frequently used kind on input. I think same is true for BZip2 compressed files. If not all, most of them are Text compressed data. This patch provides the following feature to the end user.

User puts his BZip2 compressed text files in an input directory and submits the Hadoop job. Soon he gets the result in the output directory. That is it!

He hadn't had to write or mention

any input format
any record reader
And all this happens using full cluster CPU power (due to BZip2 splitting).
Also our algorithm doesn't demand any specific kind of splits from Hadoop. It works with whatever splits provided to it.
.............................................................................

Now comes the correctness of LRR code.

testFormat() test case in org.apache.hadoop.mapred.TestTextInputFormat is a very stringent test case to ensure the correctness of LRR. And since this test case is frequently run whenever someone submits and tests a patch, so I dont think that any correctness problem in LRR can escape.

Now comes the things like HADOOP-3114. That was my mistake as mentioned by Chris also in his comments. I have corrected that and will upload the new patch soon after running the tests locally. I think these are the places where the vision and knowledge of Hadoop commmiters comes handy. If you see that other than HADOOP-3114, I have missed something else, please tell me that and I will fix it. Additionally Hadoop has a very active user community and any latent bug in the code (especially the one used heavily) can not hide for long. Also the serious businesses using Hadoop usually use only the stable version (e.g I think Yahoo yet uses 0.18? while 0.20 is out).
So I see a safe transition to the changed LRR code.

........................................................
Now comes the point of making a new BZip2 specific input format / record reader.

If we make a separate reader for BZip2 e.g BZip2LineRecordReader, in that class most of the code will be the replica of LRR. And doing that way, GZip support should also come out of LRR and there will be a GZipLineRecordReader, again mostly the code a replica of LRR. Whenever there comes a new codec in Hadoop we will make a new reader whose 99% code is the replica of LRR. So in my view it makes more sense to handle Line related text (be it plain or compressed) in LRR.

When we were adding splitting support for BZip2, we felt that there might be situations when codecs want to change/manipulate the split start or end. So we added support for that. Similarly instead LRR counting read bytes itself, it now asks the stream about the position. This feature makes it possible for a codec to manipulate indirectly that how long a reader should keep on reading. We think these features can help for other block based codecs to do splitting easily without further changes in LRR.
..............................................................

So in summary this patch might need relatively stringent reviewing by the committers due to its heavy usage but it does add useful functionality in a seamless way for the end user.


Abdul Qadeer added a comment - 13/Jul/09 05:34 PM
I am making a new patch to accommodate concerns raised by Chris and Owen. Few files are in the "common" sub-project while the rest are in the "mapreduce". Since the older Hadoop project has been split into 3 sub-projects, how should I make the new patch?

Should I make two patches or merge them to make one?


Abdul Qadeer made changes - 04/Aug/09 10:19 AM
Status Open [ 1 ] In Progress [ 3 ]
Abdul Qadeer added a comment - 04/Aug/09 10:23 AM
The older patch has been split into two (due to the new Hadoop sub-projects: common, mapreduce and HDFS). This patch has the work related to common while the portion going into MapReduce will be added soon on a separate MapReduce JIRA.

Abdul Qadeer made changes - 04/Aug/09 10:23 AM
Attachment Hadoop-4012-version10.patch [ 12415479 ]
Abdul Qadeer made changes - 04/Aug/09 10:24 AM
Status In Progress [ 3 ] Patch Available [ 10002 ]
Hadoop QA added a comment - 06/Aug/09 05:56 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12415479/Hadoop-4012-version10.patch
against trunk revision 800919.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 3 new or modified tests.

-1 javadoc. The javadoc tool appears to have generated 1 warning messages.

-1 javac. The applied patch generated 118 javac compiler warnings (more than the trunk's current 116 warnings).

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/591/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/591/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/591/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/591/console

This message is automatically generated.


Abdul Qadeer added a comment - 06/Aug/09 05:49 PM
Fixes javadoc and javac warnings.

Abdul Qadeer made changes - 06/Aug/09 05:49 PM
Attachment Hadoop-4012-version11.patch [ 12415766 ]
Abdul Qadeer made changes - 06/Aug/09 07:41 PM
Link This issue blocks MAPREDUCE-830 [ MAPREDUCE-830 ]
Abdul Qadeer added a comment - 06/Aug/09 07:51 PM
The MapReduce related stuff from the original HADOOP-4012 has been shifted to MapReduce-830 (https://issues.apache.org/jira/browse/MAPREDUCE-830). It was necessary because of the recent forking of Hadoop into three sub-projects. MapReduce-830 will only compile once this HADOOP-4012 commits.

Abdul Qadeer made changes - 08/Aug/09 11:00 AM
Status Patch Available [ 10002 ] In Progress [ 3 ]
Abdul Qadeer added a comment - 08/Aug/09 11:02 AM
Re-submitting because the patch is not appearing in the Hudson queue.

Abdul Qadeer made changes - 08/Aug/09 11:02 AM
Status In Progress [ 3 ] Patch Available [ 10002 ]
Amr Awadallah added a comment - 08/Aug/09 11:05 AM
I am out of office until Aug 14th. I will be checking my email
intermittently. If this is urgent then please call my cell phone,
otherwise I will reply to your email when I get back.

Thanks for your patience,

– amr


Hadoop QA added a comment - 08/Aug/09 01:16 PM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12415766/Hadoop-4012-version11.patch
against trunk revision 802224.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 3 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/594/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/594/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/594/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/594/console

This message is automatically generated.


Abdul Qadeer added a comment - 17/Aug/09 05:09 PM
This is a reminder for the Hadoop committers that HADOOP-4012 is ready to be reviewed.

Yuri Pradkin added a comment - 28/Aug/09 11:41 PM
Gentle Hadoop folk,

may I ask what we are waiting for while not committing this patch? If somebody is unhappy with this patch, it would be good to know their concern(s); likewise, if there are some internal discussions going on, it would be wonderful if somebody occasionally filled us in on that. This patch has been in the queue for a long time and now that Hudson is finally happy with it (and I know for various reasons it is hard to satisfy can somebody take charge and check it in? If you are unwilling to check it in, please explain!

I think by delaying it, we're risking running into conflicts yet again and creating unnecessary work for ourselves.

Thank you!


Chris Douglas added a comment - 29/Aug/09 01:39 AM
Fair point.

I will try to complete a review this weekend.


John Heidemann added a comment - 29/Aug/09 03:33 AM
Thanks!

Chris Douglas added a comment - 30/Aug/09 09:47 PM
This is what we were looking for. Thank you.
  • CBZip2InputStream::skipToNextMarker should add IOException (for bsR) to its throws list, not Exception, and throw IllegalArgumentException if markerBitLength is larger than 63. This permits the removal of the empty catch clauses in calls to this method. The javadoc should include a description of this behavior, and note that marker is the EOB delimiter (w/ @param javadoc directives). The javadoc may be self-evident, but it's a public API.
  • updateProcessedByteCount and updateReportedByteCount are public methods, but they seem very, very specialized. The comment explains that the client may manipulate the compressed stream, but any client sophisticated enough to do that is likely bound to this class. Would making these methods protected make sense?
  • This method:
    +  private static void reportCRCError() throws IOException {
    +    throw new IOException("crc error");
    +  }
    

    Seems unnecessary. If the intent is to add a hook, then the method should be protected and non-static. Otherwise, there's no reason not to simply throw the IOE at its callers.

  • The API issue with InputStreamCreationResultant is clearer to me, now (I hadn't seen why a reader might need a modified start). Other than synchronizing on the codec before creating new streams (to avoid the race condition), I don't see a better way to do this without pushing other API changes. Unless someone has a better idea, I think documenting this requirement on SplittableCompressionCodec is sufficient for now (and making these methods synchronized in BZip2Codec).

Chris Douglas made changes - 30/Aug/09 09:47 PM
Status Patch Available [ 10002 ] Open [ 1 ]
Abdul Qadeer added a comment - 01/Sep/09 11:20 AM

The API issue with InputStreamCreationResultant is clearer to me, now (I hadn't seen why a reader might need a modified start). Other than synchronizing on the codec before creating new streams (to avoid the race condition), I don't see a better way to do this without pushing other API changes. Unless someone has a better idea, I think documenting this requirement on SplittableCompressionCodec is sufficient for now (and making these methods synchronized in BZip2Codec).

From https://issues.apache.org/jira/browse/MAPREDUCE-830
Though it's not changed in bzip, since getEnd is part of the API, it should be called in LineRecordReader.
Since the codec has state, the API demands that LineRecordReader synchronize on the codec before creating a splittable stream and calling getStart and getEnd to avoid race conditions (unless a better solution is found in HADOOP-4012)

Does the code in LineRecordReader is executed by multiple threads? I found some discussion about it here http://issues.apache.org/jira/browse/HADOOP-3554 but that discussion probably didn't conclude.

Even if it is run by multiple threads, I am not able to see the race conditions because start / end is changed only once in the constructor (I am assuming that LineRecordReader constructor is not called by multiple threads simultaneously)

The only problem I see is that LineReacordReader should not forget to call getStart() method after calling createInputStream() method. Or saying another way, any one using SplittableCompressionCodec must call getStart() / get End() methods after creating the stream.

So I am little confused about the race condition comments. Can you please help me understand?

Thanks.


Chris Douglas added a comment - 01/Sep/09 10:09 PM
True, there isn't threading issue with LineRecordReader. However, this is the only codec with state, SplittableCompressionCodec won't just be used in LineRecordReader and BZip2Codec may not be the last splittable codec. Though I don't think most codecs are accessed by multiple threads, synchronization is cheap, and the first user to implement a splittable reader with a shared codec will almost certainly cargo-cult this implementation and miss the race condition. If you'd rather leave it this way, then OK.

There was another part of the patch, in BZip2Codec::createInputStream:

+    if(in.getPos() <= start){
+      ((Seekable)seekableIn).seek(start);
+        in = this.createInputStream(seekableIn, readMode);
+        
+
+    }

This drops the stream it created above, which wrapped the stream passed in with a CBZip2InputStream and BufferedInputStream. It's not clear why the stream is being re-created, either... particularly since the start stored in the codec is left alone. What case is being handled here?


Chris Douglas added a comment - 02/Sep/09 12:06 AM
I tried a version of this using a supertype of CompressionInputStream instead of the semantics tried so far (voiding the synchronization discussion). It doesn't incorporate the other changes discussed.

Chris Douglas made changes - 02/Sep/09 12:06 AM
Attachment C4012-12.patch [ 12418320 ]
Abdul Qadeer added a comment - 02/Sep/09 05:59 PM
+    if(in.getPos() <= start){
+      ((Seekable)seekableIn).seek(start);
+        in = this.createInputStream(seekableIn, readMode);
+        
+
+    }

This drops the stream it created above, which wrapped the stream passed in with a CBZip2InputStream and BufferedInputStream. It's not clear why the stream is being re-created, either... particularly since the start stored in the codec is left alone. What case is being handled here?

The reason to re-create the stream for the case when in.getPos() <= start is to tackle the cases like the following:

Assume [BBBBBB] represents a BZip2 maker and d is a single compressed data element (this can happen
e.g. due to BZip2 concatenation)

There is some extra information at the start of stream i.e. BZ0h

^ indicates where currently the stream is:

[BZh0BBBBBB]d[BBBBBB]d[BBBBBB]d[BBBBBB]
_______________________________ ^

I go back 10 bytes in the stream before finding a marker.  The reason
is that the first 'maker' is 10 bytes long, all others are 6 bytes long.

So after going backwards the stream position is as follows:

[BZh0BBBBBB]d[BBBBBB]d[BBBBBB]d[BBBBBB]
__________________ ^

Now finding next marker might align us with the wrong marker as follows:

[BZh0BBBBBB]d[BBBBBB]d[BBBBBB]d[BBBBBB]
______________________ ^

So for such cases the code mentioned above works. But you rightly mentioned that I should have done this.start = start at the end of above code as well.

I tried a version of this using a supertype of CompressionInputStream instead of the semantics tried so far (voiding the synchronization discussion). It doesn't incorporate the other changes discussed.

The new version looks fine to me. Let me incorporate the other changes you mentioned in it and to put the new patch on the JIRA


John Heidemann added a comment - 02/Sep/09 08:13 PM
Abdul, a comment explaining that case (or summarizing it and pointing at the JIRA) seems important, rather than just creating the stream with no motiviation.

Abdul Qadeer added a comment - 03/Sep/09 09:04 AM
All the issues discussed have been resolved and merged into Chris' changes in C4012-12.patch except making
BZip2InputStream::updateReportedByteCount (...) method protected. This method is called from

private static class BZip2CompressionInputStream extends SplitCompressionInputStream

So I have left it public.

The corresponding patch in MAPREDUCE-830 doesn't need any changes and is current.


Abdul Qadeer made changes - 03/Sep/09 09:04 AM
Attachment C4012-13.patch [ 12418490 ]
Abdul Qadeer made changes - 05/Sep/09 06:16 AM
Status Open [ 1 ] Patch Available [ 10002 ]
Hadoop QA added a comment - 05/Sep/09 06:37 AM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12418490/C4012-13.patch
against trunk revision 811572.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 3 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/18/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/18/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/18/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/18/console

This message is automatically generated.


Chris Douglas added a comment - 08/Sep/09 05:13 AM
Added a unit test for LineReader/SplittableCompressionCodec to TestCodec.

Chris Douglas made changes - 08/Sep/09 05:13 AM
Attachment C4012-14.patch [ 12418879 ]
Chris Douglas made changes - 08/Sep/09 05:13 AM
Status Patch Available [ 10002 ] Open [ 1 ]
Chris Douglas made changes - 08/Sep/09 05:13 AM
Status Open [ 1 ] Patch Available [ 10002 ]
Chris Douglas added a comment - 08/Sep/09 05:21 AM
It should be noted that "Spli tt ableCompressionCodec" chooses a different spelling than FileInputFormat::isSpli t able. Popular search engines return more results for the former than the latter, but if anyone cares about consistency on this point then now's the time to object.

Hadoop QA added a comment - 08/Sep/09 05:35 AM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12418879/C4012-14.patch
against trunk revision 812317.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 3 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/23/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/23/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/23/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/23/console

This message is automatically generated.


Repository Revision Date User Message
ASF #813581 Thu Sep 10 20:51:48 UTC 2009 cdouglas HADOOP-4012. Provide splitting support for bzip2 compressed files. Contributed by Abdul Qadeer
Files Changed
MODIFY /hadoop/common/trunk/CHANGES.txt
MODIFY /hadoop/common/trunk/src/java/org/apache/hadoop/io/compress/bzip2/BZip2Constants.java
MODIFY /hadoop/common/trunk/src/java/org/apache/hadoop/io/compress/BlockDecompressorStream.java
MODIFY /hadoop/common/trunk/src/java/org/apache/hadoop/io/compress/DecompressorStream.java
ADD /hadoop/common/trunk/src/java/org/apache/hadoop/io/compress/SplittableCompressionCodec.java
MODIFY /hadoop/common/trunk/src/test/core/org/apache/hadoop/io/compress/TestCodec.java
MODIFY /hadoop/common/trunk/src/java/org/apache/hadoop/io/compress/CompressionInputStream.java
MODIFY /hadoop/common/trunk/src/java/org/apache/hadoop/io/compress/BZip2Codec.java
MODIFY /hadoop/common/trunk/src/java/org/apache/hadoop/fs/FSInputChecker.java
ADD /hadoop/common/trunk/src/java/org/apache/hadoop/io/compress/SplitCompressionInputStream.java
MODIFY /hadoop/common/trunk/src/java/org/apache/hadoop/io/compress/bzip2/CBZip2InputStream.java
MODIFY /hadoop/common/trunk/src/java/org/apache/hadoop/io/compress/GzipCodec.java

Chris Douglas added a comment - 10/Sep/09 08:52 PM
+1

I committed this. Thanks, Abdul!


Chris Douglas made changes - 10/Sep/09 08:52 PM
Resolution Fixed [ 1 ]
Hadoop Flags [Reviewed]
Status Patch Available [ 10002 ] Resolved [ 5 ]
Repository Revision Date User Message
ASF #813585 Thu Sep 10 20:58:34 UTC 2009 cdouglas HADOOP-4012. Provide splitting support for bzip2 compressed files. Contributed by Abdul Qadeer
Files Changed
MODIFY /hadoop/mapreduce/trunk/lib/hadoop-core-0.21.0-dev.jar
MODIFY /hadoop/mapreduce/trunk/lib/hadoop-core-test-0.21.0-dev.jar

Repository Revision Date User Message
ASF #813587 Thu Sep 10 20:58:49 UTC 2009 cdouglas HADOOP-4012. Provide splitting support for bzip2 compressed files. Contributed by Abdul Qadeer
Files Changed
MODIFY /hadoop/hdfs/trunk/lib/hadoop-core-test-0.21.0-dev.jar
MODIFY /hadoop/hdfs/trunk/lib/hadoop-core-0.21.0-dev.jar

Hudson added a comment - 10/Sep/09 09:09 PM
Integrated in Hadoop-Common-trunk-Commit #25 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk-Commit/25/)
. Provide splitting support for bzip2 compressed files. Contributed by Abdul Qadeer

Hudson added a comment - 10/Sep/09 09:31 PM
Integrated in Hadoop-Mapreduce-trunk-Commit #29 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/29/)
. Provide splitting support for bzip2 compressed files. Contributed by Abdul Qadeer

Hudson added a comment - 10/Sep/09 09:58 PM
Integrated in Hadoop-Hdfs-trunk-Commit #26 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/26/)
. Provide splitting support for bzip2 compressed files. Contributed by Abdul Qadeer

Hudson added a comment - 11/Sep/09 11:11 AM
Integrated in Hadoop-Common-trunk #92 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/92/)
. Provide splitting support for bzip2 compressed files. Contributed by Abdul Qadeer

Hudson added a comment - 11/Sep/09 01:53 PM
Integrated in Hadoop-Hdfs-trunk #80 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/80/)
. Provide splitting support for bzip2 compressed files. Contributed by Abdul Qadeer

Hudson added a comment - 14/Sep/09 09:05 PM
Integrated in Hdfs-Patch-h5.grid.sp2.yahoo.net #26 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/26/)

Hudson added a comment - 15/Sep/09 02:25 AM
Integrated in Hdfs-Patch-h2.grid.sp2.yahoo.net #6 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/6/)
. Provide splitting support for bzip2 compressed files. Contributed by Abdul Qadeer

Robert Chansler added a comment - 08/Oct/09 04:04 AM
Editorial pass over all release notes prior to publication of 0.21.

Robert Chansler made changes - 08/Oct/09 04:04 AM
Release Note Support to process Hadoop split BZip2 files. BZip2 files can now be split.