Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Later
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: io
    • Labels:
      None
    • Target Version/s:
    • Release Note:
      Make Gzipped input splittable by offering a tradeoff between "Spent resources" and "Wall clock time"

      Description

      Files compressed with the gzip codec are not splittable due to the nature of the codec.
      This limits the options you have scaling out when reading large gzipped input files.

      Gunzipping a 1 GiB file usually takes only 2 minutes, so I figured that for some use cases wasting some resources may result in a shorter job time.
      Reading the entire input file from the start for each split (wasting resources!) may therefore lead to additional scalability: several tasks can work on one large gzipped file in parallel.
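      A minimal sketch of this tradeoff, assuming plain java.util.zip rather than the Hadoop codec classes. It is not the code from the attached patches: the class and method names are hypothetical, and splits are expressed as uncompressed byte offsets for simplicity. Every split decompresses from the very first byte and throws away everything before its own start.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipSplitSketch {

    /** Gzip-compresses a string; only used to build demo input. */
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    /**
     * Returns the uncompressed bytes in [start, end). The stream is always
     * decompressed from the beginning; bytes before 'start' are discarded.
     * That discarded work is the "wasted resources" part of the tradeoff.
     */
    public static String readSplit(byte[] gzipped, long start, long end) throws IOException {
        ByteArrayOutputStream kept = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            long position = 0;
            int b;
            while (position < end && (b = in.read()) != -1) {
                if (position >= start) {
                    kept.write(b);
                }
                position++;
            }
        }
        return new String(kept.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = gzip("aaaa\nbbbb\ncccc\ndddd\n");
        // Two "splits" of 10 uncompressed bytes each, read independently.
        System.out.print(readSplit(data, 0, 10));
        System.out.print(readSplit(data, 10, 20));
    }
}
```

      With N splits the file is decompressed N times in total, but the splits can run in parallel, so wall clock time can drop while total CPU time goes up.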

      1. HADOOP-7076-2011-12-09-branch-0.22.patch (40 kB, Niels Basjes)
      2. HADOOP-7076-2011-12-09.patch (41 kB, Niels Basjes)
      3. HADOOP-7076-branch-0.22.patch (40 kB, Niels Basjes)
      4. HADOOP-7076-2011-12-04-2332.patch (40 kB, Niels Basjes)
      5. HADOOP-7076-2011-08-05-2315.patch (43 kB, Niels Basjes)
      6. HADOOP-7076-2011-08-05-2255.patch (6 kB, Niels Basjes)
      7. HADOOP-7076-2011-05-18.patch (43 kB, Niels Basjes)
      8. HADOOP-7076-2011-02-06.patch (42 kB, Niels Basjes)
      9. HADOOP-7076-2011-02-05.patch (42 kB, Niels Basjes)
      10. HADOOP-7076-2011-01-29.patch (41 kB, Niels Basjes)
      11. HADOOP-7076-2011-01-26.patch (40 kB, Niels Basjes)
      12. HADOOP-7076.patch (40 kB, Niels Basjes)

        Issue Links

          Activity

          Niels Basjes added a comment -

          This is the implementation I did to make this idea possible.

          Niels Basjes added a comment -

          The patch

          Niels Basjes added a comment -

          I found a bug I need to fix first.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12466874/HADOOP-7076.patch
          against trunk revision 1051659.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The patch appears to cause tar ant target to fail.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:

          -1 contrib tests. The patch failed contrib unit tests.

          -1 system test framework. The patch failed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/149//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/149//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/149//console

          This message is automatically generated.

          Greg Roelofs added a comment -

          See also PIG-42. That died when Pig stopped doing compression itself, but the approach may be worth considering.

          Alternatively, an LZO-style side index could be generated for any concatenated gzip stream.
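          A rough sketch of what such a side index could look like, assuming plain java.util.zip and an in-memory "file". The class and method names are hypothetical and nothing here comes from PIG-42 or the attached patches. The writer records the compressed offset at which each gzip member starts; a reader can then begin decompressing at any recorded offset instead of at byte 0.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipSideIndexSketch {

    /** A concatenated gzip stream plus the compressed offset of every member. */
    public static final class Indexed {
        public final byte[] data;
        public final List<Integer> memberOffsets;

        Indexed(byte[] data, List<Integer> memberOffsets) {
            this.data = data;
            this.memberOffsets = memberOffsets;
        }
    }

    /** Writes each chunk as its own gzip member and notes where it starts. */
    public static Indexed writeMembers(List<String> chunks) throws IOException {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        List<Integer> offsets = new ArrayList<>();
        for (String chunk : chunks) {
            offsets.add(file.size());
            // Closing the GZIPOutputStream finishes the member; closing the
            // underlying ByteArrayOutputStream has no effect, so we can keep appending.
            try (GZIPOutputStream member = new GZIPOutputStream(file)) {
                member.write(chunk.getBytes(StandardCharsets.UTF_8));
            }
        }
        return new Indexed(file.toByteArray(), offsets);
    }

    /** Decompresses from the start of the given member to the end of the stream. */
    public static String readFrom(Indexed indexed, int memberNumber) throws IOException {
        int offset = indexed.memberOffsets.get(memberNumber);
        InputStream raw = new ByteArrayInputStream(
                indexed.data, offset, indexed.data.length - offset);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(raw)) {
            byte[] buffer = new byte[256];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        List<String> chunks = new ArrayList<>();
        chunks.add("one\n");
        chunks.add("two\n");
        chunks.add("three\n");
        Indexed indexed = writeMembers(chunks);
        System.out.println("member offsets: " + indexed.memberOffsets);
        System.out.print(readFrom(indexed, 1));
    }
}
```

          The index only helps when the file really consists of multiple gzip members; a single-member gzip file has one usable offset, byte 0.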

          Niels Basjes added a comment -

          This patch now passes all unit tests.

          Changes present in this file:

          • Added additional method getBytesRead() to the org.apache.hadoop.io.compress.Decompressor interface to be able to query the position of the underlying file.
          • Added the option to decrease the blocksize used by the DecompressorStream to read the disk file and feed the decompressor (Needed to get the required accuracy).
          • Added SplittableGzipCodec that allows splitting Gzipped input files.
          • Added TestSplittableCodecSeams that tests if all the splits are seamless: No duplicate records and no missing records.
          • Fixed several bugs in TestCodec.java:
            • Reset of the decompressor
            • Writing a number in binary form into a file that is later read and parsed as a text file (now all textual)
          • Naming : no more "Splitable" in the touched unit test files.
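          To illustrate the idea behind a getBytesRead() style query, here is a small sketch that uses java.util.zip.Inflater.getBytesRead() as a stand-in. It is an assumption that this mirrors what the patch needs; the Hadoop Decompressor method added by the patch is not shown here, and the class name is hypothetical. The point is that the decompressor can report how many compressed bytes it has consumed, which is what ties the uncompressed output back to a position in the compressed file.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class BytesReadSketch {

    /** Compresses the input into a single zlib stream. */
    public static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /**
     * Inflates the whole stream and returns
     * { compressed bytes consumed, uncompressed bytes produced }.
     */
    public static long[] inflateAndReport(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] buffer = new byte[64];
        while (!inflater.finished()) {
            int n = inflater.inflate(buffer);
            if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                throw new DataFormatException("truncated or unsupported stream");
            }
        }
        long[] result = { inflater.getBytesRead(), inflater.getBytesWritten() };
        inflater.end();
        return result;
    }

    public static void main(String[] args) throws DataFormatException {
        byte[] input = "hello hello hello hello\n".getBytes(StandardCharsets.UTF_8);
        long[] report = inflateAndReport(deflate(input));
        System.out.println("compressed bytes consumed:   " + report[0]);
        System.out.println("uncompressed bytes produced: " + report[1]);
    }
}
```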
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12467585/HADOOP-7076.patch
          against trunk revision 1055206.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The applied patch generated 1049 javac compiler warnings (more than the trunk's current 1048 warnings).

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          -1 release audit. The applied patch generated 2 release audit warnings (more than the trunk's current 1 warnings).

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/159//testReport/
          Release audit warnings: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/159//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/159//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/159//console

          This message is automatically generated.

          Niels Basjes added a comment -

          Fixed two minor issues:

          • added copyright notice
          • remove the usage of a deprecated method
          Jakob Homan added a comment -

          Canceling for new patch.

          Jakob Homan added a comment -

          re-triggering Hudson on new patch.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12467672/HADOOP-7076.patch
          against trunk revision 1056006.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/161//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/161//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/161//console

          This message is automatically generated.

          Niels Basjes added a comment -

          Added release note.

          Niels Basjes added a comment -

          I got some valuable review comments from Chris Douglas which I've processed in this new version.

          Summary of the changes:

          • The only existing file that is changed is the unit test class TestCodec (fixed bugs + reused unit tests).
          • No other changes to any existing classes or interfaces (so only new files).
          • Refactored the Split Seams test to allow any compressed text file as test input.
          • Tested with synthetic (part of unit test) and several of my own real logfiles.
          • Tested with fixed line lengths (as small as 1 byte) and variable (random) line lengths
          • Tested with both a normal gzip file and a file that is really a concatenation of a lot of very small gzip files.
          • Tested with both the built in and the native gzip implementation.
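          The kind of seam check described in this thread (no duplicate records and no missing records across splits) can be sketched as follows. This is not the TestSplittableCodecSeams code from the patch. It works on uncompressed bytes and assumes a simplified ownership rule: a line belongs to the split that contains its first byte. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitSeamSketch {

    /** Returns the lines whose first byte lies in [start, end). */
    public static List<String> readSplit(byte[] text, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        if (start > 0) {
            // Look at the byte just before the split: unless it is a newline,
            // we are in the middle of a line owned by the previous split,
            // so skip to the first byte after the next newline.
            pos = start - 1;
            while (pos < text.length && text[pos] != '\n') {
                pos++;
            }
            pos++;
        }
        // Emit every line that starts inside this split, even if it ends beyond it.
        while (pos < end && pos < text.length) {
            int lineEnd = pos;
            while (lineEnd < text.length && text[lineEnd] != '\n') {
                lineEnd++;
            }
            records.add(new String(text, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return records;
    }

    /** Cuts the input into fixed-size splits and collects all records in order. */
    public static List<String> readAllSplits(byte[] text, int splitSize) {
        List<String> all = new ArrayList<>();
        for (int start = 0; start < text.length; start += splitSize) {
            int end = Math.min(start + splitSize, text.length);
            all.addAll(readSplit(text, start, end));
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] text = "a\nbb\nccc\n".getBytes(StandardCharsets.UTF_8);
        // Every split size must yield the same records: no duplicates, none missing.
        for (int size = 1; size <= text.length; size++) {
            System.out.println(size + " -> " + readAllSplits(text, size));
        }
    }
}
```

          Trying every split size down to 1 byte forces a seam at each byte position, which matches the spirit of testing with line lengths as small as 1 byte.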
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12469492/HADOOP-7076-2011-01-26.patch
          against trunk revision 1063613.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/200//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/200//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/200//console

          This message is automatically generated.

          Niels Basjes added a comment -

          Changes compared to previous patch

          • Some minor changes
          • Updated the documentation to what the code now really does.
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12469755/HADOOP-7076-2011-01-29.patch
          against trunk revision 1064919.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/209//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/209//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/209//console

          This message is automatically generated.

          Niels Basjes added a comment -

          To clarify the bugs I fixed in the existing compression tests:
          1) The generated test file starts with a line number. In the original version this line number is written in BINARY, and then the file is read as ASCII records with line endings as separators. I'm surprised the test actually worked in the original form. I changed this to ASCII all the way.
          2) The decompressor is reused. But the decompressor must be reset before it can be reused.
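          The reset requirement in point 2 can be illustrated with java.util.zip.Inflater, used here as a stand-in for a Hadoop Decompressor (an assumption; the TestCodec code itself is not shown, and the class name is hypothetical). One Inflater instance handles two independent streams because reset() is called before each one.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class InflaterReuseSketch {

    /** Compresses the input into a single zlib stream. */
    public static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Inflates one complete stream with a (possibly previously used) Inflater. */
    public static byte[] inflate(Inflater inflater, byte[] compressed) throws DataFormatException {
        // Clear any state left over from an earlier stream before reusing the instance.
        inflater.reset();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!inflater.finished()) {
            int n = inflater.inflate(buffer);
            if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                throw new DataFormatException("truncated or unsupported stream");
            }
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws DataFormatException {
        byte[] first = deflate("first stream\n".getBytes(StandardCharsets.UTF_8));
        byte[] second = deflate("second stream\n".getBytes(StandardCharsets.UTF_8));
        Inflater shared = new Inflater();
        System.out.print(new String(inflate(shared, first), StandardCharsets.UTF_8));
        System.out.print(new String(inflate(shared, second), StandardCharsets.UTF_8));
        shared.end();
    }
}
```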

          Niels Basjes added a comment -

          Changes compared to previous version:

          • Removed commented test code
          • Improved code readability (layout + some fixed values are now static final fields)
          • Improved javadoc
          • Minor code changes to fix some PMD issues.
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12470362/HADOOP-7076-2011-02-05.patch
          against trunk revision 1066284.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/219//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/219//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/219//console

          This message is automatically generated.

          Niels Basjes added a comment -

          Fixed a javadoc error that slipped through.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12470415/HADOOP-7076-2011-02-06.patch
          against trunk revision 1066284.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/220//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/220//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/220//console

          This message is automatically generated.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12470415/HADOOP-7076-2011-02-06.patch
          against trunk revision 1071364.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/250//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/250//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/250//console

          This message is automatically generated.

          Niels Basjes added a comment -

          This patch has no code changes compared to the previous one. Only the Javadoc has been improved to provide additional suggestions for optimal usage of this patch.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12479577/HADOOP-7076-2011-05-18.patch
          against trunk revision 1104426.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HADOOP-Build/467//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HADOOP-Build/467//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HADOOP-Build/467//console

          This message is automatically generated.

          Niels Basjes added a comment -

          As requested by Luke Lu I've refactored the patch to match the new MVN structure.

          Niels Basjes added a comment -

          Previous patch was very incomplete.

          Niels Basjes added a comment -

          Resubmitting the patch now that Jenkins seems to be working again.

          Niels Basjes added a comment -

          Updated the patch to match the new source tree.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12506062/HADOOP-7076-2011-12-04-2332.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated 9 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in .

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/432//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/432//console

          This message is automatically generated.

          Eli Collins added a comment -

          The javadoc warnings are unrelated. Filed HADOOP-7881.

          Niels Basjes added a comment -

          Contains the same changes as HADOOP-7076-2011-12-04-2332.patch.
          The only differences between the two files are filenames and offsets within files.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12506182/HADOOP-7076-branch-0.22.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/435//console

          This message is automatically generated.

          Luke Lu added a comment -

          @Niels, can I assume the HADOOP-7076-2011-12-04-2332.patch applies to both trunk and 0.23 branch? For the 0.22 branch patch, the only difference seems to be the directory layout?

          Niels Basjes added a comment -

          @Luke, yes, that is correct.

          Luke Lu added a comment -

          The code, especially the docs/comments, looks good to me. I wonder if the codec would be better named SkipGzipCodec instead of SplittableGzipCodec, for two reasons: the latter would be a good name for a real splittable variant of the gzip format, and the former sounds weird enough to prompt users to read the documentation and find out that the codec actually does O(s*n) I/O. That makes it mostly suitable for infrequent processing of archived gzipped files, with the number of splits kept below the compression factor (uncompressed size / compressed size); this is not a precise criterion, BTW. Otherwise, you'd be better off converting these files into a real splittable compressed format.
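
          To make the O(s*n) figure above concrete, here is a minimal sketch of the I/O cost. It assumes that s is the number of splits, that n is the compressed file size, that the splits are equal in size, and that every task reads the file from offset 0 up to the end of its own split. The function names are hypothetical and are not taken from the patch.

```python
from typing import List


def bytes_read_per_split(file_size: int, num_splits: int) -> List[int]:
    """Compressed bytes each task reads when every task starts at offset 0.

    Assumption: split i covers [i * n / s, (i + 1) * n / s), so its task
    has to read everything from the start of the file up to the split end.
    """
    return [file_size * (i + 1) // num_splits for i in range(num_splits)]


def total_bytes_read(file_size: int, num_splits: int) -> int:
    """Total compressed bytes read by all tasks together."""
    return sum(bytes_read_per_split(file_size, num_splits))


if __name__ == "__main__":
    gib = 1024 ** 3
    for splits in (1, 2, 10):
        total = total_bytes_read(gib, splits)
        print(splits, "splits ->", total / gib, "GiB read in total")
```

          Under these assumptions, a 1 GiB file cut into 10 splits has its tasks read about 5.5 GiB in total, and the task for the last split still reads the whole file.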

          Niels Basjes added a comment -

          @Luke: Thanks for your feedback.

          Regarding the class name:
          The other approach to making gzip input files splittable (HADOOP-6153, which seems quite dead at the moment) is called "RAGzip" (Random Access Gzip) and looks like it was implemented as an extension within the regular GzipCodec class.

          Because my implementation is based upon the GzipCodec class and the SplittableCompressionCodec interface, I chose the most sensible name I could think of: SplittableGzipCodec.

          This codec will and should be disabled by default. The only way to enable it is to read the documentation and follow the instructions described there. That way, I think, users are confronted with the things to consider when using it, including the alternative approaches to processing the data in parallel. From that point of view I do not see the benefit of a different name.

          Overall I still prefer the class name that is in the patch at this moment: SplittableGzipCodec.

          Do you agree?

          Niels Basjes added a comment -

          I talked to someone else and reconsidered Luke's point about the class name. Because there is no way to assess whether someone else will find a way to create a real splittable gzip codec, I've changed the name of my class to SkipSeekSplittableGzipCodec.
          The idea is to make it clear that the intent is to make gzip splittable, yet a workaround is in place that skips data and seeks out the start of the split.
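
          The skip-and-seek idea can be illustrated with a small standalone sketch. This is not the code from the patch: it uses Python's gzip module instead of Hadoop's codec classes, it addresses splits by uncompressed byte offsets for simplicity, and it ignores record boundaries. The function name read_split and its parameters are hypothetical.

```python
import gzip
import io


def read_split(gz_bytes: bytes, start: int, end: int,
               chunk_size: int = 64 * 1024) -> bytes:
    """Return uncompressed bytes [start, end) of a gzip stream.

    Gzip cannot be entered in the middle, so decompression always begins
    at offset 0; everything before `start` is decompressed and discarded.
    """
    out = bytearray()
    pos = 0  # uncompressed offset of the next byte to be read
    with gzip.GzipFile(fileobj=io.BytesIO(gz_bytes)) as stream:
        while pos < end:
            chunk = stream.read(min(chunk_size, end - pos))
            if not chunk:
                break  # reached the end of the data before `end`
            chunk_start = pos
            pos += len(chunk)
            if pos <= start:
                continue  # chunk lies entirely before the split: skip it
            # Keep only the part of the chunk that falls inside the split.
            out += chunk[max(0, start - chunk_start):]
    return bytes(out)


if __name__ == "__main__":
    data = b"".join(b"line %d\n" % i for i in range(1000))
    compressed = gzip.compress(data)
    assert read_split(compressed, 100, 200) == data[100:200]
```

          Each call pays for decompressing everything in front of its split, which is the wasted work the issue description mentions.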

          New patch for branch-0.22 follows shortly.

          Niels Basjes added a comment -

          This is the same patch as HADOOP-7076-2011-12-09.patch, but adapted for branch-0.22.

          This one will fail the automated Hadoop QA because it is non-trunk.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12506791/HADOOP-7076-2011-12-09.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated 5 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in .

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/459//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/459//console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12506793/HADOOP-7076-2011-12-09-branch-0.22.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/460//console

          This message is automatically generated.

          Tim Broberg added a comment -

          See HADOOP-7909 for a nascent effort to make gzip fully splittable.

          Niels Basjes added a comment -

          As discussed on the mailing list: I've turned this feature into a separate jar that you can add to an existing Hadoop installation.
          See: https://github.com/nielsbasjes/splittablegzip

          Robert Joseph Evans added a comment -

          Niels, I am resolving this for bookkeeping. If you feel that this has a large enough following for it to be part of Hadoop proper, please either reopen this JIRA or file a new one.


            People

            • Assignee:
              Niels Basjes
              Reporter:
              Niels Basjes
            • Votes:
              0
              Watchers:
              22

                Development