Hadoop Map/Reduce / MAPREDUCE-830

Providing BZip2 splitting support for Text data

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.21.0
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Splitting support for BZip2 Text data

      Description

      HADOOP-4012 (https://issues.apache.org/jira/browse/HADOOP-4012) adds support for handling BZip2-compressed data so that a compressed input file can be split at arbitrary points. This JIRA uses that functionality in LineRecordReader. Currently, a BZip2-compressed Text file is not split and goes to a single mapper. With this change, when a user provides BZip2-compressed Text data, Hadoop splits it so that multiple mappers process it in parallel. BZip2-compressed input can therefore make full use of the cluster, which should yield considerable performance gains.
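      As a rough illustration of why splitting at arbitrary points can still yield whole records, the following minimal Java sketch shows one line-boundary convention a line-oriented reader can follow: a split that does not begin the file discards its first (possibly partial) line, and every split reads any line that begins at or before its end offset. The sketch works on a plain byte array rather than a BZip2 stream, and the names SplitLineSketch and readSplit are hypothetical; it is not the code from the attached patches.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the LineRecordReader code from the patch.
final class SplitLineSketch {

    /** Returns the lines owned by the split covering bytes [start, end) of data. */
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // A split that does not begin the file discards its first (possibly
            // partial) line, because the previous split reads that line.
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }
        List<String> lines = new ArrayList<>();
        // Keep every line that begins at or before 'end', even if it finishes
        // beyond the split boundary.
        while (pos <= end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            lines.add(new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // step past the newline
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbeta\ngamma\ndelta\n".getBytes(StandardCharsets.UTF_8);
        // Three splits whose boundaries do not line up with line starts.
        System.out.println(readSplit(data, 0, 8));
        System.out.println(readSplit(data, 8, 16));
        System.out.println(readSplit(data, 16, data.length));
    }
}
```

      Under this convention each line is returned by exactly one split, so the splits can be handed to separate mappers without losing or duplicating records. For compressed input, the offsets would presumably have to come from the splittable stream rather than from raw file positions.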

      1. MapReduce-830-version1.patch
        10 kB
        Abdul Qadeer
      2. M830-2.patch
        11 kB
        Chris Douglas
      3. M830-3.patch
        28 kB
        Chris Douglas
      4. M830-4.patch
        28 kB
        Chris Douglas
      5. M830-4.patch
        28 kB
        Chris Douglas

          Activity

          Abdul Qadeer added a comment -

          This patch will only compile once HADOOP-4012 is committed and the respective jar files from common are copied to the lib folder of the MapReduce project.

          Chris Douglas added a comment -

          (related comments in HADOOP-4012)

          • Though it's not changed in bzip, since getEnd is part of the API, it should be called in LineRecordReader.
          • Since the codec has state, the API demands that LineRecordReader synchronize on the codec before creating a splittable stream and calling getStart and getEnd to avoid race conditions (unless a better solution is found in HADOOP-4012)
          • The default dir for unit tests is usually "/tmp", not "."
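
          The synchronization point in the second bullet can be sketched as follows. This is only an illustration under stated assumptions: StatefulCodec is a hypothetical stand-in for a codec that remembers the bounds of the last stream it opened, the 100-byte block size and the snapping rule are invented for the example, and none of this is the HADOOP-4012 API. The idea shown is that opening the stream and reading getStart and getEnd happen under one lock on the codec, so another thread cannot overwrite the bounds in between.

```java
// Hypothetical illustration of locking a stateful codec; not the HADOOP-4012 API.
final class CodecLockSketch {

    /** Stand-in codec that records the adjusted bounds of the last stream opened. */
    static final class StatefulCodec {
        private long start;
        private long end;

        // Assumption for the example: compressed blocks begin every 100 bytes,
        // so the requested range is widened to the enclosing block boundaries.
        void open(long requestedStart, long requestedEnd) {
            start = (requestedStart / 100) * 100;
            end = ((requestedEnd + 99) / 100) * 100;
        }

        long getStart() {
            return start;
        }

        long getEnd() {
            return end;
        }
    }

    /** Opens a split and reads its adjusted bounds as one atomic step. */
    static long[] openSplit(StatefulCodec codec, long requestedStart, long requestedEnd) {
        synchronized (codec) {
            codec.open(requestedStart, requestedEnd);
            // Still holding the lock, so these values belong to the call above.
            return new long[] { codec.getStart(), codec.getEnd() };
        }
    }

    public static void main(String[] args) {
        StatefulCodec codec = new StatefulCodec();
        long[] bounds = openSplit(codec, 150, 420);
        System.out.println(bounds[0] + " " + bounds[1]);
    }
}
```

          Without the synchronized block, two readers sharing one codec could interleave their calls, and a reader could see the start and end that belong to the other reader's split.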
          Chris Douglas added a comment -

          Corresponding changes in 4012-12 reflected here, including merge with MAPREDUCE-773

          Chris Douglas added a comment -
          • Fixed mapreduce.lib.input.LineRecordReader (I missed the filePosition updates in the last patch)
          • Added a unit test for the mapreduce code
          • Patched KeyValueLineRecordReader::isSplittable in mapred and mapreduce
          Chris Douglas added a comment -

          (also includes a workaround for MAPREDUCE-959, which was getting irritating, and updates the unit tests to JUnit4 semantics)

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12418869/M830-3.patch
          against trunk revision 813585.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The patch appears to cause tar ant target to fail.

          -1 findbugs. The patch appears to cause Findbugs to fail.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/24/testReport/
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/24/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/24/console

          This message is automatically generated.

          Chris Douglas added a comment -

          Fixed copy/paste bug

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12419221/M830-4.patch
          against trunk revision 813585.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/58/console

          This message is automatically generated.

          Chris Douglas added a comment -

          *grumble* --no-prefix *grumble*

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12419222/M830-4.patch
          against trunk revision 813585.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/59/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/59/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/59/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/59/console

          This message is automatically generated.

          Chris Douglas added a comment -

          +1

          I committed this. Thanks, Abdul!

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #30 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/30/)
          Add support for splittable compression to TextInputFormats. Contributed by Abdul Qadeer

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #27 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/27/)
          Add support for splittable compression to TextInputFormats. Contributed by Abdul Qadeer

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #80 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/80/)
          Add support for splittable compression to TextInputFormats. Contributed by Abdul Qadeer

          Hudson added a comment -

          Integrated in Hdfs-Patch-h5.grid.sp2.yahoo.net #26 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/26/)

          Hudson added a comment -

          Integrated in Hdfs-Patch-h2.grid.sp2.yahoo.net #6 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/6/)
          Add support for splittable compression to TextInputFormats. Contributed by Abdul Qadeer


            People

            • Assignee:
              Abdul Qadeer
              Reporter:
              Abdul Qadeer
            • Votes:
              0
              Watchers:
              8
