
Key: MAPREDUCE-830
Type: Improvement
Status: Resolved
Resolution: Fixed
Priority: Major
Assignee: Abdul Qadeer
Reporter: Abdul Qadeer
Votes: 0
Watchers: 5
Hadoop Map/Reduce

Providing BZip2 splitting support for Text data

Created: 06/Aug/09 07:38 PM   Updated: 15/Sep/09 02:25 AM
Component/s: None
Affects Version/s: 0.21.0
Fix Version/s: 0.21.0

Time Tracking:
Not Specified

File Attachments (all text files, licensed for inclusion in ASF works):
  File                            Uploaded               By              Size
  M830-2.patch                    2009-09-02 12:07 AM    Chris Douglas   11 kB
  M830-3.patch                    2009-09-08 02:11 AM    Chris Douglas   28 kB
  M830-4.patch                    2009-09-10 09:59 PM    Chris Douglas   28 kB
  M830-4.patch                    2009-09-10 09:47 PM    Chris Douglas   28 kB
  MapReduce-830-version1.patch    2009-08-06 07:44 PM    Abdul Qadeer    10 kB
Issue Links:
Blocker
 

Hadoop Flags: Reviewed
Release Note: Splitting support for BZip2 Text data
Resolution Date: 11/Sep/09 03:31 AM


Description
HADOOP-4012 (https://issues.apache.org/jira/browse/HADOOP-4012) is adding support for BZip2 compressed data so that a compressed input file can be split at arbitrary points. This JIRA uses that functionality in LineRecordReader. The benefit is that when a user provides BZip2 compressed Text data, Hadoop splits it and multiple mappers process it, so BZip2 compressed data can make full use of the cluster. Currently a BZip2 compressed Text file goes to a single mapper and is not split. The enhancement in this JIRA therefore provides splitting support and a considerable performance gain.
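
To illustrate what it means for several readers to share one Text file, here is a minimal sketch of a common line-split convention: a line belongs to the split in which its first byte falls, so a reader that does not start at offset 0 skips ahead to the next line start, and every reader finishes the last line it began even when that line runs past the split end. The class and method names are hypothetical, the data is uncompressed, and the BZip2 block handling from HADOOP-4012 is not modeled; this is not the LineRecordReader code from the patch.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Toy model of line-oriented split reading (hypothetical; not Hadoop's LineRecordReader). */
final class SplitLineSketch {

  /** Returns the lines whose first byte lies in the byte range [start, end). */
  static List<String> readSplit(byte[] data, int start, int end) {
    int pos = start;
    if (start != 0) {
      // Find the first line start at or after `start`: scan from start - 1
      // to the next newline and step past it. Any line that began earlier
      // is owned by a previous split.
      pos = start - 1;
      while (pos < data.length && data[pos] != '\n') {
        pos++;
      }
      pos++;
    }
    List<String> lines = new ArrayList<>();
    // A line that begins before `end` is read in full, even past `end`.
    while (pos < end && pos < data.length) {
      int eol = pos;
      while (eol < data.length && data[eol] != '\n') {
        eol++;
      }
      lines.add(new String(data, pos, eol - pos, StandardCharsets.UTF_8));
      pos = eol + 1;
    }
    return lines;
  }

  /** Cuts the data into fixed-size splits and concatenates what each reader returns. */
  static List<String> readAll(byte[] data, int splitSize) {
    List<String> all = new ArrayList<>();
    for (int start = 0; start < data.length; start += splitSize) {
      int end = Math.min(start + splitSize, data.length);
      all.addAll(readSplit(data, start, end));
    }
    return all;
  }
}
```

Under this convention every line is emitted exactly once regardless of where the split boundaries fall, which is the property that lets independent mappers work on the same file.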

Abdul Qadeer added a comment - 06/Aug/09 07:44 PM
This patch will only compile once HADOOP-4012 is committed and the respective jar files from common are copied to the lib folder of the MapReduce project.

Chris Douglas added a comment - 30/Aug/09 09:47 PM
(related comments in HADOOP-4012)
  • Though it's not changed in bzip, since getEnd is part of the API, it should be called in LineRecordReader.
  • Since the codec has state, the API demands that LineRecordReader synchronize on the codec before creating a splittable stream and calling getStart and getEnd to avoid race conditions (unless a better solution is found in HADOOP-4012)
  • The default dir for unit tests is usually "/tmp", not "."
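
The synchronization point above can be sketched as follows. This is a toy model under stated assumptions: `ToyStatefulCodec`, `createSplitStream`, and the 100-byte block size are hypothetical stand-ins, not the HADOOP-4012 API. The only thing it is meant to show is that when the adjusted range is kept as codec state, creating the stream and reading `getStart`/`getEnd` have to happen under one lock.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical codec that keeps the last adjusted range as mutable state. */
final class ToyStatefulCodec {
  private long start;
  private long end;

  /** Pretends to open a split stream, snapping the range to 100-byte blocks. */
  void createSplitStream(long requestedStart, long requestedEnd) {
    start = (requestedStart / 100) * 100;
    end = ((requestedEnd + 99) / 100) * 100;
  }

  long getStart() { return start; }

  long getEnd() { return end; }
}

final class CodecLockSketch {

  /** Opens a split and reads back its adjusted range as one atomic step. */
  static long[] openSplit(ToyStatefulCodec codec, long start, long end) {
    synchronized (codec) {
      codec.createSplitStream(start, end);
      return new long[] { codec.getStart(), codec.getEnd() };
    }
  }

  /** Runs several readers against one shared codec and checks each sees its own range. */
  static boolean allConsistent(int threads) {
    ToyStatefulCodec codec = new ToyStatefulCodec();
    AtomicBoolean ok = new AtomicBoolean(true);
    Thread[] workers = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      final long s = i * 1000L + 50;
      workers[i] = new Thread(() -> {
        for (int n = 0; n < 1000; n++) {
          long[] range = openSplit(codec, s, s + 100);
          if (range[0] != s - 50 || range[1] != s + 150) {
            ok.set(false);
          }
        }
      });
      workers[i].start();
    }
    for (Thread t : workers) {
      try {
        t.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return ok.get();
  }
}
```

Without the `synchronized` block, one thread could call `createSplitStream` between another thread's call and its `getStart`/`getEnd`, and the second thread would read a range that belongs to a different split.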

Chris Douglas added a comment - 02/Sep/09 12:07 AM
Corresponding changes in 4012-12 reflected here, including merge with MAPREDUCE-773

Chris Douglas added a comment - 08/Sep/09 02:11 AM
  • Fixed mapreduce.lib.input.LineRecordReader (I missed the filePosition updates in the last patch)
  • Added a unit test for the mapreduce code
  • Patched KeyValueLineRecordReader::isSplittable in mapred and mapreduce
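
For context, one plausible shape for such a splittability check is sketched below. The patch itself is not reproduced here; `ToyCodec` and `ToySplittableCodec` are hypothetical stand-ins for the real codec types, and the rule (uncompressed files split, compressed files split only when the codec supports it) is an assumption about the intent rather than a quote from the code.

```java
/** Hypothetical stand-in for a compression codec. */
interface ToyCodec {}

/** Hypothetical marker for codecs that can start decoding mid-file. */
interface ToySplittableCodec extends ToyCodec {}

final class SplittableCheckSketch {

  /**
   * Decides whether a file may be cut into several splits.
   * A null codec models an uncompressed file.
   */
  static boolean isSplittable(ToyCodec codec) {
    if (codec == null) {
      return true; // plain text can always be split
    }
    // Compressed input is split only when the codec supports it.
    return codec instanceof ToySplittableCodec;
  }
}
```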

Chris Douglas added a comment - 08/Sep/09 02:12 AM
(also includes a workaround for MAPREDUCE-959, which was getting irritating, and updates the unit tests to JUnit4 semantics)

Hadoop QA added a comment - 10/Sep/09 09:38 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12418869/M830-3.patch
against trunk revision 813585.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 6 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

-1 javac. The patch appears to cause tar ant target to fail.

-1 findbugs. The patch appears to cause Findbugs to fail.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/24/testReport/
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/24/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/24/console

This message is automatically generated.


Chris Douglas added a comment - 10/Sep/09 09:47 PM
Fixed copy/paste bug

Hadoop QA added a comment - 10/Sep/09 09:57 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12419221/M830-4.patch
against trunk revision 813585.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 6 new or modified tests.

-1 patch. The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/58/console

This message is automatically generated.


Chris Douglas added a comment - 10/Sep/09 09:59 PM
*grumble* --no-prefix *grumble*

Hadoop QA added a comment - 11/Sep/09 12:53 AM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12419222/M830-4.patch
against trunk revision 813585.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 6 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/59/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/59/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/59/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/59/console

This message is automatically generated.


Chris Douglas added a comment - 11/Sep/09 03:31 AM
+1

I committed this. Thanks, Abdul!


Hudson added a comment - 11/Sep/09 04:13 AM
Integrated in Hadoop-Mapreduce-trunk-Commit #30 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/30/)
Add support for splittable compression to TextInputFormats. Contributed by Abdul Qadeer

Hudson added a comment - 11/Sep/09 04:47 AM
Integrated in Hadoop-Hdfs-trunk-Commit #27 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/27/)
Add support for splittable compression to TextInputFormats. Contributed by Abdul Qadeer

Hudson added a comment - 11/Sep/09 01:54 PM
Integrated in Hadoop-Hdfs-trunk #80 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/80/)
Add support for splittable compression to TextInputFormats. Contributed by Abdul Qadeer

Hudson added a comment - 14/Sep/09 09:05 PM
Integrated in Hdfs-Patch-h5.grid.sp2.yahoo.net #26 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/26/)

Hudson added a comment - 15/Sep/09 02:25 AM
Integrated in Hdfs-Patch-h2.grid.sp2.yahoo.net #6 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/6/)
Add support for splittable compression to TextInputFormats. Contributed by Abdul Qadeer