Issue Details (XML | Word | Printable)

Key: HADOOP-5539
Type: Bug Bug
Status: Resolved Resolved
Resolution: Fixed
Priority: Blocker Blocker
Assignee: Jothi Padmanabhan
Reporter: Billy Pearson
Votes: 1
Watchers: 7
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

o.a.h.mapred.Merger not maintaining map out compression on intermediate files

Created: 20/Mar/09 06:51 AM   Updated: 08/Jul/09 04:53 PM
Return to search
Component/s: None
Affects Version/s: 0.19.1
Fix Version/s: 0.20.1

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works 5539.patch 2009-03-22 08:59 AM Billy Pearson 5 kB
Text File Licensed for inclusion in ASF works hadoop-5539-branch20.patch 2009-05-27 09:39 AM Jothi Padmanabhan 4 kB
Text File Licensed for inclusion in ASF works hadoop-5539-v1.patch 2009-05-21 03:30 AM Jothi Padmanabhan 5 kB
Text File Licensed for inclusion in ASF works hadoop-5539.patch 2009-05-15 07:18 AM Jothi Padmanabhan 6 kB
Environment: 0.19.2-dev, r753365

Hadoop Flags: Reviewed
Resolution Date: 28/May/09 11:37 AM


 Description  « Hide
hadoop-site.xml :
mapred.compress.map.output = true

map output files are compressed but when the in memory merger closes
on the reduce the on disk merger runs to reduce input files to <= io.sort.factor if needed.

when this happens it outputs files called intermediate.x files these
do not maintain compression setting the writer (o.a.h.mapred.Merger.class line 432)
passes the codec but I added some logging and its always null map output compression set true or false.

This causes task to fail if they can not hold the uncompressed size of the data of the reduce its holding
I thank this is just and oversight of the codec not getting set correctly for the on disk merges.

2009-03-20 01:30:30,005 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 3000
2009-03-20 01:30:30,005 INFO org.apache.hadoop.mapred.Merger: intermediate.1 used codec: null

I added

// added my me
	   if (codec != null){
	     LOG.info("intermediate." + passNo + " used codec: " + codec.toString());
	   } else {
	     LOG.info("intermediate." + passNo + " used codec: Null");
	   }
	   // end added by me

Just before the creation of the writer o.a.h.mapred.Merger.class line 432
and it outputs the second line above.

I have confirmed this with the logging and I have looked at the files on the disk of the tasktracker. I can read the data in
the intermediate files clearly telling me that there not compressed but I can not read the map.out files direct from the map output
telling me the compression is working on the map end but not on the on disk merge that produces the intermediate.

I can see no benefit for these not maintaining the compression setting and as it looks they where intended to maintain it.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Stefan Will added a comment - 20/Mar/09 06:16 PM
I'd like to see this fixed as well since one reason I've enabled map output compression is to reduce disk space usage by the mapreduce framework. It appears that currently the map outputs are simply decompressed as soon as they have been downloaded by the reducer.

Billy Pearson added a comment - 22/Mar/09 04:06 AM
I verified that it is doing the same on map task no intermediate.x file from o.a.h.mapred.Merger are getting compressed.

Billy Pearson added a comment - 22/Mar/09 08:58 AM - edited
this should fix the problem I had to make a few new constructors. I left the old constructors that these files where using because not sure if any other tasks using these. this patch will apply to 0.19-branch I have not worked any on trunk so might need to try dry-run before applying to trunk. tested on my end and working correctly now with this patch.

Billy Pearson made changes - 22/Mar/09 08:58 AM
Field Original Value New Value
Fix Version/s 0.20.0 [ 12313438 ]
Status Open [ 1 ] Patch Available [ 10002 ]
Billy Pearson made changes - 22/Mar/09 08:59 AM
Attachment 5539.patch [ 12403378 ]
Billy Pearson made changes - 22/Mar/09 08:59 AM
Status Patch Available [ 10002 ] Open [ 1 ]
Billy Pearson made changes - 22/Mar/09 08:59 AM
Status Open [ 1 ] Patch Available [ 10002 ]
Hadoop QA added a comment - 22/Mar/09 09:07 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12403378/5539.patch
against trunk revision 756858.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

-1 patch. The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/117/console

This message is automatically generated.


Hadoop QA added a comment - 22/Mar/09 09:12 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12403378/5539.patch
against trunk revision 756858.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

-1 patch. The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/118/console

This message is automatically generated.


Chris Douglas made changes - 22/Mar/09 11:21 PM
Priority Major [ 3 ] Blocker [ 1 ]
Fix Version/s 0.19.2 [ 12313650 ]
Assignee Billy Pearson [ viper799 ]
Billy Pearson added a comment - 24/Mar/09 03:08 AM
added back to read map

Billy Pearson made changes - 24/Mar/09 03:08 AM
Fix Version/s 0.20.0 [ 12313438 ]
Fix Version/s 0.19.2 [ 12313650 ]
Billy Pearson added a comment - 24/Mar/09 03:12 AM - edited
Someone can use my patch as a starting point.

The ReduceTask.java call that is the problem is line 2145
The maptask.java call that is the problem is line 1269

I use streaming without a combiner so that should be looked at also to see if it uses o.a.h.mapred.Merger

the basic problem is the codec is not passed from these function to the merger so its always null the call to
o.a.h.mapred.Merger should include codec somehow if compression is not used then codec is null
in both ReduceTask and Maptask.

I thank this is a major bug that effects all MR jobs with disk bandwidth that uses compression.


Billy Pearson made changes - 24/Mar/09 03:12 AM
Status Patch Available [ 10002 ] Open [ 1 ]
Chris Douglas added a comment - 24/Mar/09 04:06 AM

added back to read map

It's a blocker; it will be resolved and backported to 0.20 at least. The road map isn't; the PA queue defines the set of patches that can be committed. The fix version is usually set when it's actually resolved, so where it was committed is documented.


Chris Douglas made changes - 24/Mar/09 04:06 AM
Fix Version/s 0.19.2 [ 12313650 ]
Fix Version/s 0.20.0 [ 12313438 ]
Status Open [ 1 ] Patch Available [ 10002 ]
Chris Douglas added a comment - 24/Mar/09 05:16 AM
Oh, I see; the patch is for 0.19. My mistake.

Chris Douglas made changes - 24/Mar/09 05:16 AM
Fix Version/s 0.19.2 [ 12313650 ]
Status Patch Available [ 10002 ] Open [ 1 ]
Billy Pearson made changes - 07/May/09 07:27 PM
Assignee Billy Pearson [ viper799 ]
Jothi Padmanabhan added a comment - 12/May/09 08:48 AM
The patch looks good. A few minor points:
  1. The new MergeQueue constructor could call the existing constructor and then set the codec later.
    public MergeQueue(Configuration conf, FileSystem fs, 
                List<Segment<K, V>> segments, RawComparator<K> comparator,
                Progressable reporter, boolean sortSegments, CompressionCodec codec) {
              this(conf, fs, segments, comparator, reporter, sortSegments);
              this.codec = codec;
            }
  1. For the new merge methods, should we place the Codec argument after the valueClass argument (instead of being the last argument) to maintain consistency with the other method that does take the codec argument?

Would you be able to provide patches for trunk and 20-branch as well?


Billy Pearson added a comment - 12/May/09 07:01 PM
I got to many thing going on right now to make a new patch fill free to mod my patch to work the way you want and use it to build a patch for trunk I would like to see this fixed in 0.20.1 if at all possible. this will be the one thing holding me up from upgrading to hbase 0.20 when it becomes ready.

Jothi Padmanabhan added a comment - 15/May/09 07:18 AM
Updated the patch to trunk

Jothi Padmanabhan made changes - 15/May/09 07:18 AM
Attachment hadoop-5539.patch [ 12408232 ]
Jothi Padmanabhan made changes - 15/May/09 07:18 AM
Status Open [ 1 ] Patch Available [ 10002 ]
Jothi Padmanabhan added a comment - 18/May/09 11:14 AM
Could somebody review this patch? Thanks.

Ravi Gummadi added a comment - 19/May/09 06:09 PM
Patch looks good.

This patch clashes with HADOOP-5572. Need to update this patch once HADOOP-5572 gets committed.

HADOOP-5572 changes mergeParts() to call merge() with boolean sortSegments ----- this avoids one new signature of merge() from your patch.


Hadoop QA added a comment - 19/May/09 08:51 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12408232/hadoop-5539.patch
against trunk revision 776352.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 Eclipse classpath. The patch retains Eclipse classpath integrity.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/358/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/358/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/358/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/358/console

This message is automatically generated.


Jothi Padmanabhan added a comment - 21/May/09 03:30 AM
Patch updated to trunk

Jothi Padmanabhan made changes - 21/May/09 03:30 AM
Attachment hadoop-5539-v1.patch [ 12408652 ]
Jothi Padmanabhan made changes - 21/May/09 03:30 AM
Status Patch Available [ 10002 ] Open [ 1 ]
Jothi Padmanabhan made changes - 21/May/09 03:30 AM
Status Open [ 1 ] Patch Available [ 10002 ]
Ravi Gummadi added a comment - 21/May/09 11:18 AM
Patch looks good.
+1

Hadoop QA added a comment - 23/May/09 10:35 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12408652/hadoop-5539-v1.patch
against trunk revision 777761.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 Eclipse classpath. The patch retains Eclipse classpath integrity.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/386/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/386/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/386/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/386/console

This message is automatically generated.


Jothi Padmanabhan added a comment - 27/May/09 09:39 AM
Patch for the 20 branch

Jothi Padmanabhan made changes - 27/May/09 09:39 AM
Attachment hadoop-5539-branch20.patch [ 12409146 ]
Repository Revision Date User Message
ASF #779571 Thu May 28 11:33:56 UTC 2009 ddas HADOOP-5539. Fixes a problem to do with not preserving intermediate output compression for merged data. Contributed by Jothi Padmanabhan and Billy Pearson.
Files Changed
MODIFY /hadoop/core/trunk/CHANGES.txt
MODIFY /hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/MapTask.java
MODIFY /hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/ReduceTask.java
MODIFY /hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/Merger.java

Repository Revision Date User Message
ASF #779572 Thu May 28 11:37:05 UTC 2009 ddas HADOOP-5539. Fixes a problem to do with not preserving intermediate output compression for merged data. Contributed by Jothi Padmanabhan and Billy Pearson.
Files Changed
MODIFY /hadoop/core/branches/branch-0.20/CHANGES.txt
MODIFY /hadoop/core/branches/branch-0.20/src/mapred/org/apache/hadoop/mapred/MapTask.java
MODIFY /hadoop/core/branches/branch-0.20/src/mapred/org/apache/hadoop/mapred/ReduceTask.java
MODIFY /hadoop/core/branches/branch-0.20/src/mapred/org/apache/hadoop/mapred/Merger.java

Devaraj Das added a comment - 28/May/09 11:37 AM
I just committed this. Thanks Jothi and Billy!

Devaraj Das made changes - 28/May/09 11:37 AM
Resolution Fixed [ 1 ]
Fix Version/s 0.19.2 [ 12313650 ]
Fix Version/s 0.20.1 [ 12313866 ]
Assignee Jothi Padmanabhan [ jothipn ]
Hadoop Flags [Reviewed]
Status Patch Available [ 10002 ] Resolved [ 5 ]
Nigel Daley added a comment - 28/May/09 03:11 PM
Why no unit test? Why no javadoc for new methods?

If you tested this manually, what steps did you perform?


Billy Pearson added a comment - 28/May/09 04:15 PM
no commit for 0.19 branch?

Jothi Padmanabhan added a comment - 29/May/09 03:29 AM

Why no unit test? If you tested this manually, what steps did you perform?

It is pretty difficult to write a unit test for this patch as this patch just enables compression during intermediate merges. The files that are created during the intermediate merges are consumed soon after they are created and the final merged file was compressed even without this patch. I did the same test as Billy had done – add print statements in the framework code (Merger.java) to verify if compression was turned on during intermediate merges.

Why no javadoc for new methods?

The newly added methods are in Merger, which is a mapred package private class

no commit for 0.19 branch?

Billy, from this comment https://issues.apache.org/jira/browse/HADOOP-5539?focusedCommentId=12708570&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12708570, we thought you needed this only for 0.20. If you need it for 0.19 branch as well, I can generate a patch for that too.


Billy Pearson added a comment - 29/May/09 04:37 AM
No I do not need it my version patch with my original patch for 0.19 but other might sense there is still a lot of older version in production that will update to 0.19 branch now that it has a few minor releases on it.

Hudson added a comment - 11/Jun/09 07:59 PM

Owen O'Malley made changes - 08/Jul/09 04:53 PM
Component/s mapred [ 12310690 ]