Key: HADOOP-3226
Type: Improvement
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Chris Douglas
Reporter: Chris Douglas
Votes: 0
Watchers: 0

Hadoop Common

Run combiner when merging spills from map output

Created: 10/Apr/08 06:36 AM   Updated: 22/Aug/08 07:50 PM
Component/s: None
Affects Version/s: None
Fix Version/s: 0.18.0

Time Tracking:
Not Specified

File Attachments (all licensed for inclusion in ASF works):
  3226-0.patch   2008-04-10 06:42 AM   Chris Douglas   2 kB
  3226-1.patch   2008-04-22 06:46 PM   Chris Douglas   15 kB
  3226-2.patch   2008-04-25 12:04 AM   Chris Douglas   15 kB
  3226-3.patch   2008-04-29 09:33 PM   Chris Douglas   13 kB

Hadoop Flags: Reviewed, Incompatible change
Release Note:
Changed policy for running combiner. The combiner may be run multiple times as the map's output is sorted and merged. Additionally, it may be run on the reduce side as data is merged. The old semantics are available in Hadoop 0.18 if the user calls:
job.setCombineOnlyOnce(true);
Resolution Date: 07/May/08 09:55 PM
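
The release note above implies that a combiner must now give the same final result however many times it is applied. The following is a minimal sketch of that property in plain Java, not Hadoop code: the class RepeatSafeCombiner and the methods sumCombine and countCombine are hypothetical names used only for illustration. A summing combiner is safe to re-run on its own output, while one that counts records is not.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Hadoop.
class RepeatSafeCombiner {

    // Sums the values for each key. Applying it again to its own output
    // changes nothing, so it tolerates being run multiple times.
    public static Map<String, Long> sumCombine(List<Map.Entry<String, Long>> records) {
        Map<String, Long> out = new LinkedHashMap<>();
        for (Map.Entry<String, Long> r : records) {
            out.merge(r.getKey(), r.getValue(), Long::sum);
        }
        return out;
    }

    // Emits the number of records seen for each key, ignoring the values.
    // A second pass counts the already-combined records, so the result drifts.
    public static Map<String, Long> countCombine(List<Map.Entry<String, Long>> records) {
        Map<String, Long> out = new LinkedHashMap<>();
        for (Map.Entry<String, Long> r : records) {
            out.merge(r.getKey(), 1L, Long::sum);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> input =
            List.of(Map.entry("a", 1L), Map.entry("b", 1L), Map.entry("a", 1L));

        Map<String, Long> sumOnce = sumCombine(input);
        Map<String, Long> sumTwice = sumCombine(new ArrayList<>(sumOnce.entrySet()));
        System.out.println("sum is repeat-safe: " + sumOnce.equals(sumTwice));

        Map<String, Long> countOnce = countCombine(input);
        Map<String, Long> countTwice = countCombine(new ArrayList<>(countOnce.entrySet()));
        System.out.println("count is repeat-safe: " + countOnce.equals(countTwice));
    }
}
```

A job whose combiner behaves like countCombine is the kind of case the job.setCombineOnlyOnce(true) setting mentioned in the release note appears intended for.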


Description:
When merging spills from the map, running the combiner should further diminish the volume of data we send to the reduce.

Chris Douglas added a comment - 10/Apr/08 06:40 AM
This patch is a first pass. It simply takes the RawValueIterator from the merge and runs the combiner.
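
As a rough sketch of the idea in this comment, the following plain-Java example merges several sorted spills and combines records with equal keys as they come off the merge. It does not use Hadoop's RawValueIterator or any Hadoop class; MergeWithCombine, Cursor, and mergeAndCombine are hypothetical names, and summing stands in for whatever combiner a job supplies.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical illustration only; not the code from the patch.
class MergeWithCombine {

    // One cursor per sorted spill; 'head' is the next unread record.
    private static final class Cursor {
        final Iterator<Map.Entry<String, Long>> it;
        Map.Entry<String, Long> head;

        Cursor(Iterator<Map.Entry<String, Long>> it) {
            this.it = it;
            this.head = it.next();
        }

        // Moves to the next record; returns false when the spill is exhausted.
        boolean advance() {
            if (!it.hasNext()) {
                return false;
            }
            head = it.next();
            return true;
        }
    }

    // Merges spills that are each sorted by key, summing runs of equal keys.
    public static List<Map.Entry<String, Long>> mergeAndCombine(
            List<List<Map.Entry<String, Long>>> spills) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
            (x, y) -> x.head.getKey().compareTo(y.head.getKey()));
        for (List<Map.Entry<String, Long>> spill : spills) {
            if (!spill.isEmpty()) {
                heap.add(new Cursor(spill.iterator()));
            }
        }

        List<Map.Entry<String, Long>> out = new ArrayList<>();
        String currentKey = null;
        long sum = 0;
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            Map.Entry<String, Long> rec = c.head;
            // A new key closes the previous run, so emit its combined record.
            if (currentKey != null && !currentKey.equals(rec.getKey())) {
                out.add(Map.entry(currentKey, sum));
                sum = 0;
            }
            currentKey = rec.getKey();
            sum += rec.getValue();
            if (c.advance()) {
                heap.add(c);
            }
        }
        if (currentKey != null) {
            out.add(Map.entry(currentKey, sum));
        }
        return out;
    }
}
```

With three spills holding six records over the keys a, b, and c, the merged output holds three records, which is the reduction in data volume the description is after.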

Owen O'Malley added a comment - 10/Apr/08 07:31 AM
Either in this patch or a similar one, we should also run the combiner on the reduce input merge spills.

Owen O'Malley added a comment - 17/Apr/08 04:43 PM
Resubmitting to Hudson.

Hadoop QA added a comment - 17/Apr/08 06:15 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12379803/3226-0.patch
against trunk revision 645773.

@author +1. The patch does not contain any @author tags.

tests included -1. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2265/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2265/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2265/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2265/console

This message is automatically generated.


Chris Douglas added a comment - 22/Apr/08 06:46 PM - edited
This patch adds a run of the combiner to the reduce-side spills. It also runs the combiner on the map-side merge if there are more than min.num.spills.for.combine spills (6 by default). It adds no new test cases because it changes no behavior and should be covered by the existing mapred test cases.
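
The map-side threshold described in this comment can be sketched as follows. This is plain Java using java.util.Properties as a stand-in for the job configuration; SpillCombinePolicy, minSpills, and shouldCombine are hypothetical names. The strict comparison follows the wording "more than" literally, which is an assumption: the committed code may treat the boundary case differently.

```java
import java.util.Properties;

// Hypothetical illustration only; not the code from the patch.
class SpillCombinePolicy {

    // Property name as given in the comment above.
    static final String MIN_SPILLS_KEY = "min.num.spills.for.combine";

    // Reads the threshold, falling back to the supplied default when unset.
    public static int minSpills(Properties conf, int defaultValue) {
        return Integer.parseInt(
            conf.getProperty(MIN_SPILLS_KEY, Integer.toString(defaultValue)));
    }

    // Assumption: "more than" is read as a strict comparison.
    public static boolean shouldCombine(int numSpills, int minSpillsForCombine) {
        return numSpills > minSpillsForCombine;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        int threshold = minSpills(conf, 6); // default of 6 in this revision of the patch
        System.out.println(shouldCombine(4, threshold));
        System.out.println(shouldCombine(9, threshold));
    }
}
```

The point of such a threshold is that running the combiner has a cost of its own, so it is only worth paying when there are enough spills for the merge to find duplicate keys.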

Hadoop QA added a comment - 23/Apr/08 07:02 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12380717/3226-1.patch
against trunk revision 645773.

@author +1. The patch does not contain any @author tags.

tests included -1. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2303/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2303/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2303/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2303/console

This message is automatically generated.


Chris Douglas added a comment - 25/Apr/08 12:04 AM
After talking with Owen, changed the default minimum number of spills (min.num.spills.for.combine) from 6 to 3.

Hadoop QA added a comment - 25/Apr/08 03:23 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12380884/3226-2.patch
against trunk revision 645773.

@author +1. The patch does not contain any @author tags.

tests included -1. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2321/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2321/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2321/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2321/console

This message is automatically generated.


Owen O'Malley added a comment - 29/Apr/08 08:56 PM
I think we should get rid of the separate spill and merge combiner input/output record counters. The distinction between them would just confuse users.

Chris Douglas added a comment - 29/Apr/08 09:33 PM
Removed separate merge counters and now-unnecessary CombineOutputCollector::setCounter()

Hadoop QA added a comment - 29/Apr/08 10:58 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12381141/3226-3.patch
against trunk revision 645773.

@author +1. The patch does not contain any @author tags.

tests included -1. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2342/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2342/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2342/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2342/console

This message is automatically generated.


Chris Douglas added a comment - 06/May/08 11:07 PM
No tests are added or updated. The existing tests verify the correctness of the results, and the new functionality is difficult to test because deviations from it are not necessarily incorrect.

Owen O'Malley added a comment - 07/May/08 09:55 PM
I just committed this with a few fixes for trunk breakage. Thanks, Chris!

Hudson added a comment - 08/May/08 12:23 PM