[HADOOP-1535] Wrong comparator used to merge files in Reduce phase - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.12.3, 0.13.0
Fix Version/s: 0.14.0
Component/s: None
Labels:
None

Description

As per the fix for ~~HADOOP-485~~, we allow users to optionally provide a different comparator to group values when calling the user's Reduce function. Devaraj and I were looking at the code yesterday and we found that in ReduceTask.java, we use the user-supplied comparator to merge the output files from the Map tasks (we use the user-supplied comparator when creating a new SequenceFile.Sorter object). This is incorrect as the comparator used to merge Map output files should be the same as that used to create those files in the Map phase. The user-supplied comparator for grouping values should be used only in the iterator passed to the user's Reduce function (which is done correctly in the code).

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

1535_01.patch
28/Jun/07 09:39
2 kB
Vivek Ratan
1535_02.patch
10/Jul/07 21:37
24 kB
Vivek Ratan

Activity

Ascending order - Click to sort in descending order

Devaraj Das added a comment - 27/Jun/07 05:42

Devaraj and I were looking at the code yesterday and we found that in ReduceTask.java, we use the user-supplied comparator to merge the output files from the Map tasks (we use the user-supplied comparator when creating a new SequenceFile.Sorter object). This is incorrect as the comparator used to merge Map output files should be the same as that used to create those files in the Map phase.

A small clarification - we use the map output key comparator for sorting map outputs and the same comparator must be used for merging them (on the reducer side). Also, we should continue to use the map output key comparator for iterating through the key/value records from the (possibly merged) map outputs; we should use the value-grouping comparator to only decide whether the current key we are looking at is "equal" to the last key that was looked at, while the user's reducer method is iterating through the values for a key.

Devaraj Das added a comment - 27/Jun/07 05:42 Devaraj and I were looking at the code yesterday and we found that in ReduceTask.java, we use the user-supplied comparator to merge the output files from the Map tasks (we use the user-supplied comparator when creating a new SequenceFile.Sorter object). This is incorrect as the comparator used to merge Map output files should be the same as that used to create those files in the Map phase. A small clarification - we use the map output key comparator for sorting map outputs and the same comparator must be used for merging them (on the reducer side). Also, we should continue to use the map output key comparator for iterating through the key/value records from the (possibly merged) map outputs; we should use the value-grouping comparator to only decide whether the current key we are looking at is "equal" to the last key that was looked at, while the user's reducer method is iterating through the values for a key.

Vivek Ratan added a comment - 27/Jun/07 12:12

A related issue: During the Map phase, when we sort key-value pairs, we use the comparator returned by JobConf.getOutputKeyComparator() (in BasicTypeSorterBase::configure()). When we merge files (in MapOutputBuffer::mergeParts()), we use the comparator returned by 'new WritableComparator(keyClass)' (in SequenceFile::Sorter::Sorter()). This is not right, as the exact same comparator should be used both during sort and during merge (as well as during merge in the Reduce phase). There can be situations when JobConf.getOutputKeyComparator() and WritableComparator(keyClass) return different comparators.

Vivek Ratan added a comment - 27/Jun/07 12:12 A related issue: During the Map phase, when we sort key-value pairs, we use the comparator returned by JobConf.getOutputKeyComparator() (in BasicTypeSorterBase::configure()). When we merge files (in MapOutputBuffer::mergeParts()), we use the comparator returned by 'new WritableComparator(keyClass)' (in SequenceFile::Sorter::Sorter()). This is not right, as the exact same comparator should be used both during sort and during merge (as well as during merge in the Reduce phase). There can be situations when JobConf.getOutputKeyComparator() and WritableComparator(keyClass) return different comparators.

Vivek Ratan added a comment - 28/Jun/07 09:39

We use the comparator returned by JobConf.getOutputKeyComparator() for the sort/merge phases of Map and Reduce. We use the comparator returned by JobConf.getOutputValueGroupingComparator() for the iterator across values for a given key. See 1535_01.patch.

Vivek Ratan added a comment - 28/Jun/07 09:39 We use the comparator returned by JobConf.getOutputKeyComparator() for the sort/merge phases of Map and Reduce. We use the comparator returned by JobConf.getOutputValueGroupingComparator() for the iterator across values for a given key. See 1535_01.patch.

Devaraj Das added a comment - 28/Jun/07 10:16

Pls update the patch to stick to the 80 column boundary for the lines there.

Devaraj Das added a comment - 28/Jun/07 10:16 Pls update the patch to stick to the 80 column boundary for the lines there.

Vivek Ratan added a comment - 10/Jul/07 21:37

The attached patch (1535_02.patch) has the code changes, formatted to the 80-char column limit. I also added a set of new unit test cases. There is a new test file TestComparators.java, under src/test/org/apache/hadoop/mapred. This file has 4 tests to check various combinations of default and user-supplied comparators. One of the tests is the same as that in TestUserValueGrouping.java. That files needs to be deleted and TestComparators.java needs to be added to the repository.

Vivek Ratan added a comment - 10/Jul/07 21:37 The attached patch (1535_02.patch) has the code changes, formatted to the 80-char column limit. I also added a set of new unit test cases. There is a new test file TestComparators.java, under src/test/org/apache/hadoop/mapred. This file has 4 tests to check various combinations of default and user-supplied comparators. One of the tests is the same as that in TestUserValueGrouping.java. That files needs to be deleted and TestComparators.java needs to be added to the repository.

Vivek Ratan added a comment - 11/Jul/07 00:13

As per Runping's comments, we don't need this functionality right away.

Vivek Ratan added a comment - 11/Jul/07 00:13 As per Runping's comments, we don't need this functionality right away.

Vivek Ratan added a comment - 11/Jul/07 00:13

Sorry, resolved the wrong bug. This one's still open.

Vivek Ratan added a comment - 11/Jul/07 00:13 Sorry, resolved the wrong bug. This one's still open.

Devaraj Das added a comment - 11/Jul/07 09:01

Devaraj Das added a comment - 11/Jul/07 09:01 +1

Owen O'Malley added a comment - 12/Jul/07 16:40

I just committed this. Thanks Vivek!

Owen O'Malley added a comment - 12/Jul/07 16:40 I just committed this. Thanks Vivek!

Hudson added a comment - 13/Jul/07 11:43

Integrated in Hadoop-Nightly #154 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/154/)

Hudson added a comment - 13/Jul/07 11:43 Integrated in Hadoop-Nightly #154 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/154/ )

People

Assignee:: Vivek Ratan

Reporter:: Vivek Ratan

Votes:: 0 Vote for this issue

Watchers:: 0 Start watching this issue

Dates

Created:: 27/Jun/07 04:56

Updated:: 08/Jul/09 16:52

Resolved:: 12/Jul/07 16:40

Hadoop Common

Details

Description

Attachments

Attachments

Activity

People

Dates