Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-1535

Wrong comparator used to merge files in Reduce phase

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 0.12.3, 0.13.0
    • 0.14.0
    • None
    • None

    Description

      As per the fix for HADOOP-485, we allow users to optionally provide a different comparator to group values when calling the user's Reduce function. Devaraj and I were looking at the code yesterday and we found that in ReduceTask.java, we use the user-supplied comparator to merge the output files from the Map tasks (we use the user-supplied comparator when creating a new SequenceFile.Sorter object). This is incorrect as the comparator used to merge Map output files should be the same as that used to create those files in the Map phase. The user-supplied comparator for grouping values should be used only in the iterator passed to the user's Reduce function (which is done correctly in the code).

      Attachments

        1. 1535_01.patch
          2 kB
          Vivek Ratan
        2. 1535_02.patch
          24 kB
          Vivek Ratan

        Activity

          ddas Devaraj Das added a comment -

          Devaraj and I were looking at the code yesterday and we found that in ReduceTask.java, we use the user-supplied comparator to merge the output files from the Map tasks (we use the user-supplied comparator when creating a new SequenceFile.Sorter object). This is incorrect as the comparator used to merge Map output files should be the same as that used to create those files in the Map phase.

          A small clarification - we use the map output key comparator for sorting map outputs and the same comparator must be used for merging them (on the reducer side). Also, we should continue to use the map output key comparator for iterating through the key/value records from the (possibly merged) map outputs; we should use the value-grouping comparator to only decide whether the current key we are looking at is "equal" to the last key that was looked at, while the user's reducer method is iterating through the values for a key.

          ddas Devaraj Das added a comment - Devaraj and I were looking at the code yesterday and we found that in ReduceTask.java, we use the user-supplied comparator to merge the output files from the Map tasks (we use the user-supplied comparator when creating a new SequenceFile.Sorter object). This is incorrect as the comparator used to merge Map output files should be the same as that used to create those files in the Map phase. A small clarification - we use the map output key comparator for sorting map outputs and the same comparator must be used for merging them (on the reducer side). Also, we should continue to use the map output key comparator for iterating through the key/value records from the (possibly merged) map outputs; we should use the value-grouping comparator to only decide whether the current key we are looking at is "equal" to the last key that was looked at, while the user's reducer method is iterating through the values for a key.
          vivekr Vivek Ratan added a comment -

          A related issue: During the Map phase, when we sort key-value pairs, we use the comparator returned by JobConf.getOutputKeyComparator() (in BasicTypeSorterBase::configure()). When we merge files (in MapOutputBuffer::mergeParts()), we use the comparator returned by 'new WritableComparator(keyClass)' (in SequenceFile::Sorter::Sorter()). This is not right, as the exact same comparator should be used both during sort and during merge (as well as during merge in the Reduce phase). There can be situations when JobConf.getOutputKeyComparator() and WritableComparator(keyClass) return different comparators.

          vivekr Vivek Ratan added a comment - A related issue: During the Map phase, when we sort key-value pairs, we use the comparator returned by JobConf.getOutputKeyComparator() (in BasicTypeSorterBase::configure()). When we merge files (in MapOutputBuffer::mergeParts()), we use the comparator returned by 'new WritableComparator(keyClass)' (in SequenceFile::Sorter::Sorter()). This is not right, as the exact same comparator should be used both during sort and during merge (as well as during merge in the Reduce phase). There can be situations when JobConf.getOutputKeyComparator() and WritableComparator(keyClass) return different comparators.
          vivekr Vivek Ratan added a comment -

          We use the comparator returned by JobConf.getOutputKeyComparator() for the sort/merge phases of Map and Reduce. We use the comparator returned by JobConf.getOutputValueGroupingComparator() for the iterator across values for a given key. See 1535_01.patch.

          vivekr Vivek Ratan added a comment - We use the comparator returned by JobConf.getOutputKeyComparator() for the sort/merge phases of Map and Reduce. We use the comparator returned by JobConf.getOutputValueGroupingComparator() for the iterator across values for a given key. See 1535_01.patch.
          ddas Devaraj Das added a comment -

          Pls update the patch to stick to the 80 column boundary for the lines there.

          ddas Devaraj Das added a comment - Pls update the patch to stick to the 80 column boundary for the lines there.
          vivekr Vivek Ratan added a comment -

          The attached patch (1535_02.patch) has the code changes, formatted to the 80-char column limit. I also added a set of new unit test cases. There is a new test file TestComparators.java, under src/test/org/apache/hadoop/mapred. This file has 4 tests to check various combinations of default and user-supplied comparators. One of the tests is the same as that in TestUserValueGrouping.java. That files needs to be deleted and TestComparators.java needs to be added to the repository.

          vivekr Vivek Ratan added a comment - The attached patch (1535_02.patch) has the code changes, formatted to the 80-char column limit. I also added a set of new unit test cases. There is a new test file TestComparators.java, under src/test/org/apache/hadoop/mapred. This file has 4 tests to check various combinations of default and user-supplied comparators. One of the tests is the same as that in TestUserValueGrouping.java. That files needs to be deleted and TestComparators.java needs to be added to the repository.
          vivekr Vivek Ratan added a comment -

          As per Runping's comments, we don't need this functionality right away.

          vivekr Vivek Ratan added a comment - As per Runping's comments, we don't need this functionality right away.
          vivekr Vivek Ratan added a comment -

          Sorry, resolved the wrong bug. This one's still open.

          vivekr Vivek Ratan added a comment - Sorry, resolved the wrong bug. This one's still open.
          ddas Devaraj Das added a comment -

          +1

          ddas Devaraj Das added a comment - +1
          omalley Owen O'Malley added a comment -

          I just committed this. Thanks Vivek!

          omalley Owen O'Malley added a comment - I just committed this. Thanks Vivek!
          hudson Hudson added a comment -
          hudson Hudson added a comment - Integrated in Hadoop-Nightly #154 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/154/ )

          People

            vivekr Vivek Ratan
            vivekr Vivek Ratan
            Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: