MRUnit
  1. MRUnit
  2. MRUNIT-129

Key object re-use in Reducer is inconsistent with MapReduce behaviour (new API)

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 1.1.0
    • Labels:
      None

      Description

      Follow on issue from MRUNIT-127, also relating to grouping comparators.

      In the new MapReduce API Reducer's reduce method, you are passed a reference to a key and an Iterable for the values. What's not particularly clear is that as you iterate through the values, the value of the key is also changing silently. This is not usually noticeable, since the keys that were grouped often have the same value anyway, but with a custom grouping comparator this isn't necessarily the case.

      This is different to the old API, where you would get the first key in the group and this wouldn't change as you iterate over the values.

      MRUnit uses the "old API" style behaviour for both old and new APIs, so some unit tests may not give a consistent result to an actual mapreduce job.

      1. JIRA_129.patch
        55 kB
        Paulin Sanselme
      2. MRUNIT-129.unittest.patch
        4 kB
        Dave Beech

        Activity

        Hide
        Dave Beech added a comment -

        Uploaded a patch with a failing unit test which demonstrates this problem.

        Show
        Dave Beech added a comment - Uploaded a patch with a failing unit test which demonstrates this problem.
        Hide
        Paulin Sanselme added a comment -

        Hi,
        I've done a patch to fix this issue.

        I've generated the patch from trunk-hadoop1 branch.

        To summarize :

        • The reducer inputs are now List<KeyValueReuseList<K,V>>.
        • KeyValueReuseList<K,V> is a List<Pair<K,V>> which provide an Iterator<V> with the right behaviour (instance reuse + silent key update).
        • I've add a ReduceFeeder class mainly used to do the sorting and grouping of entries (map outputs).

        Paulin

        Show
        Paulin Sanselme added a comment - Hi, I've done a patch to fix this issue. I've generated the patch from trunk-hadoop1 branch. To summarize : The reducer inputs are now List<KeyValueReuseList<K,V>>. KeyValueReuseList<K,V> is a List<Pair<K,V>> which provide an Iterator<V> with the right behaviour (instance reuse + silent key update). I've add a ReduceFeeder class mainly used to do the sorting and grouping of entries (map outputs). Paulin
        Hide
        Brock Noland added a comment -

        It's odd your patch is not showing up... I just made you a contributor, can you try again?

        Show
        Brock Noland added a comment - It's odd your patch is not showing up... I just made you a contributor, can you try again?
        Hide
        Paulin Sanselme added a comment - - edited

        I've updated it with some javadoc and more UT. It should be ok now.

        Show
        Paulin Sanselme added a comment - - edited I've updated it with some javadoc and more UT. It should be ok now.
        Hide
        Brock Noland added a comment -

        Ok, it's a big patch but most of it looks like tests. Unless someone else has time to review, I'll give it a try next week.

        Show
        Brock Noland added a comment - Ok, it's a big patch but most of it looks like tests. Unless someone else has time to review, I'll give it a try next week.
        Hide
        Paulin Sanselme added a comment -

        Did you have some time to review this patch ? Does it need some changes ?

        Show
        Paulin Sanselme added a comment - Did you have some time to review this patch ? Does it need some changes ?
        Hide
        Paulin Sanselme added a comment -

        It seems not many people are interested in this wrong behavior of mrunit.

        Show
        Paulin Sanselme added a comment - It seems not many people are interested in this wrong behavior of mrunit.
        Hide
        Brock Noland added a comment -

        Hey,

        In the ReduceDriver should the methods updateInput and sortAndGroup be public?

        Show
        Brock Noland added a comment - Hey, In the ReduceDriver should the methods updateInput and sortAndGroup be public?
        Hide
        Paulin Sanselme added a comment -

        updateInput ans sortAndGroup are methods of ReduceFeeder.

        ReduceFeeder provides methods for producing easily some reducer inputs in the new format.

        • updateInput is used internally to keep compatibility with the old way of using mrunit. It is public to provide an easy way to update tests where we have only one key and multiple values.
        • sortAndGroup is used internally in MapReduceDriver to sort and group the map outputs before giving them to the reducer. It is public to provide an easy way of testing a Reducer alone with more complex inputs.
        Show
        Paulin Sanselme added a comment - updateInput ans sortAndGroup are methods of ReduceFeeder. ReduceFeeder provides methods for producing easily some reducer inputs in the new format. updateInput is used internally to keep compatibility with the old way of using mrunit. It is public to provide an easy way to update tests where we have only one key and multiple values. sortAndGroup is used internally in MapReduceDriver to sort and group the map outputs before giving them to the reducer. It is public to provide an easy way of testing a Reducer alone with more complex inputs.
        Hide
        Brock Noland added a comment -

        OK I see. It looks good to me! I'll wait a day or so to see if anyone else has feedback.

        Show
        Brock Noland added a comment - OK I see. It looks good to me! I'll wait a day or so to see if anyone else has feedback.
        Hide
        Brock Noland added a comment -

        Thanks for the patch Paul! I commited this to trunk and trunk-hadoop1.

        Show
        Brock Noland added a comment - Thanks for the patch Paul! I commited this to trunk and trunk-hadoop1.
        Hide
        Hudson added a comment -

        Integrated in mrunit-trunk #881 (See https://builds.apache.org/job/mrunit-trunk/881/)
        MRUNIT-129: Key object re-use in Reducer is inconsistent with MapReduce behaviour (new API) (Revision 08bbdfd5295e617f956f31937976f7b75ed60378)

        Result = SUCCESS
        brock : https://git-wip-us.apache.org/repos/asf?p=mrunit.git&a=commit&h=08bbdfd5295e617f956f31937976f7b75ed60378
        Files :

        • src/main/java/org/apache/hadoop/mrunit/mapreduce/ReduceDriver.java
        • src/test/java/org/apache/hadoop/mrunit/mapreduce/TestReducerInputValueReuse.java
        • src/test/java/org/apache/hadoop/mrunit/mapreduce/TestReduceDriver.java
        • src/test/java/org/apache/hadoop/mrunit/mapreduce/TestReduceFeeder.java
        • src/test/java/org/apache/hadoop/mrunit/mapreduce/TestMapReduceDriver.java
        • src/main/java/org/apache/hadoop/mrunit/types/KeyValueReuseList.java
        • src/main/java/org/apache/hadoop/mrunit/mapreduce/ReduceFeeder.java
        • src/main/java/org/apache/hadoop/mrunit/mapreduce/MapReduceDriver.java
        • src/main/java/org/apache/hadoop/mrunit/TestDriver.java
        • src/test/java/org/apache/hadoop/mrunit/internal/util/TestStringUtils.java
        • src/main/java/org/apache/hadoop/mrunit/internal/mapreduce/MockReduceContextWrapper.java
        • src/test/java/org/apache/hadoop/mrunit/types/TestKeyValueReuseList.java
        • src/main/java/org/apache/hadoop/mrunit/internal/util/StringUtils.java
        Show
        Hudson added a comment - Integrated in mrunit-trunk #881 (See https://builds.apache.org/job/mrunit-trunk/881/ ) MRUNIT-129 : Key object re-use in Reducer is inconsistent with MapReduce behaviour (new API) (Revision 08bbdfd5295e617f956f31937976f7b75ed60378) Result = SUCCESS brock : https://git-wip-us.apache.org/repos/asf?p=mrunit.git&a=commit&h=08bbdfd5295e617f956f31937976f7b75ed60378 Files : src/main/java/org/apache/hadoop/mrunit/mapreduce/ReduceDriver.java src/test/java/org/apache/hadoop/mrunit/mapreduce/TestReducerInputValueReuse.java src/test/java/org/apache/hadoop/mrunit/mapreduce/TestReduceDriver.java src/test/java/org/apache/hadoop/mrunit/mapreduce/TestReduceFeeder.java src/test/java/org/apache/hadoop/mrunit/mapreduce/TestMapReduceDriver.java src/main/java/org/apache/hadoop/mrunit/types/KeyValueReuseList.java src/main/java/org/apache/hadoop/mrunit/mapreduce/ReduceFeeder.java src/main/java/org/apache/hadoop/mrunit/mapreduce/MapReduceDriver.java src/main/java/org/apache/hadoop/mrunit/TestDriver.java src/test/java/org/apache/hadoop/mrunit/internal/util/TestStringUtils.java src/main/java/org/apache/hadoop/mrunit/internal/mapreduce/MockReduceContextWrapper.java src/test/java/org/apache/hadoop/mrunit/types/TestKeyValueReuseList.java src/main/java/org/apache/hadoop/mrunit/internal/util/StringUtils.java

          People

          • Assignee:
            Paulin Sanselme
            Reporter:
            Dave Beech
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development