Hadoop Common
HADOOP-2399

Input key and value to combiner and reducer should be reused

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.15.1
    • Fix Version/s: 0.17.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      The key and value objects that are given to the Combiner and Reducer are now reused between calls. This is much more efficient, but the user cannot assume the objects are constant between calls.

      Description

      Currently, the input key and value are recreated on every iteration for input to the combiner and reducer. Reusing the keys and values would speed up the system substantially. The downside is that it may break applications that count on holding references to previous keys and values, but I think it is worth doing.
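The pitfall the description warns about can be sketched in plain Java. This is a stand-in iterator that deserializes into a single mutable buffer, not Hadoop's actual ValueIterator; class and variable names here are illustrative only:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a reduce-input iterator that reuses one
// mutable object between calls, as the post-change iterator does.
class ReusingIterator implements Iterator<StringBuilder> {
    private final String[] records;
    private final StringBuilder buffer = new StringBuilder(); // reused every call
    private int pos = 0;

    ReusingIterator(String... records) { this.records = records; }

    public boolean hasNext() { return pos < records.length; }

    public StringBuilder next() {
        buffer.setLength(0);           // overwrite the previous contents
        buffer.append(records[pos++]);
        return buffer;                 // same object on every call
    }
}

public class ReuseDemo {
    public static void main(String[] args) {
        List<StringBuilder> held = new ArrayList<>();
        List<String> copied = new ArrayList<>();
        Iterator<StringBuilder> it = new ReusingIterator("a", "b", "c");
        while (it.hasNext()) {
            StringBuilder v = it.next();
            held.add(v);               // WRONG: all entries alias one buffer
            copied.add(v.toString());  // RIGHT: copy before retaining
        }
        System.out.println(held);      // prints [c, c, c]
        System.out.println(copied);    // prints [a, b, c]
    }
}
```

Holding the reference sees only the last record three times; copying preserves each value.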

      1. 2399-3.patch
        15 kB
        Owen O'Malley
      2. 2399-4.patch
        16 kB
        Owen O'Malley


          Activity

          Owen O'Malley added a comment -

          This Jira was marked as an incompatible change because it did change the semantics. However, without this change there was an allocation (and later garbage collection) for every key and value passed to the reduce, which had measurable performance costs.

          Ralf Gutsche added a comment -

          This piece of code will print different output with Hadoop 0.17 (compared to Hadoop 0.16).

          public void reduce(... Iterator<Writable> aValues ...) throws IOException {
            ArrayList<Writable> ret = new ArrayList<Writable>();

            System.out.println("First");
            while (aValues.hasNext()) {
              Writable val = aValues.next();
              System.out.println(val.toString());
              ret.add(val);
            }

            System.out.println("Second");
            for (Writable w : ret) {
              System.out.println(w.toString());
            }
          }

          In Hadoop 0.16, the values printed after First and Second were the same.
          In Hadoop 0.17, the values printed after First are identical to Hadoop 0.16. However, in Hadoop 0.17, all the records printed after Second are identical.
          Adding a clone (ret.add(val.clone())) will fix this, provided clone is implemented correctly.

          I guess this is the consequence of this JIRA.
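The copy-before-store fix Ralf describes can be sketched with a stand-in for a mutable Writable-like value (this is not Hadoop's Writable API; the IntValue class and reusing() helper are illustrative only):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a mutable Writable-style value type.
class IntValue implements Cloneable {
    int v;
    IntValue(int v) { this.v = v; }
    @Override public IntValue clone() { return new IntValue(v); }
    @Override public String toString() { return Integer.toString(v); }
}

public class CloneFixDemo {
    // Iterator that reuses one IntValue instance between calls,
    // mimicking the post-0.17 reduce-input behavior.
    static Iterator<IntValue> reusing(int... data) {
        IntValue buf = new IntValue(0);
        return new Iterator<IntValue>() {
            int i = 0;
            public boolean hasNext() { return i < data.length; }
            public IntValue next() { buf.v = data[i++]; return buf; }
        };
    }

    public static void main(String[] args) {
        List<IntValue> ret = new ArrayList<>();
        for (Iterator<IntValue> it = reusing(1, 2, 3); it.hasNext(); ) {
            ret.add(it.next().clone()); // copy each value before retaining it
        }
        System.out.println(ret); // prints [1, 2, 3]
    }
}
```

Without the clone() call, every entry in ret would alias the single reused buffer and print the last value.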

          Hudson added a comment -

          Integrated in Hadoop-trunk #444 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/444/ )
          Hudson added a comment -

          Integrated in Hadoop-trunk #427 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/427/ )
          Devaraj Das added a comment -

          I just committed this. Thanks, Owen!

          Owen O'Malley added a comment -

          The test failure is in HDFS, which I didn't change at all.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12377559/2399-4.patch
          against trunk revision 619744.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 8 new or modified tests.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests -1. The patch failed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1939/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1939/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1939/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1939/console

          This message is automatically generated.

          Owen O'Malley added a comment -

          This patch brings it up to the current trunk.

          Devaraj Das added a comment -

          This got affected by HADOOP-1986. Could you please regenerate the patch? Thanks!

          Devaraj Das added a comment -

          +1 (although it would be nice to have benchmark figures)

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12375966/2399-3.patch
          against trunk revision 619744.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 8 new or modified tests.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1817/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1817/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1817/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1817/console

          This message is automatically generated.

          Runping Qi added a comment -

          +1

          The change wrt UniqValueCount class looks fine.

          Owen O'Malley added a comment -

          This patch fixes the value iterator to reuse the key and value between iterations. Aggregation was assuming that the reduce inputs were not reused, so I stringified the value. Is that OK, Runping? I got a minor speed-up of 2:33 instead of 2:37 on a simple 1-node word count.

          Owen O'Malley added a comment -

          This patch changes the ValueIterator to have 2 instances of the key, and 1 instance of the object and reuse the objects during the iteration. I also fixed some of the compiler warnings for unbound generic types in the ValueIterator.

          Owen O'Malley added a comment -

          This is a rough patch that isn't quite working yet.

          Doug Cutting added a comment -

          +1

          As a general rule, I think applications should not expect to be able to hold on to pointers to objects passed to them, but should expect to be able to hold on to pointers returned to them. Lots of exceptions of course, but, in this case, I don't think applications should be expecting to be able to hold on to these objects, and so any that break if we reuse them were not well written.

          These were originally reused. Reuse was removed when the combiner was added, since the original combiner kept pointers to the objects.


            People

             • Assignee:
               Owen O'Malley
             • Reporter:
               Owen O'Malley
             • Votes: 0
             • Watchers: 0
