Hadoop Common / HADOOP-2399

Input key and value to combiner and reducer should be reused

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.15.1
    • Fix Version/s: 0.17.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      The key and value objects that are given to the Combiner and Reducer are now reused between calls. This is much more efficient, but the user can not assume the objects are constant.

      Description

      Currently, the input key and value are recreated on every iteration for input to the combiner and reducer. It would speed up the system substantially if we reused the keys and values. The downside is that it may break applications that count on holding references to previous keys and values, but I think it is worth doing.
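The hazard described above can be sketched without any Hadoop dependencies. The snippet below is illustrative only: `IntRecord` is a hypothetical stand-in for a mutable Writable, and `reusingIterator` mimics the post-change value iterator by refilling a single buffer. Holding the returned reference is broken; copying before storing is safe.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Minimal stand-in for a mutable record type (like Hadoop's Writable).
// The "framework" reuses ONE instance and overwrites it between next() calls.
public class ReusePitfall {
    static class IntRecord {
        int value;
        IntRecord(int v) { value = v; }
        IntRecord copy() { return new IntRecord(value); }
    }

    // Simulates a reuse-style value iterator: a single buffer is refilled.
    static Iterator<IntRecord> reusingIterator(int[] data) {
        final IntRecord buffer = new IntRecord(0);
        return new Iterator<IntRecord>() {
            int i = 0;
            public boolean hasNext() { return i < data.length; }
            public IntRecord next() { buffer.value = data[i++]; return buffer; }
        };
    }

    // Broken: stores the reused reference, so every held entry ends up
    // reflecting only the last record.
    static List<Integer> collectByReference(int[] data) {
        List<IntRecord> held = new ArrayList<IntRecord>();
        Iterator<IntRecord> it = reusingIterator(data);
        while (it.hasNext()) held.add(it.next());
        List<Integer> out = new ArrayList<Integer>();
        for (IntRecord r : held) out.add(r.value);
        return out;
    }

    // Correct: copies each record before holding on to it.
    static List<Integer> collectByCopy(int[] data) {
        List<IntRecord> held = new ArrayList<IntRecord>();
        Iterator<IntRecord> it = reusingIterator(data);
        while (it.hasNext()) held.add(it.next().copy());
        List<Integer> out = new ArrayList<Integer>();
        for (IntRecord r : held) out.add(r.value);
        return out;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3};
        System.out.println(collectByReference(data)); // [3, 3, 3]
        System.out.println(collectByCopy(data));      // [1, 2, 3]
    }
}
```

This is the same failure mode Ralf Gutsche reports in the comments below: code that buffered iterator values by reference worked before this change and silently broke after it.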

      1. 2399-4.patch
        16 kB
        Owen O'Malley
      2. 2399-3.patch
        15 kB
        Owen O'Malley

        Issue Links

          Activity

          Transition | Time In Source Status | Execution Times | Last Executer | Last Execution Date
          Open -> In Progress | 43d 18h 14m | 1 | Owen O'Malley | 23/Jan/08 17:46
          In Progress -> Patch Available | 27d 3h 48m | 1 | Owen O'Malley | 19/Feb/08 21:34
          Patch Available -> Open | 12d 15h 39m | 1 | Devaraj Das | 03/Mar/08 13:14
          Open -> Patch Available | 7d 10h 29m | 1 | Owen O'Malley | 10/Mar/08 23:43
          Patch Available -> Resolved | 2d 4h 30m | 1 | Devaraj Das | 13/Mar/08 04:14
          Resolved -> Closed | 69d 15h 51m | 1 | Nigel Daley | 21/May/08 21:05
          Owen O'Malley made changes -
          Component/s mapred [ 12310690 ]
          Owen O'Malley added a comment -

          This Jira was marked as an incompatible change because it did change the semantics. However, without this change there was an allocation (and later garbage collection) for every key and value passed to the reduce, which had measurable performance costs.

          Ralf Gutsche added a comment -

          This piece of code will print different output with Hadoop 17 (compared to Hadoop 16).

          public void reduce(... Iterator<Writable> aValues ...) throws IOException {
            ArrayList<Writable> ret = new ArrayList<Writable>();

            System.out.println("First");
            while (aValues.hasNext()) {
              Writable val = aValues.next();
              System.out.println(val.toString());
              ret.add(val);
            }

            System.out.println("Second");
            for (Writable w : ret) {
              System.out.println(w.toString());
            }
          }

          In Hadoop 16, the values printed after First and Second were the same.
          In Hadoop 17, the values printed after First are identical to Hadoop 16. However, in Hadoop 17, all the records printed after Second are identical.
          Adding a clone (ret.add(val.clone())) will fix this, if the clone is implemented correctly.

          I guess this is the consequence of this JIRA.

          Nigel Daley made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Owen O'Malley made changes -
          Release Note The key and value objects that are given to the Combiner and Reducer are now reused between calls. This is much more efficient, but the user can not assume the objects are constant.
          Hadoop Flags [Incompatible change, Reviewed]
          Hudson added a comment -

          Integrated in Hadoop-trunk #444 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/444/ )
          Hudson added a comment -

          Integrated in Hadoop-trunk #427 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/427/ )
          Tom White made changes -
          Link This issue relates to HADOOP-2997 [ HADOOP-2997 ]
          Devaraj Das made changes -
          Resolution Fixed [ 1 ]
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Devaraj Das added a comment -

          I just committed this. Thanks, Owen!

          Owen O'Malley added a comment -

          The test failure is in HDFS, which I didn't change at all.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12377559/2399-4.patch
          against trunk revision 619744.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 8 new or modified tests.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests -1. The patch failed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1939/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1939/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1939/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1939/console

          This message is automatically generated.

          Owen O'Malley made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Owen O'Malley made changes -
          Attachment 2399-4.patch [ 12377559 ]
          Owen O'Malley added a comment -

          This patch brings it up to the current trunk.

          Owen O'Malley made changes -
          Attachment reuse-obj.patch [ 12372937 ]
          Owen O'Malley made changes -
          Attachment reuse-obj-2.patch [ 12375629 ]
          Devaraj Das made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Devaraj Das added a comment -

          This got affected by HADOOP-1986. Could you please regenerate the patch? Thanks!

          Devaraj Das added a comment -

          +1 (although it would be nice to have benchmark figures)

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12375966/2399-3.patch
          against trunk revision 619744.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 8 new or modified tests.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1817/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1817/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1817/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1817/console

          This message is automatically generated.

          Runping Qi added a comment -

          +1

          The change wrt UniqValueCount class looks fine.

          Owen O'Malley made changes -
          Status In Progress [ 3 ] Patch Available [ 10002 ]
          Fix Version/s 0.17.0 [ 12312913 ]
          Owen O'Malley made changes -
          Attachment 2399-3.patch [ 12375966 ]
          Owen O'Malley added a comment -

          This patch fixes the value iterator to reuse the key and value between iterations. Aggregation was assuming that the reduce inputs were not reused, so I stringified the value. Is that ok, Runping? I got a minor speed up of 2:33 instead of 2:37 on a simple 1 node word count.

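The stringification workaround Owen mentions can be illustrated in miniature. The class and method names below are hypothetical stand-ins, not Hadoop's actual aggregation code; the point is simply that an immutable String snapshot of a reused mutable buffer is safe to store, while the buffer itself is not.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Why "stringifying" sidesteps the reuse problem: a String snapshot of a
// mutable buffer is immutable, so later overwrites of the buffer cannot
// change values already stored. Illustrative sketch, not Hadoop's code.
public class StringifyForUniq {
    // A mutable holder standing in for a reused Writable value.
    static final StringBuilder reusedValue = new StringBuilder();

    // Counts distinct values by storing immutable String copies,
    // analogous to the fix applied to the aggregation classes.
    static int uniqCount(String[] stream) {
        Set<String> seen = new LinkedHashSet<String>();
        for (String s : stream) {
            reusedValue.setLength(0);
            reusedValue.append(s);            // "framework" refills the same buffer
            seen.add(reusedValue.toString()); // snapshot: safe to hold
        }
        return seen.size();
    }
}
```

Had the code added `reusedValue` itself to the set, every entry would alias one buffer and the distinct count would collapse to one.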
          Owen O'Malley made changes -
          Attachment reuse-obj-2.patch [ 12375629 ]
          Owen O'Malley added a comment -

          This patch changes the ValueIterator to keep two instances of the key and one instance of the value object, and to reuse those objects during the iteration. I also fixed some of the compiler warnings for unbound generic types in the ValueIterator.

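A minimal sketch of the two-key-buffer idea, with illustrative names rather than Hadoop's real ValueIterator internals: keeping two reusable key buffers and swapping them lets the iterator detect key-group boundaries (current key vs. next key) without allocating a fresh key object per record.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a double-buffered key comparison. Names are hypothetical;
// only the allocation pattern mirrors the patch described above.
public class DoubleBufferedKeys {
    // A tiny reusable text buffer, standing in for a mutable key type.
    static class TextBuf {
        private final StringBuilder sb = new StringBuilder();
        void set(String s) { sb.setLength(0); sb.append(s); }
        boolean sameAs(TextBuf other) { return sb.toString().equals(other.sb.toString()); }
        @Override public String toString() { return sb.toString(); }
    }

    // Walks a sorted (key, value) stream and returns the distinct key
    // groups, using exactly two key buffers for the whole iteration.
    static List<String> groupKeys(String[][] sortedPairs) {
        TextBuf current = new TextBuf();
        TextBuf next = new TextBuf();
        List<String> groups = new ArrayList<String>();
        boolean first = true;
        for (String[] pair : sortedPairs) {
            next.set(pair[0]);
            if (first || !next.sameAs(current)) {
                groups.add(next.toString()); // a new key group starts here
            }
            // Swap buffers instead of allocating a new key object.
            TextBuf tmp = current; current = next; next = tmp;
            first = false;
        }
        return groups;
    }
}
```

The swap is the whole trick: the record just read becomes "current", and the buffer holding the previous key is recycled for the next read, so the per-record allocation (and later garbage collection) that motivated this issue disappears.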
          Owen O'Malley made changes -
          Status Open [ 1 ] In Progress [ 3 ]
          Nigel Daley made changes -
          Fix Version/s 0.16.0 [ 12312740 ]
          Owen O'Malley made changes -
          Field Original Value New Value
          Attachment reuse-obj.patch [ 12372937 ]
          Owen O'Malley added a comment -

          This is a rough patch that isn't quite working yet.

          Doug Cutting added a comment -

          +1

          As a general rule, I think applications should not expect to be able to hold on to pointers to objects passed to them, but should expect to be able to hold on to pointers returned to them. Lots of exceptions of course, but, in this case, I don't think applications should be expecting to be able to hold on to these objects, and so any that break if we reuse them were not well written.

          These were originally reused. Reuse was removed when the combiner was added, since the original combiner kept pointers to the objects.

          Owen O'Malley created issue -

            People

            • Assignee:
              Owen O'Malley
              Reporter:
              Owen O'Malley
            • Votes: 0
              Watchers: 0
