Pig
  1. Pig
  2. PIG-2228

support partial aggregation in map task

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.0
    • Component/s: None
    • Labels:
      None
    • Release Note:
      Hide
      In map aggregation will aggregate records before it sends them to combiner. It reduces the serializing/deserializing costs in use of combiner, by sending fewer records to the combiner.
      It has been shown to speed up the map task in a group by query by up to 50%.
      Since this is a very new feature, it currently turned off by default. To turn it on, set the property pig.exec.mapPartAgg to true.

      If the group-by keys used for grouping don't result in sufficient reduction in number of records, the performance might be worse with this feature turned on. To prevent that from happening, the feature turns itself off if the reduction in records sent to combiner is not more than a configurable threshold. This threshold can be set using the property pig.exec.mapPartAgg.minReduction . It is set to a default value of 10, which means that the number of records that get sent to the combiner should be reduced by a factor of 10.
      Show
      In map aggregation will aggregate records before it sends them to combiner. It reduces the serializing/deserializing costs in use of combiner, by sending fewer records to the combiner. It has been shown to speed up the map task in a group by query by up to 50%. Since this is a very new feature, it currently turned off by default. To turn it on, set the property pig.exec.mapPartAgg to true. If the group-by keys used for grouping don't result in sufficient reduction in number of records, the performance might be worse with this feature turned on. To prevent that from happening, the feature turns itself off if the reduction in records sent to combiner is not more than a configurable threshold. This threshold can be set using the property pig.exec.mapPartAgg.minReduction . It is set to a default value of 10, which means that the number of records that get sent to the combiner should be reduced by a factor of 10.

      Description

      Introduction

      Pig does (sort based) partial aggregation in map side through the use of combiner. MR serializes the output of map to a buffer, sorts it on the keys, deserializes and passes the values grouped on the keys to combiner phase. The same work of combiner can be done in the map phase itself by using a hash-map on the keys. This hash based (partial) aggregation can be done with or without a combiner phase.

      Benefits

      It will send fewer records to combiner and thereby -

      • Save on cost of serializing and de-serializing
      • Save on cost of lock calls on the combiner input buffer. (I have found this to be a significant cost for a query that was doing multiple group-by's in a single MR job. -Thejas)
      • The problem of running out of memory in reduce side, for queries like COUNT(distinct col) can be avoided. The OOM issue happens because very large records get created after the combiner run on merged reduce input. In case of combiner, you have no way of telling MR not to combine records in reduce side. The workaround is to disable combiner completely, and the opportunity to reduce map output size is lost.
      • When the foreach after group-by has both algebraic and non-algebraic functions, or if a bag is being projected, the combiner is not used. This is because the data size reduction in typical cases are not significant enough to justify the additional (de)serialization costs. But hash based aggregation can be used in such cases as well.
      • It is possible to turn off the in-map combine automatically if there is not enough 'combination' that is taking place to justify the overhead of the in-map combiner. (Idea borrowed from Hive jira.)
      • If input data is sorted, it is possible to do efficient map side (partial) aggregation with in-map combiner.

      Design proposal is here - https://cwiki.apache.org/confluence/display/PIG/PigInMapCombinerProposal

      1. PIG-2228.1.patch
        57 kB
        Thejas M Nair
      2. PIG-2228.2.patch
        76 kB
        Thejas M Nair
      3. PIG-2228.3.patch
        79 kB
        Thejas M Nair
      4. PIG-2228.4.patch
        83 kB
        Thejas M Nair
      5. PIG-2228.5.patch
        85 kB
        Thejas M Nair
      6. PIG-2228.6.patch
        88 kB
        Thejas M Nair

        Issue Links

          Activity

          Hide
          Thejas M Nair added a comment -

          PIG-2228.1.patch - initial patch , will be adding more test cases in another patch

          Show
          Thejas M Nair added a comment - PIG-2228 .1.patch - initial patch , will be adding more test cases in another patch
          Hide
          Thejas M Nair added a comment -

          PIG-2228.2.patch - patch with junit test cases. Will create another patch which includes e2e test cases.

          Show
          Thejas M Nair added a comment - PIG-2228 .2.patch - patch with junit test cases. Will create another patch which includes e2e test cases.
          Hide
          Thejas M Nair added a comment -

          PIG-2228.3.patch - patch with e2e test cases

          Show
          Thejas M Nair added a comment - PIG-2228 .3.patch - patch with e2e test cases
          Hide
          Dmitriy V. Ryaboy added a comment -

          Thejas, looks like a good bit of work! Do you have any benchmark results on this yet?

          It'd be great if you could post this on ReviewBoard, easier to comment that way.

          Show
          Dmitriy V. Ryaboy added a comment - Thejas, looks like a good bit of work! Do you have any benchmark results on this yet? It'd be great if you could post this on ReviewBoard, easier to comment that way.
          Hide
          Thejas M Nair added a comment -

          Do you have any benchmark results on this yet?

          I am running some performance tests, I will publish numbers in a day or two. I will do 'submit patch' after that.

          It'd be great if you could post this on ReviewBoard, easier to comment that way.

          Will do that once I am done with the performance tests.

          Show
          Thejas M Nair added a comment - Do you have any benchmark results on this yet? I am running some performance tests, I will publish numbers in a day or two. I will do 'submit patch' after that. It'd be great if you could post this on ReviewBoard, easier to comment that way. Will do that once I am done with the performance tests.
          Hide
          Thejas M Nair added a comment -

          Here are results of the performance benchmark I ran.
          The summary: There is improvement of upto 50% in map run time with map partial aggregation with in-map partial aggregation. But if there is not enough reduction number of map output records, there is a performance degradation . It makes sense to turn off this feature in that case.

          The input data was generated from pigmix page_views data with 40M records so that the large map and bag fields which were not used the tests are not included. The query used to create the input data -

          A = load '/user/pig/tests/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
              as (user : chararray, action : int, timespent : int , query_term :chararray, ip_addr : chararray, timestamp : long);
          store A into '/user/pig/tests/pigmix/page_views/pv_part';
          

          The map average runtime printed at end of

          Query Change in number of records Map Input -> Map Output -> Reduce input recs map avergage runtime improvement wrt trunk group keys projection in foreach after group-by
          group on many columns (groupall.pig) 40M -> 39.98M -> 39.79M -14% user, action,query_term,ip_addr,timestamp user,action, qt,ip,ts, AVG(B.timespent);
          group on single column with high cardinality(singlegroup_user.pig) 40M records -> 12 M -> 6.2M 8% user user, AVG(B.timespent)
          group on single column with low cardinality (singlegroup_action.pig) 40M -> 38 -> 38 38.6% action action, AVG(B.timespent)
          group-all (group_allstar.pig) 40M -> 10 -> 10 52% 'all' MAX(B.timestamp), SUM(B.action), AVG(B.timespent)

          Note: I used different split size for different queries so that the map runtime is around 1 minute (long enough to make time measurements more meaningful , but short enough for multiple runs). The number of maps varied from 9 - 20 in most cases. The same number of maps were used when same query was run with map-agg on/off settings.

          I also ran tests using input data compressed using bzip2, the performance numbers were different, although the trend was same. The reduction in number of map output records that would justify use of map partial aggregation was different - singlegroup_user.pig which showed 8% performance improvement with uncompressed data had 3.5% performance degradation with bzip2 input.

          Show
          Thejas M Nair added a comment - Here are results of the performance benchmark I ran. The summary: There is improvement of upto 50% in map run time with map partial aggregation with in-map partial aggregation. But if there is not enough reduction number of map output records, there is a performance degradation . It makes sense to turn off this feature in that case. The input data was generated from pigmix page_views data with 40M records so that the large map and bag fields which were not used the tests are not included. The query used to create the input data - A = load '/user/pig/tests/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user : chararray, action : int , timespent : int , query_term :chararray, ip_addr : chararray, timestamp : long ); store A into '/user/pig/tests/pigmix/page_views/pv_part'; The map average runtime printed at end of Query Change in number of records Map Input -> Map Output -> Reduce input recs map avergage runtime improvement wrt trunk group keys projection in foreach after group-by group on many columns (groupall.pig) 40M -> 39.98M -> 39.79M -14% user, action,query_term,ip_addr,timestamp user,action, qt,ip,ts, AVG(B.timespent); group on single column with high cardinality(singlegroup_user.pig) 40M records -> 12 M -> 6.2M 8% user user, AVG(B.timespent) group on single column with low cardinality (singlegroup_action.pig) 40M -> 38 -> 38 38.6% action action, AVG(B.timespent) group-all (group_allstar.pig) 40M -> 10 -> 10 52% 'all' MAX(B.timestamp), SUM(B.action), AVG(B.timespent) Note: I used different split size for different queries so that the map runtime is around 1 minute (long enough to make time measurements more meaningful , but short enough for multiple runs). The number of maps varied from 9 - 20 in most cases. The same number of maps were used when same query was run with map-agg on/off settings. I also ran tests using input data compressed using bzip2, the performance numbers were different, although the trend was same. The reduction in number of map output records that would justify use of map partial aggregation was different - singlegroup_user.pig which showed 8% performance improvement with uncompressed data had 3.5% performance degradation with bzip2 input.
          Hide
          Thejas M Nair added a comment -

          This patch adds auto-disabling of in-map partial aggregation if there is not sufficient reduction in map records. The threshold for minimum reduction that should happen is set using the property pig.exec.mapPartAgg.minReduction .

          With minimum auto disabling threshold set to 10 (ie 10 input map records to one output record), the queries which had performance degradation with this feature turned on now performed at par.

          For now, the feature is turned off by default, we can look into changing this after users get more time to try it out.

          Show
          Thejas M Nair added a comment - This patch adds auto-disabling of in-map partial aggregation if there is not sufficient reduction in map records. The threshold for minimum reduction that should happen is set using the property pig.exec.mapPartAgg.minReduction . With minimum auto disabling threshold set to 10 (ie 10 input map records to one output record), the queries which had performance degradation with this feature turned on now performed at par. For now, the feature is turned off by default, we can look into changing this after users get more time to try it out.
          Hide
          Thejas M Nair added a comment -

          PIG-2228.5.patch - added description of the two new properties introduced in this jira, in pig usage (output of '-h properties')

          Show
          Thejas M Nair added a comment - PIG-2228 .5.patch - added description of the two new properties introduced in this jira, in pig usage (output of '-h properties')
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1817/
          -----------------------------------------------------------

          Review request for pig, Daniel Dai and Dmitriy Ryaboy.

          Summary
          -------

          See PIG-2228

          This addresses bug PIG-2228.
          https://issues.apache.org/jira/browse/PIG-2228

          Diffs


          trunk/src/org/apache/pig/Algebraic.java 1164722
          trunk/src/org/apache/pig/Main.java 1164722
          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java 1164722
          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java 1164722
          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PhyPlanSetter.java 1164722
          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java 1164722
          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java 1164722
          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PlanPrinter.java 1164722
          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POLocalRearrange.java 1164722
          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java PRE-CREATION
          trunk/src/org/apache/pig/data/DefaultTuple.java 1164722
          trunk/src/org/apache/pig/data/InternalCachedBag.java 1164722
          trunk/src/org/apache/pig/data/InternalDistinctBag.java 1164722
          trunk/src/org/apache/pig/data/InternalSortedBag.java 1164722
          trunk/src/org/apache/pig/data/SelfSpillBag.java PRE-CREATION
          trunk/src/org/apache/pig/data/SizeUtil.java PRE-CREATION
          trunk/src/org/apache/pig/data/SortedSpillBag.java 1164722
          trunk/test/e2e/pig/tests/nightly.conf 1164722
          trunk/test/org/apache/pig/test/TestDataBag.java 1164722
          trunk/test/org/apache/pig/test/TestPOPartialAgg.java PRE-CREATION
          trunk/test/org/apache/pig/test/TestPOPartialAggPlan.java PRE-CREATION
          trunk/test/org/apache/pig/test/Util.java 1164722
          trunk/test/org/apache/pig/test/utils/GenPhyOp.java 1164722

          Diff: https://reviews.apache.org/r/1817/diff

          Testing
          -------

          test-patch
          [exec] -1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 21 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] -1 release audit. The applied patch generated 461 release audit warnings (more than the trunk's current 455 warnings).
          release audit failures are because of jdiff changes

          All unit tests pass, new e2e tests added .

          Thanks,

          Thejas

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1817/ ----------------------------------------------------------- Review request for pig, Daniel Dai and Dmitriy Ryaboy. Summary ------- See PIG-2228 This addresses bug PIG-2228 . https://issues.apache.org/jira/browse/PIG-2228 Diffs trunk/src/org/apache/pig/Algebraic.java 1164722 trunk/src/org/apache/pig/Main.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PhyPlanSetter.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PlanPrinter.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POLocalRearrange.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java PRE-CREATION trunk/src/org/apache/pig/data/DefaultTuple.java 1164722 trunk/src/org/apache/pig/data/InternalCachedBag.java 1164722 trunk/src/org/apache/pig/data/InternalDistinctBag.java 1164722 trunk/src/org/apache/pig/data/InternalSortedBag.java 1164722 trunk/src/org/apache/pig/data/SelfSpillBag.java PRE-CREATION trunk/src/org/apache/pig/data/SizeUtil.java PRE-CREATION trunk/src/org/apache/pig/data/SortedSpillBag.java 1164722 trunk/test/e2e/pig/tests/nightly.conf 1164722 trunk/test/org/apache/pig/test/TestDataBag.java 1164722 trunk/test/org/apache/pig/test/TestPOPartialAgg.java PRE-CREATION trunk/test/org/apache/pig/test/TestPOPartialAggPlan.java PRE-CREATION trunk/test/org/apache/pig/test/Util.java 1164722 trunk/test/org/apache/pig/test/utils/GenPhyOp.java 1164722 Diff: https://reviews.apache.org/r/1817/diff Testing ------- test-patch [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 21 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] -1 release audit. The applied patch generated 461 release audit warnings (more than the trunk's current 455 warnings). release audit failures are because of jdiff changes All unit tests pass, new e2e tests added . Thanks, Thejas
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1817/#review1868
          -----------------------------------------------------------

          trunk/src/org/apache/pig/Main.java
          <https://reviews.apache.org/r/1817/#comment4282>

          I would go 1 further and add this property to the default pig.properties, along with the rest listed here.

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java
          <https://reviews.apache.org/r/1817/#comment4284>

          there's a lot of errant whitespace around.. you can set your IDE to clean that up

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POLocalRearrange.java
          <https://reviews.apache.org/r/1817/#comment4285>

          Not sure about the value of this comment

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4286>

          Leaves, technically

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4287>

          This is an extremely long function. Please refactor into more manageable chunks. That truly helps others to read and follow the code.

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4288>

          It's worth noting how mapDumpIterator be not null, and disableMapAgg be true (I think this is the case when you auto-disable in-map aggregation having processed some of the input, correct?)

          It might be more readable to just handle the disableMapAgg case at the very top, and not have to check it afterwards:

          if (disableMapAgg) {
          if (mapDumpIterator != null)

          { // deal with iterator }

          else

          { return processInput(); }

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4290>

          extra newline

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4291>

          extra newlines

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4292>

          return POStatus.STATUS_EOP would be clearer and not require a comment

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4293>

          this should be a separate function

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4294>

          what's at valueTuple[0]?

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4296>

          catch a class cast exception and rethrow with a meaningful error? ("Intermediate Algebraic functions must implement EvalFunc<Tuple>") .

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4297>

          newline

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4298>

          is ignoring nulls really the right thing to do? what if my UDF is "COUNT_NULLS"? Do we need a magical NULL object for this, or does this not work elsewhere already and we don't need to worry about supporting it?

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4299>

          shouldn't this be output.returnStatus?

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4300>

          extra newline

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4302>

          kinda crammed in here

          if (aggMap.size() >= maxHashMapSize)

          { aaah room to breathe trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java <https://reviews.apache.org/r/1817/#comment4301> this is quite trivial using LinkedHashMap initialized with the three-arg constructor; a one-line change. Worth a benchmark run. trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java <https://reviews.apache.org/r/1817/#comment4303> }

          else {

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4304>

          Hortonworks must buy you guys really large screens, I am jealous.

          I'll stop mentioning the newlines now, this is probably starting to annoy you .

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4305>

          get().getInt("pig.exec.mapPartAgg.minReducton", 0);

          please make the property public static or something else that's referenceable / discoverable

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4306>

          default should be a constant at the top of the file

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4307>

          catch the classcastexception, rethrow with something more meaningful about Inermediate EvalFuncs being required to be EvalFunc<Tuple>

          trunk/src/org/apache/pig/data/SelfSpillBag.java
          <https://reviews.apache.org/r/1817/#comment4310>

          nice refactor

          trunk/src/org/apache/pig/data/SelfSpillBag.java
          <https://reviews.apache.org/r/1817/#comment4311>

          pig.cachedbag.memusage should be public static final

          trunk/test/org/apache/pig/test/TestPOPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4314>

          public static final reference...

          also, use setInt

          trunk/test/org/apache/pig/test/TestPOPartialAggPlan.java
          <https://reviews.apache.org/r/1817/#comment4312>

          this should be a public static final from the main body of code that you are referencing here

          setBoolean

          trunk/test/org/apache/pig/test/TestPOPartialAggPlan.java
          <https://reviews.apache.org/r/1817/#comment4313>

          p-s-f

          • Dmitriy

          On 2011-09-12 23:55:12, Thejas Nair wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/1817/

          -----------------------------------------------------------

          (Updated 2011-09-12 23:55:12)

          Review request for pig, Daniel Dai and Dmitriy Ryaboy.

          Summary

          -------

          See PIG-2228

          This addresses bug PIG-2228.

          https://issues.apache.org/jira/browse/PIG-2228

          Diffs

          -----

          trunk/src/org/apache/pig/Algebraic.java 1164722

          trunk/src/org/apache/pig/Main.java 1164722

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java 1164722

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java 1164722

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PhyPlanSetter.java 1164722

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java 1164722

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java 1164722

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PlanPrinter.java 1164722

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POLocalRearrange.java 1164722

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java PRE-CREATION

          trunk/src/org/apache/pig/data/DefaultTuple.java 1164722

          trunk/src/org/apache/pig/data/InternalCachedBag.java 1164722

          trunk/src/org/apache/pig/data/InternalDistinctBag.java 1164722

          trunk/src/org/apache/pig/data/InternalSortedBag.java 1164722

          trunk/src/org/apache/pig/data/SelfSpillBag.java PRE-CREATION

          trunk/src/org/apache/pig/data/SizeUtil.java PRE-CREATION

          trunk/src/org/apache/pig/data/SortedSpillBag.java 1164722

          trunk/test/e2e/pig/tests/nightly.conf 1164722

          trunk/test/org/apache/pig/test/TestDataBag.java 1164722

          trunk/test/org/apache/pig/test/TestPOPartialAgg.java PRE-CREATION

          trunk/test/org/apache/pig/test/TestPOPartialAggPlan.java PRE-CREATION

          trunk/test/org/apache/pig/test/Util.java 1164722

          trunk/test/org/apache/pig/test/utils/GenPhyOp.java 1164722

          Diff: https://reviews.apache.org/r/1817/diff

          Testing

          -------

          test-patch

          [exec] -1 overall.

          [exec]

          [exec] +1 @author. The patch does not contain any @author tags.

          [exec]

          [exec] +1 tests included. The patch appears to include 21 new or modified tests.

          [exec]

          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.

          [exec]

          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          [exec]

          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.

          [exec]

          [exec] -1 release audit. The applied patch generated 461 release audit warnings (more than the trunk's current 455 warnings).

          release audit failures are because of jdiff changes

          All unit tests pass, new e2e tests added .

          Thanks,

          Thejas

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1817/#review1868 ----------------------------------------------------------- trunk/src/org/apache/pig/Main.java < https://reviews.apache.org/r/1817/#comment4282 > I would go 1 further and add this property to the default pig.properties, along with the rest listed here. trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java < https://reviews.apache.org/r/1817/#comment4284 > there's a lot of errant whitespace around.. you can set your IDE to clean that up trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POLocalRearrange.java < https://reviews.apache.org/r/1817/#comment4285 > Not sure about the value of this comment trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4286 > Leaves, technically trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4287 > This is an extremely long function. Please refactor into more manageable chunks. That truly helps others to read and follow the code. trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4288 > It's worth noting how mapDumpIterator be not null, and disableMapAgg be true (I think this is the case when you auto-disable in-map aggregation having processed some of the input, correct?) It might be more readable to just handle the disableMapAgg case at the very top, and not have to check it afterwards: if (disableMapAgg) { if (mapDumpIterator != null) { // deal with iterator } else { return processInput(); } trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4290 > extra newline trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4291 > extra newlines trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4292 > return POStatus.STATUS_EOP would be clearer and not require a comment trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4293 > this should be a separate function trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4294 > what's at valueTuple [0] ? trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4296 > catch a class cast exception and rethrow with a meaningful error? ("Intermediate Algebraic functions must implement EvalFunc<Tuple>") . trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4297 > newline trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4298 > is ignoring nulls really the right thing to do? what if my UDF is "COUNT_NULLS"? Do we need a magical NULL object for this, or does this not work elsewhere already and we don't need to worry about supporting it? trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4299 > shouldn't this be output.returnStatus? trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4300 > extra newline trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4302 > kinda crammed in here if (aggMap.size() >= maxHashMapSize) { aaah room to breathe trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java <https://reviews.apache.org/r/1817/#comment4301> this is quite trivial using LinkedHashMap initialized with the three-arg constructor; a one-line change. Worth a benchmark run. trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java <https://reviews.apache.org/r/1817/#comment4303> } else { trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4304 > Hortonworks must buy you guys really large screens, I am jealous. I'll stop mentioning the newlines now, this is probably starting to annoy you . trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4305 > get().getInt("pig.exec.mapPartAgg.minReducton", 0); please make the property public static or something else that's referenceable / discoverable trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4306 > default should be a constant at the top of the file trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4307 > catch the classcastexception, rethrow with something more meaningful about Inermediate EvalFuncs being required to be EvalFunc<Tuple> trunk/src/org/apache/pig/data/SelfSpillBag.java < https://reviews.apache.org/r/1817/#comment4310 > nice refactor trunk/src/org/apache/pig/data/SelfSpillBag.java < https://reviews.apache.org/r/1817/#comment4311 > pig.cachedbag.memusage should be public static final trunk/test/org/apache/pig/test/TestPOPartialAgg.java < https://reviews.apache.org/r/1817/#comment4314 > public static final reference... also, use setInt trunk/test/org/apache/pig/test/TestPOPartialAggPlan.java < https://reviews.apache.org/r/1817/#comment4312 > this should be a public static final from the main body of code that you are referencing here setBoolean trunk/test/org/apache/pig/test/TestPOPartialAggPlan.java < https://reviews.apache.org/r/1817/#comment4313 > p-s-f Dmitriy On 2011-09-12 23:55:12, Thejas Nair wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1817/ ----------------------------------------------------------- (Updated 2011-09-12 23:55:12) Review request for pig, Daniel Dai and Dmitriy Ryaboy. Summary ------- See PIG-2228 This addresses bug PIG-2228 . https://issues.apache.org/jira/browse/PIG-2228 Diffs ----- trunk/src/org/apache/pig/Algebraic.java 1164722 trunk/src/org/apache/pig/Main.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PhyPlanSetter.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PlanPrinter.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POLocalRearrange.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java PRE-CREATION trunk/src/org/apache/pig/data/DefaultTuple.java 1164722 trunk/src/org/apache/pig/data/InternalCachedBag.java 1164722 trunk/src/org/apache/pig/data/InternalDistinctBag.java 1164722 trunk/src/org/apache/pig/data/InternalSortedBag.java 1164722 trunk/src/org/apache/pig/data/SelfSpillBag.java PRE-CREATION trunk/src/org/apache/pig/data/SizeUtil.java PRE-CREATION trunk/src/org/apache/pig/data/SortedSpillBag.java 1164722 trunk/test/e2e/pig/tests/nightly.conf 1164722 trunk/test/org/apache/pig/test/TestDataBag.java 1164722 trunk/test/org/apache/pig/test/TestPOPartialAgg.java PRE-CREATION trunk/test/org/apache/pig/test/TestPOPartialAggPlan.java PRE-CREATION trunk/test/org/apache/pig/test/Util.java 1164722 trunk/test/org/apache/pig/test/utils/GenPhyOp.java 1164722 Diff: https://reviews.apache.org/r/1817/diff Testing ------- test-patch [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 21 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] -1 release audit. The applied patch generated 461 release audit warnings (more than the trunk's current 455 warnings). release audit failures are because of jdiff changes All unit tests pass, new e2e tests added . Thanks, Thejas
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1817/#review1877
          -----------------------------------------------------------

          trunk/src/org/apache/pig/Main.java
          <https://reviews.apache.org/r/1817/#comment4318>

          will do. I will set the properties to the default value (if any), and ask the user to use "-h properties" command for description of the properties. (I don't want to increase the sources of truth for the details of the properties).

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java
          <https://reviews.apache.org/r/1817/#comment4319>

          Will do. For future - it might be a good idea to generate these settings for pig code conventions as well with the eclipse-files target.

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4320>

          fixing!

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4336>

          will do some refactoring.

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4322>

          Yes, the mapDumpIteral will be non null if there were accumulated entries in the map.

          Will move the disableMapAgg out to improve readability.

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4324>

          The return type is a Result object, STATUS_EOP is a byte. I don't want to create a new object. It might be useful to create 'static final' Result objects for these cases to improve readability. (Not doing that as part of this patch.)

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4325>

          described in comment where valueTuple is defined.
          // tuple of the format - (null(key),bag-val1,bag-val2,...)
          // attach this to the plans with algebraic udf before evaluating the plans
          private transient Tuple valueTuple = null;

          The combiner plan is used by POPartialAgg without resetting the input offsets, that is why the first field has the key.

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4326>

          changing. I thought a check was being done at pig query compile time, but apparently thats not the case. Created- PIG-2283

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4327>

          existingResult is the tuple that holds key and value(s), if the key is present in aggMap, it will never be null. I didn't want to make two hashmap function calls (.contains() + .get()).
          I am adding a comment to clarify.

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4328>

          this is the output of partial aggregation. it is the result of getOutput() function, which returns only ERR_RESULT or Result with STATUS_OK. Getting a status such as STATUS_NULL ,STATUS_EOP or STATUS_BATCH_OK does not make sense, and should result in an error (the query fails).

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4329>

          changing. i have re-formatted this whole file using eclipse, and its fixed all this.

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4330>

          Thanks for the suggesting it, that seems to be a good candidate. But I think that it is something that is better explored in a separate jira.
          It certainly needs a benchmark run. It might be important to select more 'realistic' data to see how useful LRU/LFU replacement is going to be. I am not sure if pigmix is representative of data where LRU/LFU will be useful.

          The memory footprint of LinkedHashMap would be higher. The memory usage estimation logic also would need to change as it is based on sizes for HashMap.

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4331>

          yeah, we do have large screens with awesome resolution
          But I will fix the excess newlines for the less fortunate

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4332>

          will create a public static final variable in POPartAgg for the property. But I think it would be good to define it in some central place with other properties. But we don't have such a place now.
          Will also use getInt.

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4333>

          changing.

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4334>

          the cast here is actually unnecessary, removing it.

          • Thejas

          On 2011-09-12 23:55:12, Thejas Nair wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/1817/

          -----------------------------------------------------------

          (Updated 2011-09-12 23:55:12)

          Review request for pig, Daniel Dai and Dmitriy Ryaboy.

          Summary

          -------

          See PIG-2228

          This addresses bug PIG-2228.

          https://issues.apache.org/jira/browse/PIG-2228

          Diffs

          -----

          trunk/src/org/apache/pig/Algebraic.java 1164722

          trunk/src/org/apache/pig/Main.java 1164722

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java 1164722

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java 1164722

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PhyPlanSetter.java 1164722

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java 1164722

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java 1164722

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PlanPrinter.java 1164722

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POLocalRearrange.java 1164722

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java PRE-CREATION

          trunk/src/org/apache/pig/data/DefaultTuple.java 1164722

          trunk/src/org/apache/pig/data/InternalCachedBag.java 1164722

          trunk/src/org/apache/pig/data/InternalDistinctBag.java 1164722

          trunk/src/org/apache/pig/data/InternalSortedBag.java 1164722

          trunk/src/org/apache/pig/data/SelfSpillBag.java PRE-CREATION

          trunk/src/org/apache/pig/data/SizeUtil.java PRE-CREATION

          trunk/src/org/apache/pig/data/SortedSpillBag.java 1164722

          trunk/test/e2e/pig/tests/nightly.conf 1164722

          trunk/test/org/apache/pig/test/TestDataBag.java 1164722

          trunk/test/org/apache/pig/test/TestPOPartialAgg.java PRE-CREATION

          trunk/test/org/apache/pig/test/TestPOPartialAggPlan.java PRE-CREATION

          trunk/test/org/apache/pig/test/Util.java 1164722

          trunk/test/org/apache/pig/test/utils/GenPhyOp.java 1164722

          Diff: https://reviews.apache.org/r/1817/diff

          Testing

          -------

          test-patch

          [exec] -1 overall.

          [exec]

          [exec] +1 @author. The patch does not contain any @author tags.

          [exec]

          [exec] +1 tests included. The patch appears to include 21 new or modified tests.

          [exec]

          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.

          [exec]

          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          [exec]

          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.

          [exec]

          [exec] -1 release audit. The applied patch generated 461 release audit warnings (more than the trunk's current 455 warnings).

          release audit failures are because of jdiff changes

          All unit tests pass, new e2e tests added .

          Thanks,

          Thejas

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1817/#review1877 ----------------------------------------------------------- trunk/src/org/apache/pig/Main.java < https://reviews.apache.org/r/1817/#comment4318 > will do. I will set the properties to the default value (if any), and ask the user to use "-h properties" command for description of the properties. (I don't want to increase the sources of truth for the details of the properties). trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java < https://reviews.apache.org/r/1817/#comment4319 > Will do. For future - it might be a good idea to generate these settings for pig code conventions as well with the eclipse-files target. trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4320 > fixing! trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4336 > will do some refactoring. trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4322 > Yes, the mapDumpIteral will be non null if there were accumulated entries in the map. Will move the disableMapAgg out to improve readability. trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4324 > The return type is a Result object, STATUS_EOP is a byte. I don't want to create a new object. It might be useful to create 'static final' Result objects for these cases to improve readability. (Not doing that as part of this patch.) trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4325 > described in comment where valueTuple is defined. // tuple of the format - (null(key),bag-val1,bag-val2,...) // attach this to the plans with algebraic udf before evaluating the plans private transient Tuple valueTuple = null; The combiner plan is used by POPartialAgg without resetting the input offsets, that is why the first field has the key. trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4326 > changing. I thought a check was being done at pig query compile time, but apparently thats not the case. Created- PIG-2283 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4327 > existingResult is the tuple that holds key and value(s), if the key is present in aggMap, it will never be null. I didn't want to make two hashmap function calls (.contains() + .get()). I am adding a comment to clarify. trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4328 > this is the output of partial aggregation. it is the result of getOutput() function, which returns only ERR_RESULT or Result with STATUS_OK. Getting a status such as STATUS_NULL ,STATUS_EOP or STATUS_BATCH_OK does not make sense, and should result in an error (the query fails). trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4329 > changing. i have re-formatted this whole file using eclipse, and its fixed all this. trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4330 > Thanks for the suggesting it, that seems to be a good candidate. But I think that it is something that is better explored in a separate jira. It certainly needs a benchmark run. It might be important to select more 'realistic' data to see how useful LRU/LFU replacement is going to be. I am not sure if pigmix is representative of data where LRU/LFU will be useful. The memory footprint of LinkedHashMap would be higher. The memory usage estimation logic also would need to change as it is based on sizes for HashMap. trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4331 > yeah, we do have large screens with awesome resolution But I will fix the excess newlines for the less fortunate trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4332 > will create a public static final variable in POPartAgg for the property. But I think it would be good to define it in some central place with other properties. But we don't have such a place now. Will also use getInt. trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4333 > changing. trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4334 > the cast here is actually unnecessary, removing it. Thejas On 2011-09-12 23:55:12, Thejas Nair wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1817/ ----------------------------------------------------------- (Updated 2011-09-12 23:55:12) Review request for pig, Daniel Dai and Dmitriy Ryaboy. Summary ------- See PIG-2228 This addresses bug PIG-2228 . https://issues.apache.org/jira/browse/PIG-2228 Diffs ----- trunk/src/org/apache/pig/Algebraic.java 1164722 trunk/src/org/apache/pig/Main.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PhyPlanSetter.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PlanPrinter.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POLocalRearrange.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java PRE-CREATION trunk/src/org/apache/pig/data/DefaultTuple.java 1164722 trunk/src/org/apache/pig/data/InternalCachedBag.java 1164722 trunk/src/org/apache/pig/data/InternalDistinctBag.java 1164722 trunk/src/org/apache/pig/data/InternalSortedBag.java 1164722 trunk/src/org/apache/pig/data/SelfSpillBag.java PRE-CREATION trunk/src/org/apache/pig/data/SizeUtil.java PRE-CREATION trunk/src/org/apache/pig/data/SortedSpillBag.java 1164722 trunk/test/e2e/pig/tests/nightly.conf 1164722 trunk/test/org/apache/pig/test/TestDataBag.java 1164722 trunk/test/org/apache/pig/test/TestPOPartialAgg.java PRE-CREATION trunk/test/org/apache/pig/test/TestPOPartialAggPlan.java PRE-CREATION trunk/test/org/apache/pig/test/Util.java 1164722 trunk/test/org/apache/pig/test/utils/GenPhyOp.java 1164722 Diff: https://reviews.apache.org/r/1817/diff Testing ------- test-patch [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 21 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] -1 release audit. The applied patch generated 461 release audit warnings (more than the trunk's current 455 warnings). release audit failures are because of jdiff changes All unit tests pass, new e2e tests added . Thanks, Thejas
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1817/#review1894
          -----------------------------------------------------------

          Patch looks pretty good.

          We also need to collect this feature in the stats (PIG_FEATURE). We can open a separate Jira for that.

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4375>

          Do you mean NUM_INPRECS_TO_SAMPLE_SZ_REDUCTION?

          • Daniel

          On 2011-09-12 23:55:12, Thejas Nair wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/1817/

          -----------------------------------------------------------

          (Updated 2011-09-12 23:55:12)

          Review request for pig, Daniel Dai and Dmitriy Ryaboy.

          Summary

          -------

          See PIG-2228

          This addresses bug PIG-2228.

          https://issues.apache.org/jira/browse/PIG-2228

          Diffs

          -----

          trunk/src/org/apache/pig/Algebraic.java 1164722

          trunk/src/org/apache/pig/Main.java 1164722

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java 1164722

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java 1164722

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PhyPlanSetter.java 1164722

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java 1164722

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java 1164722

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PlanPrinter.java 1164722

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POLocalRearrange.java 1164722

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java PRE-CREATION

          trunk/src/org/apache/pig/data/DefaultTuple.java 1164722

          trunk/src/org/apache/pig/data/InternalCachedBag.java 1164722

          trunk/src/org/apache/pig/data/InternalDistinctBag.java 1164722

          trunk/src/org/apache/pig/data/InternalSortedBag.java 1164722

          trunk/src/org/apache/pig/data/SelfSpillBag.java PRE-CREATION

          trunk/src/org/apache/pig/data/SizeUtil.java PRE-CREATION

          trunk/src/org/apache/pig/data/SortedSpillBag.java 1164722

          trunk/test/e2e/pig/tests/nightly.conf 1164722

          trunk/test/org/apache/pig/test/TestDataBag.java 1164722

          trunk/test/org/apache/pig/test/TestPOPartialAgg.java PRE-CREATION

          trunk/test/org/apache/pig/test/TestPOPartialAggPlan.java PRE-CREATION

          trunk/test/org/apache/pig/test/Util.java 1164722

          trunk/test/org/apache/pig/test/utils/GenPhyOp.java 1164722

          Diff: https://reviews.apache.org/r/1817/diff

          Testing

          -------

          test-patch

          [exec] -1 overall.

          [exec]

          [exec] +1 @author. The patch does not contain any @author tags.

          [exec]

          [exec] +1 tests included. The patch appears to include 21 new or modified tests.

          [exec]

          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.

          [exec]

          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          [exec]

          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.

          [exec]

          [exec] -1 release audit. The applied patch generated 461 release audit warnings (more than the trunk's current 455 warnings).

          release audit failures are because of jdiff changes

          All unit tests pass, new e2e tests added .

          Thanks,

          Thejas

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1817/#review1894 ----------------------------------------------------------- Patch looks pretty good. We also need to collect this feature in the stats (PIG_FEATURE). We can open a separate Jira for that. trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4375 > Do you mean NUM_INPRECS_TO_SAMPLE_SZ_REDUCTION? Daniel On 2011-09-12 23:55:12, Thejas Nair wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1817/ ----------------------------------------------------------- (Updated 2011-09-12 23:55:12) Review request for pig, Daniel Dai and Dmitriy Ryaboy. Summary ------- See PIG-2228 This addresses bug PIG-2228 . https://issues.apache.org/jira/browse/PIG-2228 Diffs ----- trunk/src/org/apache/pig/Algebraic.java 1164722 trunk/src/org/apache/pig/Main.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PhyPlanSetter.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PlanPrinter.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POLocalRearrange.java 1164722 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java PRE-CREATION trunk/src/org/apache/pig/data/DefaultTuple.java 1164722 trunk/src/org/apache/pig/data/InternalCachedBag.java 1164722 trunk/src/org/apache/pig/data/InternalDistinctBag.java 1164722 trunk/src/org/apache/pig/data/InternalSortedBag.java 1164722 trunk/src/org/apache/pig/data/SelfSpillBag.java PRE-CREATION trunk/src/org/apache/pig/data/SizeUtil.java PRE-CREATION trunk/src/org/apache/pig/data/SortedSpillBag.java 1164722 trunk/test/e2e/pig/tests/nightly.conf 1164722 trunk/test/org/apache/pig/test/TestDataBag.java 1164722 trunk/test/org/apache/pig/test/TestPOPartialAgg.java PRE-CREATION trunk/test/org/apache/pig/test/TestPOPartialAggPlan.java PRE-CREATION trunk/test/org/apache/pig/test/Util.java 1164722 trunk/test/org/apache/pig/test/utils/GenPhyOp.java 1164722 Diff: https://reviews.apache.org/r/1817/diff Testing ------- test-patch [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 21 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] -1 release audit. The applied patch generated 461 release audit warnings (more than the trunk's current 455 warnings). release audit failures are because of jdiff changes All unit tests pass, new e2e tests added . Thanks, Thejas
          Hide
          Thejas M Nair added a comment -

          PIG-2228.6.patch - address comments from code review.

          Show
          Thejas M Nair added a comment - PIG-2228 .6.patch - address comments from code review.
          Hide
          Thejas M Nair added a comment -

          note: POPartialAgg will have less overhead when the records with same group-by key values appear next to each other, for example, when the data is sorted. In such cases, there will be fewer HashMap operations.

          Show
          Thejas M Nair added a comment - note: POPartialAgg will have less overhead when the records with same group-by key values appear next to each other, for example, when the data is sorted. In such cases, there will be fewer HashMap operations.
          Hide
          Daniel Dai added a comment -

          +1 for PIG-2228.6.patch

          Show
          Daniel Dai added a comment - +1 for PIG-2228 .6.patch
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1817/
          -----------------------------------------------------------

          (Updated 2011-09-15 17:27:08.965065)

          Review request for pig, Daniel Dai and Dmitriy Ryaboy.

          Changes
          -------

          New patch addressing code review comments.

          Summary
          -------

          See PIG-2228

          This addresses bug PIG-2228.
          https://issues.apache.org/jira/browse/PIG-2228

          Diffs (updated)


          trunk/conf/pig.properties 1170885
          trunk/src/org/apache/pig/Algebraic.java 1170885
          trunk/src/org/apache/pig/Main.java 1170885
          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java 1170885
          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java 1170885
          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PhyPlanSetter.java 1170885
          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java 1170885
          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java 1170885
          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PlanPrinter.java 1170885
          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POLocalRearrange.java 1170885
          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java PRE-CREATION
          trunk/src/org/apache/pig/data/DefaultTuple.java 1170885
          trunk/src/org/apache/pig/data/InternalCachedBag.java 1170885
          trunk/src/org/apache/pig/data/InternalDistinctBag.java 1170885
          trunk/src/org/apache/pig/data/InternalSortedBag.java 1170885
          trunk/src/org/apache/pig/data/SelfSpillBag.java PRE-CREATION
          trunk/src/org/apache/pig/data/SizeUtil.java PRE-CREATION
          trunk/src/org/apache/pig/data/SortedSpillBag.java 1170885
          trunk/src/org/apache/pig/tools/pigstats/ScriptState.java 1170885
          trunk/test/e2e/pig/tests/nightly.conf 1170885
          trunk/test/org/apache/pig/test/TestDataBag.java 1170885
          trunk/test/org/apache/pig/test/TestPOPartialAgg.java PRE-CREATION
          trunk/test/org/apache/pig/test/TestPOPartialAggPlan.java PRE-CREATION
          trunk/test/org/apache/pig/test/Util.java 1170885
          trunk/test/org/apache/pig/test/utils/GenPhyOp.java 1170885

          Diff: https://reviews.apache.org/r/1817/diff

          Testing
          -------

          test-patch
          [exec] -1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 21 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] -1 release audit. The applied patch generated 461 release audit warnings (more than the trunk's current 455 warnings).
          release audit failures are because of jdiff changes

          All unit tests pass, new e2e tests added .

          Thanks,

          Thejas

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1817/ ----------------------------------------------------------- (Updated 2011-09-15 17:27:08.965065) Review request for pig, Daniel Dai and Dmitriy Ryaboy. Changes ------- New patch addressing code review comments. Summary ------- See PIG-2228 This addresses bug PIG-2228 . https://issues.apache.org/jira/browse/PIG-2228 Diffs (updated) trunk/conf/pig.properties 1170885 trunk/src/org/apache/pig/Algebraic.java 1170885 trunk/src/org/apache/pig/Main.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PhyPlanSetter.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PlanPrinter.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POLocalRearrange.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java PRE-CREATION trunk/src/org/apache/pig/data/DefaultTuple.java 1170885 trunk/src/org/apache/pig/data/InternalCachedBag.java 1170885 trunk/src/org/apache/pig/data/InternalDistinctBag.java 1170885 trunk/src/org/apache/pig/data/InternalSortedBag.java 1170885 trunk/src/org/apache/pig/data/SelfSpillBag.java PRE-CREATION trunk/src/org/apache/pig/data/SizeUtil.java PRE-CREATION trunk/src/org/apache/pig/data/SortedSpillBag.java 1170885 trunk/src/org/apache/pig/tools/pigstats/ScriptState.java 1170885 trunk/test/e2e/pig/tests/nightly.conf 1170885 trunk/test/org/apache/pig/test/TestDataBag.java 1170885 trunk/test/org/apache/pig/test/TestPOPartialAgg.java PRE-CREATION trunk/test/org/apache/pig/test/TestPOPartialAggPlan.java PRE-CREATION trunk/test/org/apache/pig/test/Util.java 1170885 trunk/test/org/apache/pig/test/utils/GenPhyOp.java 1170885 Diff: https://reviews.apache.org/r/1817/diff Testing ------- test-patch [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 21 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] -1 release audit. The applied patch generated 461 release audit warnings (more than the trunk's current 455 warnings). release audit failures are because of jdiff changes All unit tests pass, new e2e tests added . Thanks, Thejas
          Hide
          jiraposter@reviews.apache.org added a comment -

          On 2011-09-14 18:00:38, Daniel Dai wrote:

          > Patch looks pretty good.

          >

          > We also need to collect this feature in the stats (PIG_FEATURE). We can open a separate Jira for that.

          Created PIG-2285.

          On 2011-09-14 18:00:38, Daniel Dai wrote:

          > trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java, line 203

          > <https://reviews.apache.org/r/1817/diff/1/?file=40194#file40194line203>

          >

          > Do you mean NUM_INPRECS_TO_SAMPLE_SZ_REDUCTION?

          yes, fixing the comment.

          • Thejas

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1817/#review1894
          -----------------------------------------------------------

          On 2011-09-15 17:27:08, Thejas Nair wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/1817/

          -----------------------------------------------------------

          (Updated 2011-09-15 17:27:08)

          Review request for pig, Daniel Dai and Dmitriy Ryaboy.

          Summary

          -------

          See PIG-2228

          This addresses bug PIG-2228.

          https://issues.apache.org/jira/browse/PIG-2228

          Diffs

          -----

          trunk/conf/pig.properties 1170885

          trunk/src/org/apache/pig/Algebraic.java 1170885

          trunk/src/org/apache/pig/Main.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PhyPlanSetter.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PlanPrinter.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POLocalRearrange.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java PRE-CREATION

          trunk/src/org/apache/pig/data/DefaultTuple.java 1170885

          trunk/src/org/apache/pig/data/InternalCachedBag.java 1170885

          trunk/src/org/apache/pig/data/InternalDistinctBag.java 1170885

          trunk/src/org/apache/pig/data/InternalSortedBag.java 1170885

          trunk/src/org/apache/pig/data/SelfSpillBag.java PRE-CREATION

          trunk/src/org/apache/pig/data/SizeUtil.java PRE-CREATION

          trunk/src/org/apache/pig/data/SortedSpillBag.java 1170885

          trunk/src/org/apache/pig/tools/pigstats/ScriptState.java 1170885

          trunk/test/e2e/pig/tests/nightly.conf 1170885

          trunk/test/org/apache/pig/test/TestDataBag.java 1170885

          trunk/test/org/apache/pig/test/TestPOPartialAgg.java PRE-CREATION

          trunk/test/org/apache/pig/test/TestPOPartialAggPlan.java PRE-CREATION

          trunk/test/org/apache/pig/test/Util.java 1170885

          trunk/test/org/apache/pig/test/utils/GenPhyOp.java 1170885

          Diff: https://reviews.apache.org/r/1817/diff

          Testing

          -------

          test-patch

          [exec] -1 overall.

          [exec]

          [exec] +1 @author. The patch does not contain any @author tags.

          [exec]

          [exec] +1 tests included. The patch appears to include 21 new or modified tests.

          [exec]

          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.

          [exec]

          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          [exec]

          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.

          [exec]

          [exec] -1 release audit. The applied patch generated 461 release audit warnings (more than the trunk's current 455 warnings).

          release audit failures are because of jdiff changes

          All unit tests pass, new e2e tests added .

          Thanks,

          Thejas

          Show
          jiraposter@reviews.apache.org added a comment - On 2011-09-14 18:00:38, Daniel Dai wrote: > Patch looks pretty good. > > We also need to collect this feature in the stats (PIG_FEATURE). We can open a separate Jira for that. Created PIG-2285 . On 2011-09-14 18:00:38, Daniel Dai wrote: > trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java, line 203 > < https://reviews.apache.org/r/1817/diff/1/?file=40194#file40194line203 > > > Do you mean NUM_INPRECS_TO_SAMPLE_SZ_REDUCTION? yes, fixing the comment. Thejas ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1817/#review1894 ----------------------------------------------------------- On 2011-09-15 17:27:08, Thejas Nair wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1817/ ----------------------------------------------------------- (Updated 2011-09-15 17:27:08) Review request for pig, Daniel Dai and Dmitriy Ryaboy. Summary ------- See PIG-2228 This addresses bug PIG-2228 . https://issues.apache.org/jira/browse/PIG-2228 Diffs ----- trunk/conf/pig.properties 1170885 trunk/src/org/apache/pig/Algebraic.java 1170885 trunk/src/org/apache/pig/Main.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PhyPlanSetter.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PlanPrinter.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POLocalRearrange.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java PRE-CREATION trunk/src/org/apache/pig/data/DefaultTuple.java 1170885 trunk/src/org/apache/pig/data/InternalCachedBag.java 1170885 trunk/src/org/apache/pig/data/InternalDistinctBag.java 1170885 trunk/src/org/apache/pig/data/InternalSortedBag.java 1170885 trunk/src/org/apache/pig/data/SelfSpillBag.java PRE-CREATION trunk/src/org/apache/pig/data/SizeUtil.java PRE-CREATION trunk/src/org/apache/pig/data/SortedSpillBag.java 1170885 trunk/src/org/apache/pig/tools/pigstats/ScriptState.java 1170885 trunk/test/e2e/pig/tests/nightly.conf 1170885 trunk/test/org/apache/pig/test/TestDataBag.java 1170885 trunk/test/org/apache/pig/test/TestPOPartialAgg.java PRE-CREATION trunk/test/org/apache/pig/test/TestPOPartialAggPlan.java PRE-CREATION trunk/test/org/apache/pig/test/Util.java 1170885 trunk/test/org/apache/pig/test/utils/GenPhyOp.java 1170885 Diff: https://reviews.apache.org/r/1817/diff Testing ------- test-patch [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 21 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] -1 release audit. The applied patch generated 461 release audit warnings (more than the trunk's current 455 warnings). release audit failures are because of jdiff changes All unit tests pass, new e2e tests added . Thanks, Thejas
          Hide
          jiraposter@reviews.apache.org added a comment -

          On 2011-09-13 09:15:46, Dmitriy Ryaboy wrote:

          > trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POLocalRearrange.java, line 296

          > <https://reviews.apache.org/r/1817/diff/1/?file=40193#file40193line296>

          >

          > Not sure about the value of this comment

          cleaning that

          • Thejas

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1817/#review1868
          -----------------------------------------------------------

          On 2011-09-15 17:27:08, Thejas Nair wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/1817/

          -----------------------------------------------------------

          (Updated 2011-09-15 17:27:08)

          Review request for pig, Daniel Dai and Dmitriy Ryaboy.

          Summary

          -------

          See PIG-2228

          This addresses bug PIG-2228.

          https://issues.apache.org/jira/browse/PIG-2228

          Diffs

          -----

          trunk/conf/pig.properties 1170885

          trunk/src/org/apache/pig/Algebraic.java 1170885

          trunk/src/org/apache/pig/Main.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PhyPlanSetter.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PlanPrinter.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POLocalRearrange.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java PRE-CREATION

          trunk/src/org/apache/pig/data/DefaultTuple.java 1170885

          trunk/src/org/apache/pig/data/InternalCachedBag.java 1170885

          trunk/src/org/apache/pig/data/InternalDistinctBag.java 1170885

          trunk/src/org/apache/pig/data/InternalSortedBag.java 1170885

          trunk/src/org/apache/pig/data/SelfSpillBag.java PRE-CREATION

          trunk/src/org/apache/pig/data/SizeUtil.java PRE-CREATION

          trunk/src/org/apache/pig/data/SortedSpillBag.java 1170885

          trunk/src/org/apache/pig/tools/pigstats/ScriptState.java 1170885

          trunk/test/e2e/pig/tests/nightly.conf 1170885

          trunk/test/org/apache/pig/test/TestDataBag.java 1170885

          trunk/test/org/apache/pig/test/TestPOPartialAgg.java PRE-CREATION

          trunk/test/org/apache/pig/test/TestPOPartialAggPlan.java PRE-CREATION

          trunk/test/org/apache/pig/test/Util.java 1170885

          trunk/test/org/apache/pig/test/utils/GenPhyOp.java 1170885

          Diff: https://reviews.apache.org/r/1817/diff

          Testing

          -------

          test-patch

          [exec] -1 overall.

          [exec]

          [exec] +1 @author. The patch does not contain any @author tags.

          [exec]

          [exec] +1 tests included. The patch appears to include 21 new or modified tests.

          [exec]

          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.

          [exec]

          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          [exec]

          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.

          [exec]

          [exec] -1 release audit. The applied patch generated 461 release audit warnings (more than the trunk's current 455 warnings).

          release audit failures are because of jdiff changes

          All unit tests pass, new e2e tests added .

          Thanks,

          Thejas

          Show
          jiraposter@reviews.apache.org added a comment - On 2011-09-13 09:15:46, Dmitriy Ryaboy wrote: > trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POLocalRearrange.java, line 296 > < https://reviews.apache.org/r/1817/diff/1/?file=40193#file40193line296 > > > Not sure about the value of this comment cleaning that Thejas ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1817/#review1868 ----------------------------------------------------------- On 2011-09-15 17:27:08, Thejas Nair wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1817/ ----------------------------------------------------------- (Updated 2011-09-15 17:27:08) Review request for pig, Daniel Dai and Dmitriy Ryaboy. Summary ------- See PIG-2228 This addresses bug PIG-2228 . https://issues.apache.org/jira/browse/PIG-2228 Diffs ----- trunk/conf/pig.properties 1170885 trunk/src/org/apache/pig/Algebraic.java 1170885 trunk/src/org/apache/pig/Main.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PhyPlanSetter.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PlanPrinter.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POLocalRearrange.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java PRE-CREATION trunk/src/org/apache/pig/data/DefaultTuple.java 1170885 trunk/src/org/apache/pig/data/InternalCachedBag.java 1170885 trunk/src/org/apache/pig/data/InternalDistinctBag.java 1170885 trunk/src/org/apache/pig/data/InternalSortedBag.java 1170885 trunk/src/org/apache/pig/data/SelfSpillBag.java PRE-CREATION trunk/src/org/apache/pig/data/SizeUtil.java PRE-CREATION trunk/src/org/apache/pig/data/SortedSpillBag.java 1170885 trunk/src/org/apache/pig/tools/pigstats/ScriptState.java 1170885 trunk/test/e2e/pig/tests/nightly.conf 1170885 trunk/test/org/apache/pig/test/TestDataBag.java 1170885 trunk/test/org/apache/pig/test/TestPOPartialAgg.java PRE-CREATION trunk/test/org/apache/pig/test/TestPOPartialAggPlan.java PRE-CREATION trunk/test/org/apache/pig/test/Util.java 1170885 trunk/test/org/apache/pig/test/utils/GenPhyOp.java 1170885 Diff: https://reviews.apache.org/r/1817/diff Testing ------- test-patch [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 21 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] -1 release audit. The applied patch generated 461 release audit warnings (more than the trunk's current 455 warnings). release audit failures are because of jdiff changes All unit tests pass, new e2e tests added . Thanks, Thejas
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1817/#review1914
          -----------------------------------------------------------

          Ship it!

          • Daniel

          On 2011-09-15 17:27:08, Thejas Nair wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/1817/

          -----------------------------------------------------------

          (Updated 2011-09-15 17:27:08)

          Review request for pig, Daniel Dai and Dmitriy Ryaboy.

          Summary

          -------

          See PIG-2228

          This addresses bug PIG-2228.

          https://issues.apache.org/jira/browse/PIG-2228

          Diffs

          -----

          trunk/conf/pig.properties 1170885

          trunk/src/org/apache/pig/Algebraic.java 1170885

          trunk/src/org/apache/pig/Main.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PhyPlanSetter.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PlanPrinter.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POLocalRearrange.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java PRE-CREATION

          trunk/src/org/apache/pig/data/DefaultTuple.java 1170885

          trunk/src/org/apache/pig/data/InternalCachedBag.java 1170885

          trunk/src/org/apache/pig/data/InternalDistinctBag.java 1170885

          trunk/src/org/apache/pig/data/InternalSortedBag.java 1170885

          trunk/src/org/apache/pig/data/SelfSpillBag.java PRE-CREATION

          trunk/src/org/apache/pig/data/SizeUtil.java PRE-CREATION

          trunk/src/org/apache/pig/data/SortedSpillBag.java 1170885

          trunk/src/org/apache/pig/tools/pigstats/ScriptState.java 1170885

          trunk/test/e2e/pig/tests/nightly.conf 1170885

          trunk/test/org/apache/pig/test/TestDataBag.java 1170885

          trunk/test/org/apache/pig/test/TestPOPartialAgg.java PRE-CREATION

          trunk/test/org/apache/pig/test/TestPOPartialAggPlan.java PRE-CREATION

          trunk/test/org/apache/pig/test/Util.java 1170885

          trunk/test/org/apache/pig/test/utils/GenPhyOp.java 1170885

          Diff: https://reviews.apache.org/r/1817/diff

          Testing

          -------

          test-patch

          [exec] -1 overall.

          [exec]

          [exec] +1 @author. The patch does not contain any @author tags.

          [exec]

          [exec] +1 tests included. The patch appears to include 21 new or modified tests.

          [exec]

          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.

          [exec]

          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          [exec]

          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.

          [exec]

          [exec] -1 release audit. The applied patch generated 461 release audit warnings (more than the trunk's current 455 warnings).

          release audit failures are because of jdiff changes

          All unit tests pass, new e2e tests added .

          Thanks,

          Thejas

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1817/#review1914 ----------------------------------------------------------- Ship it! Daniel On 2011-09-15 17:27:08, Thejas Nair wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1817/ ----------------------------------------------------------- (Updated 2011-09-15 17:27:08) Review request for pig, Daniel Dai and Dmitriy Ryaboy. Summary ------- See PIG-2228 This addresses bug PIG-2228 . https://issues.apache.org/jira/browse/PIG-2228 Diffs ----- trunk/conf/pig.properties 1170885 trunk/src/org/apache/pig/Algebraic.java 1170885 trunk/src/org/apache/pig/Main.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PhyPlanSetter.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PlanPrinter.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POLocalRearrange.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java PRE-CREATION trunk/src/org/apache/pig/data/DefaultTuple.java 1170885 trunk/src/org/apache/pig/data/InternalCachedBag.java 1170885 trunk/src/org/apache/pig/data/InternalDistinctBag.java 1170885 trunk/src/org/apache/pig/data/InternalSortedBag.java 1170885 trunk/src/org/apache/pig/data/SelfSpillBag.java PRE-CREATION trunk/src/org/apache/pig/data/SizeUtil.java PRE-CREATION trunk/src/org/apache/pig/data/SortedSpillBag.java 1170885 trunk/src/org/apache/pig/tools/pigstats/ScriptState.java 1170885 trunk/test/e2e/pig/tests/nightly.conf 1170885 trunk/test/org/apache/pig/test/TestDataBag.java 1170885 trunk/test/org/apache/pig/test/TestPOPartialAgg.java PRE-CREATION trunk/test/org/apache/pig/test/TestPOPartialAggPlan.java PRE-CREATION trunk/test/org/apache/pig/test/Util.java 1170885 trunk/test/org/apache/pig/test/utils/GenPhyOp.java 1170885 Diff: https://reviews.apache.org/r/1817/diff Testing ------- test-patch [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 21 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] -1 release audit. The applied patch generated 461 release audit warnings (more than the trunk's current 455 warnings). release audit failures are because of jdiff changes All unit tests pass, new e2e tests added . Thanks, Thejas
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1817/#review1915
          -----------------------------------------------------------

          Ship it!

          lgtm assuming you tested the heck out of this

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4396>

          ?

          • Dmitriy

          On 2011-09-15 17:27:08, Thejas Nair wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/1817/

          -----------------------------------------------------------

          (Updated 2011-09-15 17:27:08)

          Review request for pig, Daniel Dai and Dmitriy Ryaboy.

          Summary

          -------

          See PIG-2228

          This addresses bug PIG-2228.

          https://issues.apache.org/jira/browse/PIG-2228

          Diffs

          -----

          trunk/conf/pig.properties 1170885

          trunk/src/org/apache/pig/Algebraic.java 1170885

          trunk/src/org/apache/pig/Main.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PhyPlanSetter.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PlanPrinter.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POLocalRearrange.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java PRE-CREATION

          trunk/src/org/apache/pig/data/DefaultTuple.java 1170885

          trunk/src/org/apache/pig/data/InternalCachedBag.java 1170885

          trunk/src/org/apache/pig/data/InternalDistinctBag.java 1170885

          trunk/src/org/apache/pig/data/InternalSortedBag.java 1170885

          trunk/src/org/apache/pig/data/SelfSpillBag.java PRE-CREATION

          trunk/src/org/apache/pig/data/SizeUtil.java PRE-CREATION

          trunk/src/org/apache/pig/data/SortedSpillBag.java 1170885

          trunk/src/org/apache/pig/tools/pigstats/ScriptState.java 1170885

          trunk/test/e2e/pig/tests/nightly.conf 1170885

          trunk/test/org/apache/pig/test/TestDataBag.java 1170885

          trunk/test/org/apache/pig/test/TestPOPartialAgg.java PRE-CREATION

          trunk/test/org/apache/pig/test/TestPOPartialAggPlan.java PRE-CREATION

          trunk/test/org/apache/pig/test/Util.java 1170885

          trunk/test/org/apache/pig/test/utils/GenPhyOp.java 1170885

          Diff: https://reviews.apache.org/r/1817/diff

          Testing

          -------

          test-patch

          [exec] -1 overall.

          [exec]

          [exec] +1 @author. The patch does not contain any @author tags.

          [exec]

          [exec] +1 tests included. The patch appears to include 21 new or modified tests.

          [exec]

          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.

          [exec]

          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          [exec]

          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.

          [exec]

          [exec] -1 release audit. The applied patch generated 461 release audit warnings (more than the trunk's current 455 warnings).

          release audit failures are because of jdiff changes

          All unit tests pass, new e2e tests added .

          Thanks,

          Thejas

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1817/#review1915 ----------------------------------------------------------- Ship it! lgtm assuming you tested the heck out of this trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4396 > ? Dmitriy On 2011-09-15 17:27:08, Thejas Nair wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1817/ ----------------------------------------------------------- (Updated 2011-09-15 17:27:08) Review request for pig, Daniel Dai and Dmitriy Ryaboy. Summary ------- See PIG-2228 This addresses bug PIG-2228 . https://issues.apache.org/jira/browse/PIG-2228 Diffs ----- trunk/conf/pig.properties 1170885 trunk/src/org/apache/pig/Algebraic.java 1170885 trunk/src/org/apache/pig/Main.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PhyPlanSetter.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PlanPrinter.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POLocalRearrange.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java PRE-CREATION trunk/src/org/apache/pig/data/DefaultTuple.java 1170885 trunk/src/org/apache/pig/data/InternalCachedBag.java 1170885 trunk/src/org/apache/pig/data/InternalDistinctBag.java 1170885 trunk/src/org/apache/pig/data/InternalSortedBag.java 1170885 trunk/src/org/apache/pig/data/SelfSpillBag.java PRE-CREATION trunk/src/org/apache/pig/data/SizeUtil.java PRE-CREATION trunk/src/org/apache/pig/data/SortedSpillBag.java 1170885 trunk/src/org/apache/pig/tools/pigstats/ScriptState.java 1170885 trunk/test/e2e/pig/tests/nightly.conf 1170885 trunk/test/org/apache/pig/test/TestDataBag.java 1170885 trunk/test/org/apache/pig/test/TestPOPartialAgg.java PRE-CREATION trunk/test/org/apache/pig/test/TestPOPartialAggPlan.java PRE-CREATION trunk/test/org/apache/pig/test/Util.java 1170885 trunk/test/org/apache/pig/test/utils/GenPhyOp.java 1170885 Diff: https://reviews.apache.org/r/1817/diff Testing ------- test-patch [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 21 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] -1 release audit. The applied patch generated 461 release audit warnings (more than the trunk's current 455 warnings). release audit failures are because of jdiff changes All unit tests pass, new e2e tests added . Thanks, Thejas
          Hide
          Thejas M Nair added a comment -

          Patch committed to trunk.

          Show
          Thejas M Nair added a comment - Patch committed to trunk.
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1817/#review1916
          -----------------------------------------------------------

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
          <https://reviews.apache.org/r/1817/#comment4397>

          removed the extra ";" in the patch checked in.

          • Thejas

          On 2011-09-15 17:27:08, Thejas Nair wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/1817/

          -----------------------------------------------------------

          (Updated 2011-09-15 17:27:08)

          Review request for pig, Daniel Dai and Dmitriy Ryaboy.

          Summary

          -------

          See PIG-2228

          This addresses bug PIG-2228.

          https://issues.apache.org/jira/browse/PIG-2228

          Diffs

          -----

          trunk/conf/pig.properties 1170885

          trunk/src/org/apache/pig/Algebraic.java 1170885

          trunk/src/org/apache/pig/Main.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PhyPlanSetter.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PlanPrinter.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POLocalRearrange.java 1170885

          trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java PRE-CREATION

          trunk/src/org/apache/pig/data/DefaultTuple.java 1170885

          trunk/src/org/apache/pig/data/InternalCachedBag.java 1170885

          trunk/src/org/apache/pig/data/InternalDistinctBag.java 1170885

          trunk/src/org/apache/pig/data/InternalSortedBag.java 1170885

          trunk/src/org/apache/pig/data/SelfSpillBag.java PRE-CREATION

          trunk/src/org/apache/pig/data/SizeUtil.java PRE-CREATION

          trunk/src/org/apache/pig/data/SortedSpillBag.java 1170885

          trunk/src/org/apache/pig/tools/pigstats/ScriptState.java 1170885

          trunk/test/e2e/pig/tests/nightly.conf 1170885

          trunk/test/org/apache/pig/test/TestDataBag.java 1170885

          trunk/test/org/apache/pig/test/TestPOPartialAgg.java PRE-CREATION

          trunk/test/org/apache/pig/test/TestPOPartialAggPlan.java PRE-CREATION

          trunk/test/org/apache/pig/test/Util.java 1170885

          trunk/test/org/apache/pig/test/utils/GenPhyOp.java 1170885

          Diff: https://reviews.apache.org/r/1817/diff

          Testing

          -------

          test-patch

          [exec] -1 overall.

          [exec]

          [exec] +1 @author. The patch does not contain any @author tags.

          [exec]

          [exec] +1 tests included. The patch appears to include 21 new or modified tests.

          [exec]

          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.

          [exec]

          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          [exec]

          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.

          [exec]

          [exec] -1 release audit. The applied patch generated 461 release audit warnings (more than the trunk's current 455 warnings).

          release audit failures are because of jdiff changes

          All unit tests pass, new e2e tests added .

          Thanks,

          Thejas

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1817/#review1916 ----------------------------------------------------------- trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java < https://reviews.apache.org/r/1817/#comment4397 > removed the extra ";" in the patch checked in. Thejas On 2011-09-15 17:27:08, Thejas Nair wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1817/ ----------------------------------------------------------- (Updated 2011-09-15 17:27:08) Review request for pig, Daniel Dai and Dmitriy Ryaboy. Summary ------- See PIG-2228 This addresses bug PIG-2228 . https://issues.apache.org/jira/browse/PIG-2228 Diffs ----- trunk/conf/pig.properties 1170885 trunk/src/org/apache/pig/Algebraic.java 1170885 trunk/src/org/apache/pig/Main.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PhyPlanSetter.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PlanPrinter.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POLocalRearrange.java 1170885 trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java PRE-CREATION trunk/src/org/apache/pig/data/DefaultTuple.java 1170885 trunk/src/org/apache/pig/data/InternalCachedBag.java 1170885 trunk/src/org/apache/pig/data/InternalDistinctBag.java 1170885 trunk/src/org/apache/pig/data/InternalSortedBag.java 1170885 trunk/src/org/apache/pig/data/SelfSpillBag.java PRE-CREATION trunk/src/org/apache/pig/data/SizeUtil.java PRE-CREATION trunk/src/org/apache/pig/data/SortedSpillBag.java 1170885 trunk/src/org/apache/pig/tools/pigstats/ScriptState.java 1170885 trunk/test/e2e/pig/tests/nightly.conf 1170885 trunk/test/org/apache/pig/test/TestDataBag.java 1170885 trunk/test/org/apache/pig/test/TestPOPartialAgg.java PRE-CREATION trunk/test/org/apache/pig/test/TestPOPartialAggPlan.java PRE-CREATION trunk/test/org/apache/pig/test/Util.java 1170885 trunk/test/org/apache/pig/test/utils/GenPhyOp.java 1170885 Diff: https://reviews.apache.org/r/1817/diff Testing ------- test-patch [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 21 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] -1 release audit. The applied patch generated 461 release audit warnings (more than the trunk's current 455 warnings). release audit failures are because of jdiff changes All unit tests pass, new e2e tests added . Thanks, Thejas
          Hide
          Jie Li added a comment -

          Here is the result of the TPC-H Benchmark (PIG-2397).

          100GB Q1
          Hive 0.7.1: 2334 seconds
          Pig 0.9.0: 9856
          Pig 0.10: 7809

          100GB Q6:
          Hive 0.7.1: 1100
          Pig 0.9.0: 1591
          Pig 0.10: 1696

          10GB Q1:
          Hive: 408.47
          Pig 0.9: 1219
          Pig 0.10: 1026

          10GB Q6:
          Hive:157.52
          Pig 0.9: 214
          Pig 0.10:234

          Results of two dataset size are consistent. Q1 (15%~20% improvement) has a group-by on two columns, with only 4 different keys, and its filter is not selective at all (98% passed).

          Q6(5%-10% overhead) has a group all, and the filter is very selective (only 1/60 passed). This may explains why Q6 doesn't benefit from the hash agg.

          While we see great improvement for Q1, we still take 2-3x time as Hive. I can provide more info for further analysis.

          Show
          Jie Li added a comment - Here is the result of the TPC-H Benchmark ( PIG-2397 ). 100GB Q1 Hive 0.7.1: 2334 seconds Pig 0.9.0: 9856 Pig 0.10: 7809 100GB Q6: Hive 0.7.1: 1100 Pig 0.9.0: 1591 Pig 0.10: 1696 10GB Q1: Hive: 408.47 Pig 0.9: 1219 Pig 0.10: 1026 10GB Q6: Hive:157.52 Pig 0.9: 214 Pig 0.10:234 Results of two dataset size are consistent. Q1 (15%~20% improvement) has a group-by on two columns, with only 4 different keys, and its filter is not selective at all (98% passed). Q6(5%-10% overhead) has a group all, and the filter is very selective (only 1/60 passed). This may explains why Q6 doesn't benefit from the hash agg. While we see great improvement for Q1, we still take 2-3x time as Hive. I can provide more info for further analysis.
          Hide
          Thejas M Nair added a comment -

          Q6(5%-10% overhead) has a group all, and the filter is very selective (only 1/60 passed). This may explains why Q6 doesn't benefit from the hash agg.

          group-all is the best case for hash-agg. So hash-agg should not slow down Q6. It must be something else that is wrong in 0.10 (can be a hash agg implementation issue as well).
          Is this the result of one run, do you see lot of variation (>5-10%) in runtime between runs ?

          Show
          Thejas M Nair added a comment - Q6(5%-10% overhead) has a group all, and the filter is very selective (only 1/60 passed). This may explains why Q6 doesn't benefit from the hash agg. group-all is the best case for hash-agg. So hash-agg should not slow down Q6. It must be something else that is wrong in 0.10 (can be a hash agg implementation issue as well). Is this the result of one run, do you see lot of variation (>5-10%) in runtime between runs ?
          Hide
          Jie Li added a comment -

          I ran it only once, but for both 10GB (10% overhead) and 100GB (6% overhead). Because the filter before the group-by-all is very selective, only 1/60 records are sent to group-by-all and processed by the hash-agg, so the dominant cost should be the reading and filtering. Maybe I can remove the filter to see the benefit of the hash-agg?

          Show
          Jie Li added a comment - I ran it only once, but for both 10GB (10% overhead) and 100GB (6% overhead). Because the filter before the group-by-all is very selective, only 1/60 records are sent to group-by-all and processed by the hash-agg, so the dominant cost should be the reading and filtering. Maybe I can remove the filter to see the benefit of the hash-agg?
          Hide
          Thejas M Nair added a comment -

          Yes, the dominant cost is likely to be reading and filtering. So benefit of using hash-bashed agg is likely to go unnoticed, but at the same time, there should not be any degradation in performance with 0.10. So I was wondering if the 5-10% degradation seen with 0.10 is noise, or something that needs further investigation.

          Show
          Thejas M Nair added a comment - Yes, the dominant cost is likely to be reading and filtering. So benefit of using hash-bashed agg is likely to go unnoticed, but at the same time, there should not be any degradation in performance with 0.10. So I was wondering if the 5-10% degradation seen with 0.10 is noise, or something that needs further investigation.
          Hide
          Jie Li added a comment -

          I repeat Q6 + 10GB data on different versions, each with three times and below are the average times:

          pig 0.9: 214 sec
          pig 0.10 without hashagg: 224 sec
          pig 0.10 with hashagg: 224 sec

          So the slight degradation should not be related to the hashagg.

          Show
          Jie Li added a comment - I repeat Q6 + 10GB data on different versions, each with three times and below are the average times: pig 0.9: 214 sec pig 0.10 without hashagg: 224 sec pig 0.10 with hashagg: 224 sec So the slight degradation should not be related to the hashagg.
          Hide
          Thejas M Nair added a comment -

          Thanks. Looks like there is something in 0.10 causing around 5% slowness, created PIG-2434 to investigate that. (Unfortunately, I don't think I will be able to look at it soon enough).

          Show
          Thejas M Nair added a comment - Thanks. Looks like there is something in 0.10 causing around 5% slowness, created PIG-2434 to investigate that. (Unfortunately, I don't think I will be able to look at it soon enough).

            People

            • Assignee:
              Thejas M Nair
              Reporter:
              Thejas M Nair
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development