Pig / PIG-2228

support partial aggregation in map task

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.0
    • Component/s: None
    • Labels:
      None
    • Release Note:
      In-map aggregation aggregates records before sending them to the combiner. This reduces the serialization/deserialization cost of using the combiner, because fewer records are sent to it.
      It has been shown to speed up the map task in a group-by query by up to 50%.
      Since this is a very new feature, it is currently turned off by default. To turn it on, set the property pig.exec.mapPartAgg to true.

      If the group-by keys don't sufficiently reduce the number of records, performance might be worse with this feature turned on. To prevent that, the feature turns itself off if the reduction in records sent to the combiner is not more than a configurable threshold. This threshold can be set using the property pig.exec.mapPartAgg.minReduction. Its default value is 10, which means that the number of records sent to the combiner should be reduced by a factor of 10.
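The self-disabling check described above can be sketched roughly as follows. This is only an illustration in Python, not Pig's implementation: the function names and the idea of measuring the reduction on a sample of map output are assumptions. Only the default of 10 and the "more than the threshold" rule come from the release note.

```python
def reduction_factor(sample):
    """Ratio of input records to records left after aggregating by key.

    sample: list of (key, value) pairs taken from the map output
            (hypothetical; how Pig samples is not stated in this issue).
    """
    distinct_keys = {key for key, _ in sample}
    if not distinct_keys:
        return float("inf")  # nothing to send, so nothing is lost by aggregating
    return len(sample) / len(distinct_keys)


def keep_in_map_aggregation(sample, min_reduction=10):
    """Keep the feature on only if the reduction is more than the threshold."""
    return reduction_factor(sample) > min_reduction


# 100 records over 5 keys: reduced by a factor of 20, so aggregation stays on.
few_keys = [(i % 5, 1) for i in range(100)]
# 100 records over 50 keys: reduced by a factor of 2, so aggregation is turned off.
many_keys = [(i % 50, 1) for i in range(100)]
```

With the default threshold, `keep_in_map_aggregation(few_keys)` keeps the feature on and `keep_in_map_aggregation(many_keys)` turns it off.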

      Description

      Introduction

      Pig does (sort-based) partial aggregation on the map side through the use of the combiner. MR serializes the map output to a buffer, sorts it on the keys, then deserializes it and passes the values grouped by key to the combiner phase. The same work can be done in the map phase itself by using a hash map on the keys. This hash-based (partial) aggregation can be done with or without a combiner phase.
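As a rough illustration of hash-based partial aggregation, the Python sketch below sums values per key in a bounded hash map and flushes partial results when the map is full. It is not Pig's code: the function name, the `max_keys` cap, and the flush-everything policy are assumptions made for the example.

```python
from collections import defaultdict


def in_map_partial_sum(records, max_keys=1000):
    """Yield (key, partial_sum) pairs, aggregating in a bounded hash map.

    records:  iterable of (key, value) pairs produced by the map function.
    max_keys: hypothetical cap on distinct keys held in memory; when it is
              reached, the current partial results are emitted downstream.
    """
    partial = defaultdict(int)
    for key, value in records:
        if key not in partial and len(partial) >= max_keys:
            # The table is full: emit what we have and start over.
            yield from list(partial.items())
            partial.clear()
        partial[key] += value
    # Emit whatever is left at the end of the map task.
    yield from partial.items()


records = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
aggregated = list(in_map_partial_sum(records))
# With a cap of one key, partial results are flushed early and merged later.
flushed_early = list(in_map_partial_sum(records, max_keys=1))
```

Because the output is only partially aggregated, a later combiner or reduce step still has to merge the partial sums for each key. This is why the approach works with or without a combiner phase.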

      Benefits

      It will send fewer records to the combiner, and thereby:

      • Save on the cost of serializing and deserializing.
      • Save on the cost of lock calls on the combiner input buffer. (I have found this to be a significant cost for a query that was doing multiple group-bys in a single MR job. -Thejas)
      • The problem of running out of memory on the reduce side, for queries like COUNT(distinct col), can be avoided. The OOM issue happens because very large records get created after the combiner runs on merged reduce input. With the combiner, you have no way of telling MR not to combine records on the reduce side. The workaround is to disable the combiner completely, and then the opportunity to reduce map output size is lost.
      • When the foreach after the group-by has both algebraic and non-algebraic functions, or if a bag is being projected, the combiner is not used. This is because the data size reduction in typical cases is not significant enough to justify the additional (de)serialization costs. But hash-based aggregation can be used in such cases as well.
      • It is possible to turn off the in-map combine automatically if there is not enough 'combination' taking place to justify the overhead of the in-map combiner. (Idea borrowed from Hive jira.)
      • If the input data is sorted, it is possible to do efficient map-side (partial) aggregation with the in-map combiner.
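The sorted-input case in the last bullet can be illustrated with a small Python sketch. It assumes the input is already sorted on the group-by key and that the aggregate is a sum; the function name is hypothetical and this is not how Pig implements it. Since equal keys are adjacent, only one key's running total is held in memory at a time and no hash map is needed.

```python
from itertools import groupby
from operator import itemgetter


def aggregate_sorted(records):
    """Yield (key, total) for records already sorted on their key.

    records: iterable of (key, value) pairs, sorted by key (an assumption;
             unsorted input would yield several partial totals per key).
    """
    for key, group in groupby(records, key=itemgetter(0)):
        # All records for this key are consecutive, so sum them in one pass.
        yield key, sum(value for _, value in group)


sorted_input = [("a", 1), ("a", 2), ("b", 3), ("c", 4), ("c", 5)]
totals = list(aggregate_sorted(sorted_input))
```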

      Design proposal is here - https://cwiki.apache.org/confluence/display/PIG/PigInMapCombinerProposal

      1. PIG-2228.1.patch
        57 kB
        Thejas M Nair
      2. PIG-2228.2.patch
        76 kB
        Thejas M Nair
      3. PIG-2228.3.patch
        79 kB
        Thejas M Nair
      4. PIG-2228.4.patch
        83 kB
        Thejas M Nair
      5. PIG-2228.5.patch
        85 kB
        Thejas M Nair
      6. PIG-2228.6.patch
        88 kB
        Thejas M Nair


            People

            • Assignee:
              Thejas M Nair
              Reporter:
              Thejas M Nair
            • Votes:
              0
              Watchers:
              6
