Pig
  1. Pig
  2. PIG-4002

Disable combiner when map-side aggregation is used

    Details

    • Type: Improvement Improvement
    • Status: Patch Available
    • Priority: Minor Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.13.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      This may be controversial, so I'd like others' opinions on this.

      It is not currently possible to disable the combiner and use map-side aggregation at the same time. This is a problematic since map-side aggregation effectively combines in the mapper, so running the combiner adds expensive combiner execution (combiner requires deserialization & reserialization) for little to no value.

      PIG-2829 had a patch to disable the combiner when map-side aggregation is used (along with some other changes). This was never integrated because the map-side aggregation code was redone while this was in progress.

      1. PIG-4002-1.patch
        5 kB
        Travis Woodruff

        Activity

        Hide
        Travis Woodruff added a comment -

        This is an updated version of 2829.separate.options.patch from PIG-2829.

        Fixes a bug in that patch that defaulted pig.exec.nocombiner to false for all jobs.

        Show
        Travis Woodruff added a comment - This is an updated version of 2829.separate.options.patch from PIG-2829 . Fixes a bug in that patch that defaulted pig.exec.nocombiner to false for all jobs.
        Hide
        Travis Woodruff added a comment -

        The drawback of doing this is that it removes the combiner plan completely, so if in-memory aggregation is disabled at execution time (by POPartialAgg.checkSizeReduction()), the combiner won't be used. One could argue that this is acceptable since if in-memory aggregation didn't provide a substantial reduction, the combiner wouldn't either.

        Show
        Travis Woodruff added a comment - The drawback of doing this is that it removes the combiner plan completely, so if in-memory aggregation is disabled at execution time (by POPartialAgg.checkSizeReduction()), the combiner won't be used. One could argue that this is acceptable since if in-memory aggregation didn't provide a substantial reduction, the combiner wouldn't either.
        Hide
        Travis Woodruff added a comment -

        Oops... negative properties strike again. earlier comment should say

        Fixes a bug in that patch that defaulted pig.exec.nocombiner to true for all jobs.

        Show
        Travis Woodruff added a comment - Oops... negative properties strike again. earlier comment should say Fixes a bug in that patch that defaulted pig.exec.nocombiner to true for all jobs.
        Hide
        Cheolsoo Park added a comment -

        Thank you for the patch. I agree with the idea that two optimizations should be configurable independently. But I am not sure that removing combiner plan after CombinerOptimizer runs is the right way to fix it.

        Your CombinerPlanRemover doesn't seem to undo any changes made to the reduce plan by CombinerOptimizer. For example, CombinerPackager isn't unset-

        // Change the package operator in the reduce plan to
        // be the POCombiner package, as it needs to act
        // differently than the regular package operator.
        pack.setPkgr(pkgr.clone());
        

        Doesn't this have any side effect? I am no expert of this area of code either, so please correct me if I am wrong.

        Btw, your patch doesn't apply nicely to trunk. Also, please add the Apache header to every new file.

        Show
        Cheolsoo Park added a comment - Thank you for the patch. I agree with the idea that two optimizations should be configurable independently. But I am not sure that removing combiner plan after CombinerOptimizer runs is the right way to fix it. Your CombinerPlanRemover doesn't seem to undo any changes made to the reduce plan by CombinerOptimizer. For example, CombinerPackager isn't unset- // Change the package operator in the reduce plan to // be the POCombiner package , as it needs to act // differently than the regular package operator . pack.setPkgr(pkgr.clone()); Doesn't this have any side effect? I am no expert of this area of code either, so please correct me if I am wrong. Btw, your patch doesn't apply nicely to trunk. Also, please add the Apache header to every new file.
        Hide
        Daniel Dai added a comment -

        Checked with Thejas M Nair, the primary reason for combiner is to use in the reduce side:

        • In map side, POPartialAgg will be used for aggregation
        • In reduce side, combiner will be used for aggregation if we have large number of maps

        Ideally we want a flag to only use combiner in reduce side but not map side.

        However, Rohini Palaniswamy point to me combiner in reduce side is not working (MAPREDUCE-5221). As long as the bug is still there, combiner will not be useful when map side aggregation is used, and we shall disable it.

        Show
        Daniel Dai added a comment - Checked with Thejas M Nair , the primary reason for combiner is to use in the reduce side: In map side, POPartialAgg will be used for aggregation In reduce side, combiner will be used for aggregation if we have large number of maps Ideally we want a flag to only use combiner in reduce side but not map side. However, Rohini Palaniswamy point to me combiner in reduce side is not working ( MAPREDUCE-5221 ). As long as the bug is still there, combiner will not be useful when map side aggregation is used, and we shall disable it.
        Hide
        Rohini Palaniswamy added a comment -

        Tez has combiner working on both map and reduce. We should only turn it off on the map side in Tez. This patch I see only addresses MR. We need to address Tez as well or open a separate jira for that.

        Show
        Rohini Palaniswamy added a comment - Tez has combiner working on both map and reduce. We should only turn it off on the map side in Tez. This patch I see only addresses MR. We need to address Tez as well or open a separate jira for that.
        Hide
        Daniel Dai added a comment -

        We'd better tez has an option to disable map side/reduce side combiner separately. Shall we create a Tez ticket?

        Show
        Daniel Dai added a comment - We'd better tez has an option to disable map side/reduce side combiner separately. Shall we create a Tez ticket?
        Hide
        Rohini Palaniswamy added a comment -

        If we set differently on InputDescriptor and OutputDescriptor it should work I believe.

        Show
        Rohini Palaniswamy added a comment - If we set differently on InputDescriptor and OutputDescriptor it should work I believe.

          People

          • Assignee:
            Travis Woodruff
            Reporter:
            Travis Woodruff
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:

              Development