Details
Improvement
Status: Closed
Major
Resolution: Fixed
0.2.0
None
None
Patch Available
Description
Currently, Pig uses the combiner only when a foreach follows a group and each element in the foreach's generate is one of the following:
1) simple project of the "group" column
2) Algebraic UDF
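
To illustrate why an algebraic UDF is combinable, here is a minimal Python sketch (not Pig code) of COUNT split into three stages. The function names `count_initial`, `count_intermediate` and `count_final` are hypothetical stand-ins for the map-side, combiner and reducer stages; the way the partial results are grouped in the combiner is an arbitrary choice made for the example.

```python
from typing import List, Tuple

Record = Tuple[str, int]


def count_initial(records: List[Record]) -> int:
    """Map side: count the records seen in one input split."""
    return len(records)


def count_intermediate(partials: List[int]) -> int:
    """Combiner: merge partial counts; may run on any subset of partials."""
    return sum(partials)


def count_final(partials: List[int]) -> int:
    """Reducer: merge whatever partial counts remain into the final count."""
    return sum(partials)


def algebraic_count(splits: List[List[Record]]) -> int:
    initial = [count_initial(s) for s in splits]
    # Assumption for illustration: the combiner runs once on the first two
    # partials and once on the rest. Any grouping gives the same answer.
    combined = [count_intermediate(initial[:2]), count_intermediate(initial[2:])]
    return count_final(combined)


# All records belong to one group key, spread over three splits.
splits = [[("k", 1), ("k", 2)], [("k", 3)], [("k", 4), ("k", 5)]]
total = algebraic_count(splits)
```

Because the partial counts can be merged in any order and grouping, the result matches a plain count over all the records.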
The above conditions exclude use of the combiner for distinct aggregates, even though the distinct operation itself is combinable (irrespective of whether it feeds an algebraic or non-algebraic UDF). So the following foreach should also be combinable:
.. b = group a by $0; c = foreach b { x = distinct a; generate group, COUNT(x), SUM(x.$1); };
The combiner optimizer should cause the distinct to be applied in the combiner, and the final combiner output should feed COUNT() and SUM() in the reduce phase.
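
The proposed behavior can be sketched in Python (not Pig code, and not the actual optimizer implementation). The names `map_side_distinct`, `combine_distinct` and `reduce_group` are hypothetical, and the sample tuples are made up for illustration. Each tuple is `($0, $1)` and all tuples belong to one group.

```python
from typing import Iterable, List, Set, Tuple

Record = Tuple[str, int]  # ($0, $1)


def map_side_distinct(split: Iterable[Record]) -> Set[Record]:
    """Drop duplicates within a single input split."""
    return set(split)


def combine_distinct(partials: Iterable[Set[Record]]) -> Set[Record]:
    """Combiner: union partial distinct sets; safe to apply repeatedly."""
    return set().union(*partials)


def reduce_group(partials: Iterable[Set[Record]]) -> Tuple[int, int]:
    """Reducer: finish the distinct, then compute COUNT(x) and SUM(x.$1)."""
    x = set().union(*partials)
    return len(x), sum(value for _, value in x)


# Three splits holding tuples of the same group, with duplicates.
splits: List[List[Record]] = [
    [("k", 1), ("k", 2), ("k", 1)],
    [("k", 2), ("k", 3)],
    [("k", 3)],
]

partials = [map_side_distinct(s) for s in splits]
# Assumption for illustration: the combiner merges the first two partials only.
combined = [combine_distinct(partials[:2]), partials[2]]
count_x, sum_x = reduce_group(combined)

# Reference result with no combining: distinct over all tuples at once.
all_distinct = {t for s in splits for t in s}
```

Since set union is associative, commutative and idempotent, the combined path produces the same distinct bag as the uncombined one, while fewer duplicate tuples need to reach the reducer.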