Details
-
Sub-task
-
Status: Closed
-
Major
-
Resolution: Fixed
-
tez-branch
-
None
Description
Currently, DISTINCT is implemented in a straightforward manner per https://issues.apache.org/jira/browse/PIG-3538.
However, we can implement two types of combiner optimizations for DISTINCT, just as the MRCompiler does for map-reduce:
1. A simple DistinctCombiner that throws away the duplicate tuples
2. An optimizer that transforms certain uses of DISTINCT into an algebraic udf form