Details
Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Description
Hive currently supports aggregating values into a list "in order of input rows" with the UDF collect_list. Unfortunately, that order is not well defined when map-side aggregation is used.
Hive could support collecting lists in a user-defined order by providing a UDF COLLECT_LIST_SORTED(valueColumn, sortColumn[, limit]) that returns a list of values sorted by sortColumn. The optional limit parameter restricts the result to the first n values in that order.
The limit case in particular can be pre-aggregated efficiently: each mapper only needs to keep its own first n values in sort order, which reduces the amount of data transferred to reducers.
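
The pre-aggregation idea can be illustrated with a small sketch. This is plain Python, not Hive code, and it is not a proposed implementation: the function names (partial_aggregate, merge_partials, finalize) are hypothetical, and it assumes ascending order on the sort column with the limit keeping the first n values. Each simulated mapper keeps only its own first n (sortKey, value) pairs, and the reducer merges those already-sorted partial lists and truncates again.

```python
import heapq
from itertools import islice
from typing import Any, Iterable, List, Optional, Tuple

Pair = Tuple[Any, Any]  # (sort_key, value); hypothetical representation


def partial_aggregate(rows: Iterable[Pair], limit: Optional[int] = None) -> List[Pair]:
    """Map side: sort by sort key and, if a limit is given, keep only the first n pairs."""
    if limit is None:
        return sorted(rows, key=lambda p: p[0])
    return heapq.nsmallest(limit, rows, key=lambda p: p[0])


def merge_partials(partials: Iterable[List[Pair]], limit: Optional[int] = None) -> List[Pair]:
    """Reduce side: merge already-sorted partial lists and truncate to the limit again."""
    merged = heapq.merge(*partials, key=lambda p: p[0])
    if limit is not None:
        merged = islice(merged, limit)
    return list(merged)


def finalize(pairs: List[Pair]) -> List[Any]:
    """Drop the sort keys and return only the collected values."""
    return [value for _, value in pairs]


# Two simulated mappers, each seeing part of the input for one group.
mapper_1 = partial_aggregate([(3, "c"), (1, "a"), (5, "e")], limit=2)
mapper_2 = partial_aggregate([(2, "b"), (4, "d")], limit=2)

result = finalize(merge_partials([mapper_1, mapper_2], limit=2))
```

With limit=2, each mapper forwards at most two pairs per group regardless of how many rows it read, and the result is the same as sorting all rows globally and taking the first two values.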