Details
-
Sub-task
-
Status: Closed
-
Major
-
Resolution: Fixed
-
0.2.0
-
None
-
None
Description
Frequently, users are interested on Top results (especially Top K rows) . This can be implemented efficiently in Pig /Map Reduce settings to deliver rapid results and low Network Bandwidth/Memory usage.
Key point is to prune all data on the map side and keep only small set of rows with Top criteria . We can do it in Algebraic function (combiner) with multiple value output. Only a small data-set gets out of mapper node.
The same idea is applicable to solve variants of this problem:
- An Algebraic Function for 'Top K Rows'
- An Algebraic Function for 'Top K' values ('Top Rank K' and 'Top Dense Rank K')
- TOP K ORDER BY.
Another words implementation is similar to combiners for aggregate functions but instead of one value we get multiple ones.
I will add a sample implementation for Top K Rows and possibly TOP K ORDER BY to clarify details.
Attachments
Attachments
Issue Links
- depends upon
-
PIG-157 Add types and rework execution pipeline
- Closed