[PIG-171] Top K - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.2.0
Fix Version/s: 0.2.0
Component/s: None
Labels: None

Description

Users are frequently interested in top results (especially the Top K rows). This can be implemented efficiently in Pig/MapReduce to deliver fast results with low network bandwidth and memory usage.

The key point is to prune data on the map side and keep only the small set of rows that meet the Top K criterion. This can be done in an Algebraic function (combiner) that outputs multiple values, so only a small data set leaves each mapper node.
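The pruning idea can be sketched outside of Pig in a few lines of Python. This is only an illustration under stated assumptions, not the code in the attached patches: the row shape `(key, score)`, the function names `local_top_k` and `merge_top_k`, and the sample partitions are all hypothetical.

```python
import heapq
from typing import Iterable, List, Tuple

Row = Tuple[str, int]  # hypothetical row shape: (key, score)


def local_top_k(rows: Iterable[Row], k: int) -> List[Row]:
    """Map/combine step: keep only the k highest-scoring rows of one partition."""
    return heapq.nlargest(k, rows, key=lambda r: r[1])


def merge_top_k(partials: Iterable[List[Row]], k: int) -> List[Row]:
    """Reduce step: merge the per-partition candidates into the global top k."""
    candidates = (row for part in partials for row in part)
    return heapq.nlargest(k, candidates, key=lambda r: r[1])


# Three hypothetical input splits, one per mapper.
partitions = [
    [("a", 5), ("b", 42), ("c", 17)],
    [("d", 8), ("e", 99), ("f", 3)],
    [("g", 61), ("h", 1)],
]

# Each mapper emits at most k rows, so at most k * len(partitions) rows
# cross the network instead of the full input.
partials = [local_top_k(p, 2) for p in partitions]
top = merge_top_k(partials, 2)
```

Because any row in the global top k must also be in the top k of its own partition, merging the pruned partial results gives the same answer as scanning all rows at once.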

The same idea is applicable to solve variants of this problem:

An Algebraic Function for 'Top K Rows'
An Algebraic Function for 'Top K' values ('Top Rank K' and 'Top Dense Rank K')
TOP K ORDER BY.
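The issue does not define 'Top Rank K' and 'Top Dense Rank K'. The sketch below assumes they follow the usual SQL `RANK()` and `DENSE_RANK()` semantics; the function names and sample values are hypothetical and chosen only to show how the two variants differ on ties.

```python
import heapq
from typing import List


def top_rank_k(values: List[int], k: int) -> List[int]:
    """Values whose rank is <= k; ties share a rank and later ranks skip."""
    ordered = sorted(values, reverse=True)
    result: List[int] = []
    rank = 0
    for i, v in enumerate(ordered):
        # A new distinct value takes its 1-based position as its rank.
        if i == 0 or v != ordered[i - 1]:
            rank = i + 1
        if rank > k:
            break
        result.append(v)
    return result


def top_dense_rank_k(values: List[int], k: int) -> List[int]:
    """Values belonging to the k largest distinct values (no gaps in ranks)."""
    keep = set(heapq.nlargest(k, set(values)))
    return sorted((v for v in values if v in keep), reverse=True)


scores = [90, 90, 80, 70, 70, 60]
by_rank = top_rank_k(scores, 3)         # ranks are 1, 1, 3, 4, 4, 6
by_dense = top_dense_rank_k(scores, 3)  # dense ranks are 1, 1, 2, 3, 3, 4
```

Under this reading, both variants can return more than K rows when there are ties, so a map-side pruning step would need to keep every row tied with a retained value rather than a fixed K rows.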

In other words, the implementation is similar to combiners for aggregate functions, but instead of a single value each stage emits multiple values.

I will add a sample implementation for Top K Rows, and possibly TOP K ORDER BY, to clarify the details.

Attachments

- limit1.patch (20 kB), 20/Jun/08 01:23, Daniel Dai
- limit2.patch (29 kB), 01/Jul/08 17:35, Daniel Dai
- limit3.patch (64 kB), 09/Jul/08 06:15, Daniel Dai

Issue Links

depends upon: PIG-157 Add types and rework execution pipeline (Closed)

People

Assignee: Daniel Dai

Reporter: Amir Youssefi

Votes: 2

Watchers: 4

Dates

Created: 28/Mar/08 04:00

Updated: 02/May/13 02:29

Resolved: 25/Jul/08 03:20