[SPARK-10223] Add takeOrderedByKey function to extract top N records within each group - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Minor
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: None
Component/s: PySpark
Labels: None

Description

PySpark currently has a takeOrdered function that returns the top N records. However, you often want to extract the top N records within each group. This can be implemented with the combineByKey operation, using a fixed-size heap to capture the top N records within each group. A working solution can be found [here](https://ragrawal.wordpress.com/2015/08/25/pyspark-top-n-records-in-each-group/).
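
A minimal sketch of the approach described above, not the code from the linked post: the three callbacks that combineByKey expects, each maintaining a min-heap of at most N values per key. The names `top_n_combiners` and `take_ordered_by_key_local` are hypothetical, the sketch assumes N >= 1 and mutually comparable values, and it takes "top N" to mean the N largest. The local helper only emulates the per-key merging on a plain list so the logic can be checked without a Spark cluster.

```python
import heapq


def top_n_combiners(n):
    """Return (create_combiner, merge_value, merge_combiners) for combineByKey.

    Hypothetical helper; assumes n >= 1 and that values are comparable.
    """

    def create_combiner(value):
        # A single-element list is already a valid heap.
        return [value]

    def merge_value(heap, value):
        if len(heap) < n:
            heapq.heappush(heap, value)
        else:
            # Push the new value, then drop the smallest of the n + 1 values.
            heapq.heappushpop(heap, value)
        return heap

    def merge_combiners(left, right):
        # Fold the second heap into the first, keeping at most n values.
        for value in right:
            merge_value(left, value)
        return left

    return create_combiner, merge_value, merge_combiners


def take_ordered_by_key_local(pairs, n):
    """Emulate the per-key merge on an in-memory list of (key, value) pairs."""
    create_combiner, merge_value, _ = top_n_combiners(n)
    heaps = {}
    for key, value in pairs:
        if key in heaps:
            heaps[key] = merge_value(heaps[key], value)
        else:
            heaps[key] = create_combiner(value)
    # Largest first within each key.
    return {key: sorted(heap, reverse=True) for key, heap in heaps.items()}
```

On an actual pair RDD the same callbacks could presumably be passed as `rdd.combineByKey(create_combiner, merge_value, merge_combiners)`, followed by `mapValues(lambda heap: sorted(heap, reverse=True))` to order each group. Because each heap never holds more than N values, the memory used per key stays bounded regardless of group size.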

People

Assignee: Unassigned

Reporter: Ritesh Agrawal

Shepherd: DB Tsai

Votes: 0

Watchers: 4

Dates

Created: 25/Aug/15 13:05

Updated: 16/Oct/16 12:34

Resolved: 16/Oct/16 12:34