Details
New Feature
Status: Closed
Minor
Resolution: Won't Fix
Description
PySpark currently has a takeOrdered function that returns the top N records of an RDD. Often, however, you want the top N records within each group. This can be implemented with the combineByKey operation, using a fixed-size heap to capture the top N records within each group. A working solution can be found [here](https://ragrawal.wordpress.com/2015/08/25/pyspark-top-n-records-in-each-group/).
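The approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from the linked post: the helper names `make_top_n_combiners` and `top_n_local` are hypothetical, and `top_n_local` only simulates the per-partition and cross-partition merge steps locally so that the example runs without a Spark cluster. It assumes `n >= 1` and that values are mutually comparable.

```python
import heapq


def make_top_n_combiners(n):
    """Build the three functions combineByKey expects (hypothetical helper).

    Each per-key accumulator is a min-heap holding at most n values,
    so the smallest retained value is always at index 0.
    """
    def create_combiner(value):
        # A one-element list is already a valid heap.
        return [value]

    def merge_value(heap, value):
        if len(heap) < n:
            heapq.heappush(heap, value)
        else:
            # Push the new value, then drop the smallest of the n + 1.
            heapq.heappushpop(heap, value)
        return heap

    def merge_combiners(heap_a, heap_b):
        for value in heap_b:
            merge_value(heap_a, value)
        return heap_a

    return create_combiner, merge_value, merge_combiners


def top_n_local(partitions, n):
    """Local stand-in for combineByKey over lists of (key, value) pairs."""
    create, merge_val, merge_comb = make_top_n_combiners(n)
    merged = {}
    for partition in partitions:
        partial = {}
        for key, value in partition:
            if key in partial:
                partial[key] = merge_val(partial[key], value)
            else:
                partial[key] = create(value)
        # Combine this partition's heaps with the running result.
        for key, heap in partial.items():
            merged[key] = merge_comb(merged[key], heap) if key in merged else heap
    # Present each group's values largest first.
    return {key: sorted(heap, reverse=True) for key, heap in merged.items()}


# With a real RDD of (key, value) pairs, the same functions would be used as:
#   create, merge_val, merge_comb = make_top_n_combiners(2)
#   rdd.combineByKey(create, merge_val, merge_comb) \
#      .mapValues(lambda heap: sorted(heap, reverse=True))
```

To rank whole records rather than bare numbers, one option is to use `(score, record)` tuples as values, provided the records themselves are comparable when scores tie.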