Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
This is a very frequently used pattern:
grouped_data_set = group data_set by id;
capped_data_set = foreach grouped_data_set {
    ordered = order data_set by timestamp desc;
    capped = limit ordered $num;
    generate flatten(capped);
};
But this performs very poorly when a single group-by key has millions of rows, because the sorted bag spills heavily. It can be optimized easily by pushing the limit into the InternalSortedBag, so that the bag holds only $num records at any time and avoids the memory pressure.
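To make the proposal concrete, here is a minimal sketch of the idea in Java. It is not Pig's InternalSortedBag API: the class name BoundedTopN and its methods are hypothetical, and it assumes the limit is known when the bag is created. It keeps at most `limit` elements in a heap and discards anything that cannot appear in the final ordered, limited result, so memory stays proportional to the limit rather than to the number of rows per key.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical helper (not part of Pig): retains only the first `limit` elements under `order`. */
final class BoundedTopN<T> {
    private final int limit;
    private final Comparator<T> order;
    // Heap ordered by the reverse of `order`, so its head is the worst retained element.
    private final PriorityQueue<T> heap;

    BoundedTopN(int limit, Comparator<T> order) {
        if (limit < 1) {
            throw new IllegalArgumentException("limit must be >= 1");
        }
        this.limit = limit;
        this.order = order;
        this.heap = new PriorityQueue<>(limit, order.reversed());
    }

    /** Adds an item, evicting the worst retained element once the limit is reached. */
    void add(T item) {
        if (heap.size() < limit) {
            heap.add(item);
        } else if (order.compare(item, heap.peek()) < 0) {
            // The new item sorts before the current worst one, so it replaces it.
            heap.poll();
            heap.add(item);
        }
        // Otherwise the item can never be in the top `limit` and is dropped.
    }

    int size() {
        return heap.size();
    }

    /** Returns the retained elements sorted by `order`. */
    List<T> sorted() {
        List<T> out = new ArrayList<>(heap);
        out.sort(order);
        return out;
    }
}
```

With a descending comparator on the timestamp, this corresponds to `order ... by timestamp desc` followed by `limit ... $num`: each add costs O(log $num), and nothing beyond $num records is ever held or spilled.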