Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
None
Description
Use the Arrow C++ function partition_nth_indices to optimize dplyr queries like this:
iris %>% Table$create() %>% arrange(desc(Sepal.Length)) %>% head(10) %>% collect()
This query sorts the full table even though it doesn't need to. It could use partition_nth_indices to find the rows containing the top 10 values of Sepal.Length and only collect and sort those 10 rows.
Benchmark this on larger data to confirm that it improves performance in practice.
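The approach described above (partition to find the top rows, then sort only those rows) can be sketched in plain Python. This is an illustration of the idea only, not the Arrow kernel or the R bindings: the function below reuses the name partition_nth_indices for readability, but it is a hypothetical pure-Python stand-in built on quickselect, and top_k_desc is a helper name invented for this example.

```python
import random


def partition_nth_indices(values, pivot):
    """Hypothetical stand-in for the Arrow kernel of the same name.

    Returns a list of indices into `values` such that the index at
    position `pivot` is where it would be after a full ascending sort,
    everything before it points at values <= it, and everything after
    it points at values >= it. Neither side is sorted.
    """
    idx = list(range(len(values)))
    lo, hi = 0, len(idx) - 1
    rng = random.Random(0)  # fixed seed keeps the sketch deterministic
    while lo < hi:
        # Lomuto partition of idx[lo..hi] around a randomly chosen element.
        p = rng.randint(lo, hi)
        idx[p], idx[hi] = idx[hi], idx[p]
        pivot_value = values[idx[hi]]
        store = lo
        for i in range(lo, hi):
            if values[idx[i]] < pivot_value:
                idx[i], idx[store] = idx[store], idx[i]
                store += 1
        idx[store], idx[hi] = idx[hi], idx[store]
        if store == pivot:
            break
        if store < pivot:
            lo = store + 1
        else:
            hi = store - 1
    return idx


def top_k_desc(values, k):
    """Indices of the k largest values, ordered from largest to smallest."""
    n = len(values)
    k = min(k, n)
    if k == 0:
        return []
    # After partitioning at n - k, the last k indices hold the k largest values.
    idx = partition_nth_indices(values, n - k)
    top = idx[n - k:]
    # Only these k indices are sorted, not the whole input.
    top.sort(key=lambda i: values[i], reverse=True)
    return top


sepal_length = [5.1, 7.9, 4.3, 7.7, 6.4, 7.7, 5.0]
rows = top_k_desc(sepal_length, 3)
print([sepal_length[i] for i in rows])
```

The partition step is linear on average, so the expected cost is roughly O(n + k log k) rather than the O(n log n) of a full sort. Whether that matters in practice for the dplyr query is exactly what the benchmark suggested above would need to show.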
Issue Links
- depends upon: ARROW-13973 [C++] Add a SelectKSinkNode (Resolved)