[ARROW-12763] [R] Optimize dplyr queries that use head/tail after arrange - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 6.0.0
Component/s: R
Labels:
- pull-request-available
- query-engine

External issue URL:
https://github.com/apache/arrow/issues/28505

Description

Use the Arrow C++ function partition_nth_indices to optimize dplyr queries like this:

library(arrow)
library(dplyr)

iris %>%
  Table$create() %>%
  arrange(desc(Sepal.Length)) %>%
  head(10) %>%
  collect()

This query sorts the full table even though it doesn't need to. It could use partition_nth_indices to find the rows containing the top 10 values of Sepal.Length and only collect and sort those 10 rows.
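The idea of selecting the top rows first and sorting only those can be sketched in base R, without Arrow. This is an illustration only and not the implementation that landed for this issue: the helper name `top_k_desc` is hypothetical, base R's partial sort (`sort(x, partial = ...)`) stands in for partition_nth_indices, and the column is assumed to be numeric with no missing values.

```r
# Hypothetical helper: return the k rows with the largest values of `col`,
# sorted in descending order, without fully sorting the whole column.
# Assumes df[[col]] is numeric and contains no NA values.
top_k_desc <- function(df, col, k) {
  stopifnot(k >= 1)
  x <- df[[col]]
  n <- length(x)
  if (k >= n) {
    # Nothing to gain from partitioning: fall back to a full sort.
    return(df[order(x, decreasing = TRUE), , drop = FALSE])
  }
  # Partial sort: only position n - k + 1 is guaranteed to be in its final
  # sorted place, so this yields the k-th largest value as a threshold.
  threshold <- sort(x, partial = n - k + 1)[n - k + 1]
  # Keep every row at or above the threshold (ties may give more than k rows).
  candidates <- df[x >= threshold, , drop = FALSE]
  # Fully sort only the small candidate set, then trim to k rows.
  ordered <- candidates[order(candidates[[col]], decreasing = TRUE), , drop = FALSE]
  head(ordered, k)
}

# Compare against the full-sort approach used by the query above.
full <- head(iris[order(iris$Sepal.Length, decreasing = TRUE), ], 10)
fast <- top_k_desc(iris, "Sepal.Length", 10)
stopifnot(identical(fast$Sepal.Length, full$Sepal.Length))
```

On a 150-row table such as iris the difference is negligible; any benefit would only show up on much larger inputs, which is what the benchmarking note below asks to verify.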

Test to see if this improves performance in practice on larger data.

Issue Links

depends upon: ARROW-13973 [C++] Add a SelectKSinkNode (Resolved)
links to: GitHub Pull Request #11405

People

Assignee: Neal Richardson
Reporter: Ian Cook
Votes: 0
Watchers: 4

Dates

Created: 12/May/21 20:32
Updated: 11/Jan/23 08:28
Resolved: 15/Oct/21 19:45

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 1h 40m