Apache Arrow / ARROW-12763

[R] Optimize dplyr queries that use head/tail after arrange

Details

    Description

      Use the Arrow C++ function partition_nth_indices to optimize dplyr queries like this:

      library(arrow)
      library(dplyr)

      iris %>%
        Table$create() %>%
        arrange(desc(Sepal.Length)) %>%
        head(10) %>%
        collect()
      

      This query sorts the full table even though only the first 10 rows of the result are needed. It could instead use partition_nth_indices to find the rows containing the 10 largest values of Sepal.Length, then collect and sort only those 10 rows.

      Test to see if this improves performance in practice on larger data.
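
The partition-then-sort idea described above can be sketched in base R. This is only an illustration of the technique, not the Arrow implementation: the helper name top_n_desc is hypothetical, base R's sort(x, partial = ...) stands in for partition_nth_indices, and the sketch assumes a numeric column with no missing values.

```r
# Base R stand-in for the partition-then-sort idea; no Arrow calls are used.
# `top_n_desc` is a hypothetical helper name, not part of arrow or dplyr.
top_n_desc <- function(df, column, n) {
  x <- df[[column]]
  total <- length(x)
  stopifnot(n >= 1, !anyNA(x))  # assumptions of this sketch
  if (n >= total) {
    # Nothing to prune: fall back to a full sort.
    return(df[order(x, decreasing = TRUE), , drop = FALSE])
  }
  # Partial sort: only position `pivot` is guaranteed to hold its final value,
  # which here is the n-th largest value of the column.
  pivot <- total - n + 1
  threshold <- sort(x, partial = pivot)[pivot]
  # Rows at or above the threshold; ties can make this more than n rows.
  candidates <- which(x >= threshold)
  # Fully sort only the candidates, then keep the first n.
  keep <- candidates[order(x[candidates], decreasing = TRUE)][seq_len(n)]
  df[keep, , drop = FALSE]
}

top10 <- top_n_desc(iris, "Sepal.Length", 10)
```

Only the candidate rows are fully sorted, which is where any saving over sorting the whole table would come from. Whether that saving is noticeable is what the benchmark on larger data would need to show.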

            People

              npr Neal Richardson
              icook Ian Cook
              Votes: 0
              Watchers: 3

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 40m