Details
- New Feature
- Status: Resolved
- Minor
- Resolution: Fixed
- None
Description
Would you consider a PR that adds a `between` method for `arrow_dplyr_query` objects? Even an implementation written directly in R would benefit from Arrow's speed. Here is what I am thinking:
Typical usage of `between`:
    library(dplyr)
    library(arrow)

    iris %>%
      filter(between(Petal.Length, 1, 1.1))
Here is a mocked up version of the method:
    between_mock <- function(x, left, right) {
      if (length(left) != 1) {
        rlang::abort("`left` must be length 1")
      }
      if (length(right) != 1) {
        rlang::abort("`right` must be length 1")
      }
      x >= left & x <= right
    }
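To make the intended semantics concrete, here is a minimal base R sketch of the same idea applied to a plain numeric vector. It assumes the bounds are inclusive on both sides, as in the mock. The name `between_sketch` and the sample values are hypothetical and only for illustration, and `stopifnot()` stands in for `rlang::abort()` so that the snippet needs no packages.

```r
# Hypothetical stand-in for the mock: same comparison logic, base R only.
between_sketch <- function(x, left, right) {
  # Both bounds must be scalars, as in the mock.
  stopifnot(length(left) == 1, length(right) == 1)
  # Vectorized, inclusive on both ends; NA inputs yield NA.
  x >= left & x <= right
}

# Illustrative values (not taken from any dataset).
petal_length <- c(0.9, 1.0, 1.05, 1.1, 1.2, NA)
in_range <- between_sketch(petal_length, 1, 1.1)
print(in_range)
# FALSE TRUE TRUE TRUE FALSE NA
```

Because the body uses only `>=`, `<=`, and `&`, it never needs the vector's contents in R memory up front, which is presumably why the same expression can be handed to a dataset filter.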
I think `between` doesn't work out of the box because `dplyr` implements it in C++ for efficiency:
    open_dataset("nyc-taxi", partitioning = "year") %>%
      filter(year == 2014) %>%
      select(year, fare_amount) %>%
      filter(between(fare_amount, 10, 11)) %>%
      collect()

    Error: Filter expression not supported for Arrow Datasets: between(fare_amount, 10, 11)
    Call collect() first to pull data into R.
    In addition: Warning message:
    between() called on numeric vector with S3 class
    Backtrace:
        x
     1. +-[ `%>%`(...) ]
     2. +-[ dplyr::collect(...) ]
     3. +-[ dplyr::filter(...) ]
     4. \-arrow:::filter.arrow_dplyr_query(...)
But even my simple implementation works fine:
    open_dataset("nyc-taxi", partitioning = "year") %>%
      filter(year == 2014) %>%
      select(year, fare_amount) %>%
      filter(between_mock(fare_amount, 10, 11)) %>%
      collect()