Details
-
New Feature
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
None
Description
DuckDB can read data from an Arrow C-interface stream. Once we can provide that struct from R, presumably DuckDB could query on that stream.
A first step is just connecting the pieces. A second step would be to handle parts of the DuckDB query and push down filtering/projection to Arrow.
We need a function something like this:
#' Run a DuckDB query on Arrow data #' #' @param .data An `arrow` data object: `Dataset`, `Table`, `RecordBatch`, or #' an `arrow_dplyr_query` containing filter/mutate/etc. expressions #' @return A `duckdb::duckdb_connection` to_duckdb <- function(.data) { # ARROW-12687: [C++][Python][Dataset] Convert Scanner into a RecordBatchReader reader <- Scanner$create(.data)$ToRecordBatchReader() # ARROW-12689: [R] Implement ArrowArrayStream C interface stream_ptr <- allocate_arrow_array_stream() on.exit(delete_arrow_array_stream(stream_ptr)) ExportRecordBatchReader(x, stream_ptr) # TODO: DuckDB method to create table/connection from ArrowArrayStream ptr duckdb::duck_connection_from_arrow_stream(stream_ptr) }
Assuming this existed, we could do something like (a variation of https://arrow.apache.org/docs/r/articles/dataset.html):
ds <- open_dataset("nyc-taxi", partitioning = c("year", "month")) ds %>% filter(total_amount > 100, year == 2015) %>% select(tip_amount, total_amount, passenger_count) %>% mutate(tip_pct = 100 * tip_amount / total_amount) %>% to_duckdb() %>% group_by(passenger_count) %>% summarise( median_tip_pct = median(tip_pct), n = n() )
and duckdb would do the aggregation while the data reading, predicate pushdown, filtering, and projection would happen in Arrow. Or you could do dbGetQuery(ds, "SOME SQL") and that would evaluate on arrow data.
Attachments
Issue Links
- depends upon
-
ARROW-12689 [R] Implement ArrowArrayStream C interface
- Resolved
-
ARROW-12940 [R] Expose C interface as R6 methods
- Resolved
-
ARROW-12687 [C++][Python][Dataset] Support C Data Interface with Scanner
- Resolved
- is related to
-
ARROW-13472 [R] Remove .engine = "duckdb" argument
- Resolved
- relates to
-
ARROW-13465 [R] to_arrow() from duckdb
- Resolved
- links to