ARROW-12688: [R] Use DuckDB to query an Arrow Dataset


    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.0.0
    • Component/s: C++, R

      Description

      DuckDB can read data from an Arrow C-interface stream. Once we can produce that struct from R, DuckDB should presumably be able to query that stream directly.

      A first step is just connecting the pieces. A second step would be to push the filtering/projection parts of a DuckDB query down to Arrow.

      We need a function something like this:

      #' Run a DuckDB query on Arrow data
      #'
      #' @param .data An `arrow` data object: `Dataset`, `Table`, `RecordBatch`, or 
      #' an `arrow_dplyr_query` containing filter/mutate/etc. expressions
      #' @return A `duckdb::duckdb_connection`
      to_duckdb <- function(.data) {
        # ARROW-12687: [C++][Python][Dataset] Convert Scanner into a RecordBatchReader 
        reader <- Scanner$create(.data)$ToRecordBatchReader()
      
        # ARROW-12689: [R] Implement ArrowArrayStream C interface
        stream_ptr <- allocate_arrow_array_stream()
        on.exit(delete_arrow_array_stream(stream_ptr))
        ExportRecordBatchReader(reader, stream_ptr)
      
        # TODO: DuckDB method to create table/connection from ArrowArrayStream ptr
        duckdb::duck_connection_from_arrow_stream(stream_ptr)
      }
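
      For that last TODO, one way to connect the pieces without a dedicated stream-pointer constructor would be to register the Arrow object as a virtual table on a DuckDB connection. A minimal sketch, assuming the duckdb R package's duckdb_register_arrow() can accept the Arrow object (Dataset, Table, or RecordBatchReader) directly; the helper name and table name here are illustrative, not a committed API:

      to_duckdb_via_registration <- function(.data, table_name = "arrow_data") {
        con <- DBI::dbConnect(duckdb::duckdb())
        # Register the Dataset/Table/RecordBatchReader as a virtual table;
        # DuckDB scans it lazily, so nothing is copied up front.
        duckdb::duckdb_register_arrow(con, table_name, .data)
        con
      }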
      

      Assuming to_duckdb() existed, we could do something like the following (a variation of the example in https://arrow.apache.org/docs/r/articles/dataset.html):

      ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
      ds %>%
        filter(total_amount > 100, year == 2015) %>%
        select(tip_amount, total_amount, passenger_count) %>%
        mutate(tip_pct = 100 * tip_amount / total_amount) %>%
        to_duckdb() %>%
        group_by(passenger_count) %>%
        summarise(
          median_tip_pct = median(tip_pct),
          n = n()
        )
      

      and DuckDB would do the aggregation, while the data reading, predicate pushdown, filtering, and projection would happen in Arrow. Alternatively, you could call dbGetQuery(ds, "SOME SQL") and have that evaluate against the Arrow data.
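
      For the SQL route, a rough sketch of what that could look like, assuming the same registration mechanism as above and an illustrative table name "nyc_taxi":

      con <- DBI::dbConnect(duckdb::duckdb())
      duckdb::duckdb_register_arrow(con, "nyc_taxi", ds)
      DBI::dbGetQuery(con, "
        SELECT passenger_count,
               median(100 * tip_amount / total_amount) AS median_tip_pct,
               count(*) AS n
        FROM nyc_taxi
        WHERE total_amount > 100 AND year = 2015
        GROUP BY passenger_count
      ")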


    People

    • Assignee: Jonathan Keane (jonkeane)
    • Reporter: Neal Richardson (npr)


    Time Tracking

    • Remaining Estimate: 0h
    • Time Spent: 5h 50m
