Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-12688

[R] Use DuckDB to query an Arrow Dataset

    XMLWordPrintableJSON

Details

    • New Feature
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 6.0.0
    • R

    Description

      DuckDB can read data from an Arrow C-interface stream. Once we can provide that struct from R, presumably DuckDB could query on that stream.

      A first step is just connecting the pieces. A second step would be to handle parts of the DuckDB query and push down filtering/projection to Arrow.

      We need a function something like this:

      #' Run a DuckDB query on Arrow data
      #'
      #' @param .data An `arrow` data object: `Dataset`, `Table`, `RecordBatch`, or 
      #' an `arrow_dplyr_query` containing filter/mutate/etc. expressions
      #' @return A `duckdb::duckdb_connection`
      to_duckdb <- function(.data) {
        # ARROW-12687: [C++][Python][Dataset] Convert Scanner into a RecordBatchReader 
        reader <- Scanner$create(.data)$ToRecordBatchReader()
      
        # ARROW-12689: [R] Implement ArrowArrayStream C interface
        stream_ptr <- allocate_arrow_array_stream()
        on.exit(delete_arrow_array_stream(stream_ptr))
        ExportRecordBatchReader(x, stream_ptr)
      
        # TODO: DuckDB method to create table/connection from ArrowArrayStream ptr
        duckdb::duck_connection_from_arrow_stream(stream_ptr)
      }
      

      Assuming this existed, we could do something like (a variation of https://arrow.apache.org/docs/r/articles/dataset.html):

      ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
      ds %>%
        filter(total_amount > 100, year == 2015) %>%
        select(tip_amount, total_amount, passenger_count) %>%
        mutate(tip_pct = 100 * tip_amount / total_amount) %>%
        to_duckdb() %>%
        group_by(passenger_count) %>%
        summarise(
          median_tip_pct = median(tip_pct),
          n = n()
        )
      

      and duckdb would do the aggregation while the data reading, predicate pushdown, filtering, and projection would happen in Arrow. Or you could do dbGetQuery(ds, "SOME SQL") and that would evaluate on arrow data.

      Attachments

        Issue Links

          Activity

            People

              jonkeane Jonathan Keane
              npr Neal Richardson
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 5h 50m
                  5h 50m