ARROW-12688: [R] Use DuckDB to query an Arrow Dataset


    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.0.0
    • Component/s: C++, R

      Description

      DuckDB can read data from an Arrow C-interface stream. Once we can produce that struct from R, DuckDB should presumably be able to query that stream directly.

      A first step is just connecting the pieces. A second step would be to push the filtering/projection parts of a DuckDB query down to Arrow.

      We need a function something like this:

      #' Run a DuckDB query on Arrow data
      #'
      #' @param .data An `arrow` data object: `Dataset`, `Table`, `RecordBatch`, or 
      #' an `arrow_dplyr_query` containing filter/mutate/etc. expressions
      #' @return A `duckdb::duckdb_connection`
      to_duckdb <- function(.data) {
        # ARROW-12687: [C++][Python][Dataset] Convert Scanner into a RecordBatchReader 
        reader <- Scanner$create(.data)$ToRecordBatchReader()
      
        # ARROW-12689: [R] Implement ArrowArrayStream C interface
        stream_ptr <- allocate_arrow_array_stream()
        on.exit(delete_arrow_array_stream(stream_ptr))
        ExportRecordBatchReader(reader, stream_ptr)
      
        # TODO: DuckDB method to create table/connection from ArrowArrayStream ptr
        duckdb::duck_connection_from_arrow_stream(stream_ptr)
      }
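
      For that last TODO, one way to connect the pieces without a dedicated stream-pointer constructor would be to register the Arrow object as a virtual table on a DuckDB connection. A minimal sketch, assuming the duckdb R package's duckdb_register_arrow() can accept the Arrow object (Dataset, Table, or RecordBatchReader) directly; the helper name and table name here are illustrative, not a committed API:

      to_duckdb_via_registration <- function(.data, table_name = "arrow_data") {
        con <- DBI::dbConnect(duckdb::duckdb())
        # Register the Dataset/Table/RecordBatchReader as a virtual table;
        # DuckDB scans it lazily, so nothing is copied up front.
        duckdb::duckdb_register_arrow(con, table_name, .data)
        con
      }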
      

      Assuming to_duckdb() existed, we could do something like the following (a variation of the example in https://arrow.apache.org/docs/r/articles/dataset.html):

      ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
      ds %>%
        filter(total_amount > 100, year == 2015) %>%
        select(tip_amount, total_amount, passenger_count) %>%
        mutate(tip_pct = 100 * tip_amount / total_amount) %>%
        to_duckdb() %>%
        group_by(passenger_count) %>%
        summarise(
          median_tip_pct = median(tip_pct),
          n = n()
        )
      

      and DuckDB would do the aggregation, while the data reading, predicate pushdown, filtering, and projection would happen in Arrow. Alternatively, you could call dbGetQuery(ds, "SOME SQL") and have that evaluate against the Arrow data.
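
      For the SQL route, a rough sketch of what that could look like, assuming the same registration mechanism as above and an illustrative table name "nyc_taxi":

      con <- DBI::dbConnect(duckdb::duckdb())
      duckdb::duckdb_register_arrow(con, "nyc_taxi", ds)
      DBI::dbGetQuery(con, "
        SELECT passenger_count,
               median(100 * tip_amount / total_amount) AS median_tip_pct,
               count(*) AS n
        FROM nyc_taxi
        WHERE total_amount > 100 AND year = 2015
        GROUP BY passenger_count
      ")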


    People

    • Assignee: Jonathan Keane (jonkeane)
    • Reporter: Neal Richardson (npr)


    Time Tracking

    • Remaining Estimate: 0h
    • Time Spent: 5h 50m
