  1. Apache Arrow
  2. ARROW-8813

[R] Implementing tidyr interface


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: R

    Description

      I think it would be reasonable to implement an interface to the tidyr package. Such an implementation would allow Arrow Tables to be processed lazily before they are pulled into memory. Currently, however, you need to collect the table before applying any tidyr methods. The following code chunk shows an example routine:

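      library(magrittr)  # provides the %>% pipe
      # Read the file as an Arrow Table instead of a data frame
      arrow_table <- arrow::read_feather("table.feather", as_data_frame = FALSE)
      nested_df <-
         arrow_table %>%
         dplyr::select(ID, 4:7, Value) %>%
         dplyr::filter(Value >= 5) %>%
         dplyr::group_by(ID) %>%
         # The table has to be pulled into memory here ...
         dplyr::collect() %>%
         # ... because tidyr has no method for Arrow Tables
         tidyr::nest()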

      The main focus might be the following three methods:

      • tidyr::[un]nest(),
      • tidyr::pivot_[longer|wider](), and
      • tidyr::separate().

      I suppose the last two can be implemented fairly quickly, but tidyr::nest() and tidyr::unnest() cannot be implemented until conversion to List<Struct> is available.
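
      For comparison, the current workaround for the last two methods might look like the sketch below. It is only an illustration: the small in-memory table and its column names (ID, x, y) are made up for this example, and the tidyr calls still run on an ordinary data frame after dplyr::collect(), not on the Arrow Table itself.

```r
library(magrittr)

# Hypothetical stand-in data; the issue itself reads "table.feather" instead.
df <- data.frame(
  ID = c("a_1", "a_2", "b_1"),
  x  = c(1, 2, 3),
  y  = c(4, 5, 6),
  stringsAsFactors = FALSE
)
arrow_table <- arrow::Table$create(df)

long_df <-
  arrow_table %>%
  # The filter is recorded against the Arrow Table
  dplyr::filter(x >= 2) %>%
  # Everything after collect() operates on an in-memory data frame
  dplyr::collect() %>%
  tidyr::pivot_longer(cols = c(x, y), names_to = "name", values_to = "value") %>%
  tidyr::separate(ID, into = c("group", "index"), sep = "_")
```

      A tidyr interface as proposed here would let the pivot_longer() and separate() steps sit before the collect() call.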


          People

            Unassigned
            domiden Dominic Dennenmoser
            Votes: 1
            Watchers: 5
