Apache Arrow / ARROW-16776

[R] dplyr::glimpse method for arrow table and datasets

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 9.0.0
    • Component/s: R

    Description

      When working with Arrow datasets/tables, I often find myself wanting to interactively print or "see" the results of a query or the first few rows of the data without having to fully collect into memory.

      I can perform exploratory data analysis on large out-of-memory datasets via Arrow + dplyr, but to print the returned values I have to either collect() them into memory or send them to DuckDB with to_duckdb().

      • compute() - returns number of rows/columns, but no data
      • collect() - returns data fully into memory, can be combined with head()
      • to_duckdb() - keeps data out of memory, prints the first 10 rows and all columns by default; the number of printed rows can be increased or decreased
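
      The three options above can be compared side by side. The following is a minimal sketch, assuming the same mtcars Parquet setup as the reprex in this issue; the temporary file path and the variable names (query, computed, first_rows, duck_tbl) are illustrative only, and the DuckDB step runs only if the duckdb and dbplyr packages are installed.

      ``` r
      library(dplyr)
      library(arrow)

      # Write a small example file and open it lazily as a Dataset
      tmp <- tempfile(fileext = ".parquet")
      arrow::write_parquet(mtcars, tmp)
      car_ds <- arrow::open_dataset(tmp)

      # A lazy query: nothing is read into R yet
      query <- car_ds %>% filter(cyl == 6)

      # compute(): evaluates the query and returns an Arrow Table;
      # printing it shows dimensions and schema rather than the values
      computed <- query %>% compute()
      print(computed)

      # head() + collect(): only the requested rows become an R data frame
      first_rows <- query %>% head(5) %>% collect()
      print(first_rows)

      # to_duckdb(): hands the query to DuckDB; printing the lazy table
      # shows its first rows (optional dependency, so guarded here)
      if (requireNamespace("duckdb", quietly = TRUE) &&
          requireNamespace("dbplyr", quietly = TRUE)) {
        duck_tbl <- query %>% to_duckdb()
        print(duck_tbl)
      }
      ```

      Only the head() + collect() route shows values without leaving Arrow, and it costs one extra materialization step for every preview.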

      While to_duckdb() gives me the ability to do true EDA, it seems counterintuitive to have to send the Arrow table over to a DuckDB database just to see the glimpse()/head() equivalent.

      My feature request is for a dplyr::glimpse() method that lazily prints the first few values of each column of a table/dataset. The expected output would be something like the below.

      ``` r
      library(dplyr)
      library(arrow)

      mtcars %>% arrow::write_parquet("mtcars.parquet")
      car_ds <- arrow::open_dataset("mtcars.parquet")

      car_ds %>%
        glimpse()
      #> Rows: ??
      #> Columns: 11
      #> $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, …
      #> $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, …
      #> $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 36…
      #> $ hp   <dbl> 110, 110, 93, 110, 175, 105, 2…
      #> $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, …
      #> $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.…
      #> $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17…
      #> $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, …
      #> $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, …
      #> $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, …
      #> $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, …
      ```
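
      Until such a method exists, a similar preview can be approximated by limiting the rows before collecting. The sketch below is a workaround under stated assumptions, not the implementation requested here: peek_arrow() is a hypothetical helper name that belongs to neither arrow nor dplyr, and because it glimpses a collected preview, the "Rows:" line reports the preview size rather than "??" or the true dataset size.

      ``` r
      library(dplyr)
      library(arrow)

      # Hypothetical helper (not part of arrow or dplyr):
      # glimpse only the first `n` rows of an Arrow dataset, table, or query
      peek_arrow <- function(.data, n = 10) {
        preview <- .data %>%
          head(n) %>%   # limit the rows while the query is still lazy
          collect()     # materialize only those rows in R
        glimpse(preview)
        invisible(preview)  # return the preview so it can be reused
      }

      # Same setup as the reprex in this issue, using a temporary file
      tmp <- tempfile(fileext = ".parquet")
      arrow::write_parquet(mtcars, tmp)
      car_ds <- arrow::open_dataset(tmp)

      preview <- car_ds %>%
        filter(cyl == 6) %>%
        peek_arrow(n = 5)
      ```

      Returning the preview invisibly keeps the helper usable at the end of a pipe while still allowing the collected rows to be inspected further.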

      Currently, glimpse() prints a str()-style dump of the object's internal structure, so most of the output has nothing to do with the actual data/values.

      ``` r
      library(dplyr)
      library(arrow)

      mtcars %>% arrow::write_parquet("mtcars.parquet")
      car_ds <- arrow::open_dataset("mtcars.parquet")

      car_ds %>% 
        glimpse()
      #> Classes 'FileSystemDataset', 'Dataset', 'ArrowObject', 'R6' <FileSystemDataset>
      #>   Inherits from: <Dataset>
      #>   Public:
      #>     .:xp:.: externalptr
      #>     .class_title: function () 
      #>     clone: function (deep = FALSE) 
      #>     files: active binding
      #>     filesystem: active binding
      #>     format: active binding
      #>     initialize: function (xp) 
      #>     invalidate: function () 
      #>     metadata: active binding
      #>     NewScan: function () 
      #>     num_cols: active binding
      #>     num_rows: active binding
      #>     pointer: function () 
      #>     print: function (...) 
      #>     schema: active binding
      #>     set_pointer: function (xp) 
      #>     ToString: function () 
      #>     type: active binding

      car_ds %>%
        filter(cyl == 6) %>%
        glimpse()
      #> List of 7
      #>  $ mpg :Classes 'FileSystemDataset', 'Dataset', 'ArrowObject', 'R6' <FileSystemDataset>
      #>   Inherits from: <Dataset>
      #>   Public:
      #>     .:xp:.: externalptr
      #>     .class_title: function () 
      #>     clone: function (deep = FALSE) 
      #>     files: active binding
      #>     filesystem: active binding
      #>     format: active binding
      #>     initialize: function (xp) 
      #>     invalidate: function () 
      #>     metadata: active binding
      #>     NewScan: function () 
      #>     num_cols: active binding
      #>     num_rows: active binding
      #>     pointer: function () 
      #>     print: function (...) 
      #>     schema: active binding
      #>     set_pointer: function (xp) 
      #>     ToString: function () 
      #>     type: active binding 
      #>  $ cyl :List of 11
      #>   ..$ mpg :Classes 'Expression', 'ArrowObject', 'R6' <Expression>
      #>   Inherits from: <ArrowObject>
      #>   Public:
      #>     .:xp:.: externalptr
      #>     cast: function (to_type, safe = TRUE, ...) 
      #>     clone: function (deep = FALSE) 
      #>     Equals: function (other, ...) 
      #>     field_name: active binding
      #>     initialize: function (xp) 
      #>     invalidate: function () 
      #>     pointer: function () 
      #>     print: function (...) 
      #>     schema: Schema, ArrowObject, R6
      #>     set_pointer: function (xp) 
      #>     ToString: function () 
      #>     type: function (schema = self$schema) 
      #>     type_id: function (schema = self$schema)  
      #>   ..$ cyl :Classes 'Expression', 'ArrowObject', 'R6' <Expression>
      #>   Inherits from: <ArrowObject>
      #>   Public:
      #>     .:xp:.: externalptr
      #>     cast: function (to_type, safe = TRUE, ...) 
      #>     clone: function (deep = FALSE) 
      #>     Equals: function (other, ...) 
      #>     field_name: active binding
      #>     initialize: function (xp) 
      #>     invalidate: function () 
      #>     pointer: function () 
      #>     print: function (...) 
      #>     schema: Schema, ArrowObject, R6
      #>     set_pointer: function (xp) 
      #>     ToString: function () 
      #>     type: function (schema = self$schema) 
      #>     type_id: function (schema = self$schema)  
      #>   ..$ disp:Classes 'Expression', 'ArrowObject', 'R6' <Expression>
      #>   Inherits from: <ArrowObject>
      #>   Public:
      #>     .:xp:.: externalptr
      #>     cast: function (to_type, safe = TRUE, ...) 
      #>     clone: function (deep = FALSE) 
      #>     Equals: function (other, ...) 
      #>     field_name: active binding
      #>     initialize: function (xp) 
      #>     invalidate: function () 
      #>     pointer: function () 
      #>     print: function (...) 
      #>     schema: Schema, ArrowObject, R6
      #>     set_pointer: function (xp) 
      #>     ToString: function () 
      #>     type: function (schema = self$schema) 
      #>     type_id: function (schema = self$schema)  
      #>   ..$ hp  :Classes 'Expression', 'ArrowObject', 'R6' <Expression>
      #>   Inherits from: <ArrowObject>
      #>   Public:
      #>     .:xp:.: externalptr
      #>     cast: function (to_type, safe = TRUE, ...) 
      #>     clone: function (deep = FALSE) 
      #>     Equals: function (other, ...) 
      #>     field_name: active binding
      #>     initialize: function (xp) 
      #>     invalidate: function () 
      #>     pointer: function () 
      #>     print: function (...) 
      #>     schema: Schema, ArrowObject, R6
      #>     set_pointer: function (xp) 
      #>     ToString: function () 
      #>     type: function (schema = self$schema) 
      #>     type_id: function (schema = self$schema)  
      #>   ..$ drat:Classes 'Expression', 'ArrowObject', 'R6' <Expression>
      #>   Inherits from: <ArrowObject>
      #>   Public:
      #>     .:xp:.: externalptr
      #>     cast: function (to_type, safe = TRUE, ...) 
      #>     clone: function (deep = FALSE) 
      #>     Equals: function (other, ...) 
      #>     field_name: active binding
      #>     initialize: function (xp) 
      #>     invalidate: function () 
      #>     pointer: function () 
      #>     print: function (...) 
      #>     schema: Schema, ArrowObject, R6
      #>     set_pointer: function (xp) 
      #>     ToString: function () 
      #>     type: function (schema = self$schema) 
      #>     type_id: function (schema = self$schema)  
      #>   ..$ wt  :Classes 'Expression', 'ArrowObject', 'R6' <Expression>
      #>   Inherits from: <ArrowObject>
      #>   Public:
      #>     .:xp:.: externalptr
      #>     cast: function (to_type, safe = TRUE, ...) 
      #>     clone: function (deep = FALSE) 
      #>     Equals: function (other, ...) 
      #>     field_name: active binding
      #>     initialize: function (xp) 
      #>     invalidate: function () 
      #>     pointer: function () 
      #>     print: function (...) 
      #>     schema: Schema, ArrowObject, R6
      #>     set_pointer: function (xp) 
      #>     ToString: function () 
      #>     type: function (schema = self$schema) 
      #>     type_id: function (schema = self$schema)  
      #>   ..$ qsec:Classes 'Expression', 'ArrowObject', 'R6' <Expression>
      #>   Inherits from: <ArrowObject>
      #>   Public:
      #>     .:xp:.: externalptr
      #>     cast: function (to_type, safe = TRUE, ...) 
      #>     clone: function (deep = FALSE) 
      #>     Equals: function (other, ...) 
      #>     field_name: active binding
      #>     initialize: function (xp) 
      #>     invalidate: function () 
      #>     pointer: function () 
      #>     print: function (...) 
      #>     schema: Schema, ArrowObject, R6
      #>     set_pointer: function (xp) 
      #>     ToString: function () 
      #>     type: function (schema = self$schema) 
      #>     type_id: function (schema = self$schema)  
      #>   ..$ vs  :Classes 'Expression', 'ArrowObject', 'R6' <Expression>
      #>   Inherits from: <ArrowObject>
      #>   Public:
      #>     .:xp:.: externalptr
      #>     cast: function (to_type, safe = TRUE, ...) 
      #>     clone: function (deep = FALSE) 
      #>     Equals: function (other, ...) 
      #>     field_name: active binding
      #>     initialize: function (xp) 
      #>     invalidate: function () 
      #>     pointer: function () 
      #>     print: function (...) 
      #>     schema: Schema, ArrowObject, R6
      #>     set_pointer: function (xp) 
      #>     ToString: function () 
      #>     type: function (schema = self$schema) 
      #>     type_id: function (schema = self$schema)  
      #>   ..$ am  :Classes 'Expression', 'ArrowObject', 'R6' <Expression>
      #>   Inherits from: <ArrowObject>
      #>   Public:
      #>     .:xp:.: externalptr
      #>     cast: function (to_type, safe = TRUE, ...) 
      #>     clone: function (deep = FALSE) 
      #>     Equals: function (other, ...) 
      #>     field_name: active binding
      #>     initialize: function (xp) 
      #>     invalidate: function () 
      #>     pointer: function () 
      #>     print: function (...) 
      #>     schema: Schema, ArrowObject, R6
      #>     set_pointer: function (xp) 
      #>     ToString: function () 
      #>     type: function (schema = self$schema) 
      #>     type_id: function (schema = self$schema)  
      #>   ..$ gear:Classes 'Expression', 'ArrowObject', 'R6' <Expression>
      #>   Inherits from: <ArrowObject>
      #>   Public:
      #>     .:xp:.: externalptr
      #>     cast: function (to_type, safe = TRUE, ...) 
      #>     clone: function (deep = FALSE) 
      #>     Equals: function (other, ...) 
      #>     field_name: active binding
      #>     initialize: function (xp) 
      #>     invalidate: function () 
      #>     pointer: function () 
      #>     print: function (...) 
      #>     schema: Schema, ArrowObject, R6
      #>     set_pointer: function (xp) 
      #>     ToString: function () 
      #>     type: function (schema = self$schema) 
      #>     type_id: function (schema = self$schema)  
      #>   ..$ carb:Classes 'Expression', 'ArrowObject', 'R6' <Expression>
      #>   Inherits from: <ArrowObject>
      #>   Public:
      #>     .:xp:.: externalptr
      #>     cast: function (to_type, safe = TRUE, ...) 
      #>     clone: function (deep = FALSE) 
      #>     Equals: function (other, ...) 
      #>     field_name: active binding
      #>     initialize: function (xp) 
      #>     invalidate: function () 
      #>     pointer: function () 
      #>     print: function (...) 
      #>     schema: Schema, ArrowObject, R6
      #>     set_pointer: function (xp) 
      #>     ToString: function () 
      #>     type: function (schema = self$schema) 
      #>     type_id: function (schema = self$schema)  
      #>  $ disp:Classes 'Expression', 'ArrowObject', 'R6' <Expression>
      #>   Inherits from: <ArrowObject>
      #>   Public:
      #>     .:xp:.: externalptr
      #>     cast: function (to_type, safe = TRUE, ...) 
      #>     clone: function (deep = FALSE) 
      #>     Equals: function (other, ...) 
      #>     field_name: active binding
      #>     initialize: function (xp) 
      #>     invalidate: function () 
      #>     pointer: function () 
      #>     print: function (...) 
      #>     schema: Schema, ArrowObject, R6
      #>     set_pointer: function (xp) 
      #>     ToString: function () 
      #>     type: function (schema = self$schema) 
      #>     type_id: function (schema = self$schema)  
      #>  $ hp  : chr(0) 
      #>  $ drat: NULL
      #>  $ wt  : list()
      #>  $ qsec: logi(0) 
      #>  - attr(*, "class")= chr "arrow_dplyr_query"
      ```

      <sup>Created on 2022-06-07 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>

            People

              Assignee: Neal Richardson (npr)
              Reporter: Thomas Mock (jthomasmock)
              Votes: 0
              Watchers: 3

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 2.5h