Details
- Type: Improvement
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
Description
When working with Arrow datasets/tables, I often want to interactively print or "see" the results of a query, or the first few rows of the data, without having to collect everything into memory.
I can perform exploratory data analysis on large out-of-memory datasets via Arrow + dplyr, but in order to print the returned values I have to collect() them into memory or send them to_duckdb(). The current options are:
- compute() - returns number of rows/columns, but no data
- collect() - returns data fully into memory, can be combined with head()
- to_duckdb() - keeps data out of memory, always returns top 10 rows and all columns, optionally increase/decrease number of printed rows
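The three options above can be compared side by side. This is a minimal sketch, assuming the arrow and dplyr packages are installed; the temporary file path and the object names (`query`, `tbl`, `first_rows`) are illustrative, and the to_duckdb() step is guarded because it additionally needs the duckdb and dbplyr packages.

``` r
library(dplyr)
library(arrow)

# Write a small example dataset to a temporary Parquet file and open it lazily.
path <- tempfile(fileext = ".parquet")
arrow::write_parquet(mtcars, path)
car_ds <- arrow::open_dataset(path)

# A lazy query: nothing is evaluated yet.
query <- car_ds %>%
  filter(cyl == 6)

# compute(): evaluates the query and returns an Arrow Table.
# Its dimensions are available, but printing it does not show the values.
tbl <- query %>%
  compute()
dim(tbl)

# collect(): pulls the result into an R data frame; head() limits the rows first.
first_rows <- query %>%
  head(5) %>%
  collect()
first_rows

# to_duckdb(): hands the query to DuckDB, whose print method previews the rows.
if (requireNamespace("duckdb", quietly = TRUE) &&
    requireNamespace("dbplyr", quietly = TRUE)) {
  print(query %>% to_duckdb())
}
```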
While to_duckdb() gives me the ability to do true EDA, it seems counterintuitive to need to send the arrow table over to a duckdb database just to see the glimpse()/head() equivalent.
My feature request is a dplyr::glimpse() method for Arrow objects that lazily prints the first few values of each column of a table/dataset. The expected output would look something like this:
``` r
library(dplyr)
library(arrow)
mtcars %>% arrow::write_parquet("mtcars.parquet")
car_ds <- arrow::open_dataset("mtcars.parquet")
car_ds %>%
  glimpse()
#> Rows: ??
#> Columns: 11
#> $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, …
#> $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, …
#> $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 36…
#> $ hp   <dbl> 110, 110, 93, 110, 175, 105, 2…
#> $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, …
#> $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.…
#> $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17…
#> $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, …
#> $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, …
#> $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, …
#> $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, …
```
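Until such a method exists, a similar preview can be approximated by limiting the rows before collecting. The sketch below assumes arrow and dplyr are installed; `glimpse_head()` is a hypothetical helper written for illustration, not a function in either package. It only collects the first `n` rows, so the `Rows:` line reports the size of the preview rather than the `??` shown in the desired output.

``` r
library(dplyr)
library(arrow)

# Hypothetical helper (not part of arrow or dplyr): glimpse the first `n` rows
# of a Dataset, Table, or lazy arrow query without collecting everything.
glimpse_head <- function(x, n = 10, ...) {
  # head() limits the rows before collect() brings them into R.
  preview <- x %>%
    head(n) %>%
    collect()
  glimpse(preview, ...)
  # Return the original object invisibly so the helper can sit in a pipeline.
  invisible(x)
}

path <- tempfile(fileext = ".parquet")
arrow::write_parquet(mtcars, path)
car_ds <- arrow::open_dataset(path)

car_ds %>%
  filter(cyl == 6) %>%
  glimpse_head(n = 5)
```

Returning the input invisibly mirrors how glimpse() behaves on data frames, so the helper can be dropped into the middle of a longer pipeline.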
Currently, glimpse() prints the internal structure of the Arrow object as a list, and almost none of that output reflects the actual data values.
``` r
library(dplyr)
library(arrow)
mtcars %>% arrow::write_parquet("mtcars.parquet")
car_ds <- arrow::open_dataset("mtcars.parquet")
car_ds %>%
  glimpse()
#> Classes 'FileSystemDataset', 'Dataset', 'ArrowObject', 'R6' <FileSystemDataset>
#> Inherits from: <Dataset>
#> Public:
#> .:xp:.: externalptr
#> .class_title: function ()
#> clone: function (deep = FALSE)
#> files: active binding
#> filesystem: active binding
#> format: active binding
#> initialize: function (xp)
#> invalidate: function ()
#> metadata: active binding
#> NewScan: function ()
#> num_cols: active binding
#> num_rows: active binding
#> pointer: function ()
#> print: function (...)
#> schema: active binding
#> set_pointer: function (xp)
#> ToString: function ()
#> type: active binding
car_ds %>%
  filter(cyl == 6) %>%
  glimpse()
#> List of 7
#> $ mpg :Classes 'FileSystemDataset', 'Dataset', 'ArrowObject', 'R6' <FileSystemDataset>
#> Inherits from: <Dataset>
#> Public:
#> .:xp:.: externalptr
#> .class_title: function ()
#> clone: function (deep = FALSE)
#> files: active binding
#> filesystem: active binding
#> format: active binding
#> initialize: function (xp)
#> invalidate: function ()
#> metadata: active binding
#> NewScan: function ()
#> num_cols: active binding
#> num_rows: active binding
#> pointer: function ()
#> print: function (...)
#> schema: active binding
#> set_pointer: function (xp)
#> ToString: function ()
#> type: active binding
#> $ cyl :List of 11
#> ..$ mpg :Classes 'Expression', 'ArrowObject', 'R6' <Expression>
#> Inherits from: <ArrowObject>
#> Public:
#> .:xp:.: externalptr
#> cast: function (to_type, safe = TRUE, ...)
#> clone: function (deep = FALSE)
#> Equals: function (other, ...)
#> field_name: active binding
#> initialize: function (xp)
#> invalidate: function ()
#> pointer: function ()
#> print: function (...)
#> schema: Schema, ArrowObject, R6
#> set_pointer: function (xp)
#> ToString: function ()
#> type: function (schema = self$schema)
#> type_id: function (schema = self$schema)
#> ..$ cyl :Classes 'Expression', 'ArrowObject', 'R6' <Expression>
#> Inherits from: <ArrowObject>
#> Public:
#> .:xp:.: externalptr
#> cast: function (to_type, safe = TRUE, ...)
#> clone: function (deep = FALSE)
#> Equals: function (other, ...)
#> field_name: active binding
#> initialize: function (xp)
#> invalidate: function ()
#> pointer: function ()
#> print: function (...)
#> schema: Schema, ArrowObject, R6
#> set_pointer: function (xp)
#> ToString: function ()
#> type: function (schema = self$schema)
#> type_id: function (schema = self$schema)
#> ..$ disp:Classes 'Expression', 'ArrowObject', 'R6' <Expression>
#> Inherits from: <ArrowObject>
#> Public:
#> .:xp:.: externalptr
#> cast: function (to_type, safe = TRUE, ...)
#> clone: function (deep = FALSE)
#> Equals: function (other, ...)
#> field_name: active binding
#> initialize: function (xp)
#> invalidate: function ()
#> pointer: function ()
#> print: function (...)
#> schema: Schema, ArrowObject, R6
#> set_pointer: function (xp)
#> ToString: function ()
#> type: function (schema = self$schema)
#> type_id: function (schema = self$schema)
#> ..$ hp :Classes 'Expression', 'ArrowObject', 'R6' <Expression>
#> Inherits from: <ArrowObject>
#> Public:
#> .:xp:.: externalptr
#> cast: function (to_type, safe = TRUE, ...)
#> clone: function (deep = FALSE)
#> Equals: function (other, ...)
#> field_name: active binding
#> initialize: function (xp)
#> invalidate: function ()
#> pointer: function ()
#> print: function (...)
#> schema: Schema, ArrowObject, R6
#> set_pointer: function (xp)
#> ToString: function ()
#> type: function (schema = self$schema)
#> type_id: function (schema = self$schema)
#> ..$ drat:Classes 'Expression', 'ArrowObject', 'R6' <Expression>
#> Inherits from: <ArrowObject>
#> Public:
#> .:xp:.: externalptr
#> cast: function (to_type, safe = TRUE, ...)
#> clone: function (deep = FALSE)
#> Equals: function (other, ...)
#> field_name: active binding
#> initialize: function (xp)
#> invalidate: function ()
#> pointer: function ()
#> print: function (...)
#> schema: Schema, ArrowObject, R6
#> set_pointer: function (xp)
#> ToString: function ()
#> type: function (schema = self$schema)
#> type_id: function (schema = self$schema)
#> ..$ wt :Classes 'Expression', 'ArrowObject', 'R6' <Expression>
#> Inherits from: <ArrowObject>
#> Public:
#> .:xp:.: externalptr
#> cast: function (to_type, safe = TRUE, ...)
#> clone: function (deep = FALSE)
#> Equals: function (other, ...)
#> field_name: active binding
#> initialize: function (xp)
#> invalidate: function ()
#> pointer: function ()
#> print: function (...)
#> schema: Schema, ArrowObject, R6
#> set_pointer: function (xp)
#> ToString: function ()
#> type: function (schema = self$schema)
#> type_id: function (schema = self$schema)
#> ..$ qsec:Classes 'Expression', 'ArrowObject', 'R6' <Expression>
#> Inherits from: <ArrowObject>
#> Public:
#> .:xp:.: externalptr
#> cast: function (to_type, safe = TRUE, ...)
#> clone: function (deep = FALSE)
#> Equals: function (other, ...)
#> field_name: active binding
#> initialize: function (xp)
#> invalidate: function ()
#> pointer: function ()
#> print: function (...)
#> schema: Schema, ArrowObject, R6
#> set_pointer: function (xp)
#> ToString: function ()
#> type: function (schema = self$schema)
#> type_id: function (schema = self$schema)
#> ..$ vs :Classes 'Expression', 'ArrowObject', 'R6' <Expression>
#> Inherits from: <ArrowObject>
#> Public:
#> .:xp:.: externalptr
#> cast: function (to_type, safe = TRUE, ...)
#> clone: function (deep = FALSE)
#> Equals: function (other, ...)
#> field_name: active binding
#> initialize: function (xp)
#> invalidate: function ()
#> pointer: function ()
#> print: function (...)
#> schema: Schema, ArrowObject, R6
#> set_pointer: function (xp)
#> ToString: function ()
#> type: function (schema = self$schema)
#> type_id: function (schema = self$schema)
#> ..$ am :Classes 'Expression', 'ArrowObject', 'R6' <Expression>
#> Inherits from: <ArrowObject>
#> Public:
#> .:xp:.: externalptr
#> cast: function (to_type, safe = TRUE, ...)
#> clone: function (deep = FALSE)
#> Equals: function (other, ...)
#> field_name: active binding
#> initialize: function (xp)
#> invalidate: function ()
#> pointer: function ()
#> print: function (...)
#> schema: Schema, ArrowObject, R6
#> set_pointer: function (xp)
#> ToString: function ()
#> type: function (schema = self$schema)
#> type_id: function (schema = self$schema)
#> ..$ gear:Classes 'Expression', 'ArrowObject', 'R6' <Expression>
#> Inherits from: <ArrowObject>
#> Public:
#> .:xp:.: externalptr
#> cast: function (to_type, safe = TRUE, ...)
#> clone: function (deep = FALSE)
#> Equals: function (other, ...)
#> field_name: active binding
#> initialize: function (xp)
#> invalidate: function ()
#> pointer: function ()
#> print: function (...)
#> schema: Schema, ArrowObject, R6
#> set_pointer: function (xp)
#> ToString: function ()
#> type: function (schema = self$schema)
#> type_id: function (schema = self$schema)
#> ..$ carb:Classes 'Expression', 'ArrowObject', 'R6' <Expression>
#> Inherits from: <ArrowObject>
#> Public:
#> .:xp:.: externalptr
#> cast: function (to_type, safe = TRUE, ...)
#> clone: function (deep = FALSE)
#> Equals: function (other, ...)
#> field_name: active binding
#> initialize: function (xp)
#> invalidate: function ()
#> pointer: function ()
#> print: function (...)
#> schema: Schema, ArrowObject, R6
#> set_pointer: function (xp)
#> ToString: function ()
#> type: function (schema = self$schema)
#> type_id: function (schema = self$schema)
#> $ disp:Classes 'Expression', 'ArrowObject', 'R6' <Expression>
#> Inherits from: <ArrowObject>
#> Public:
#> .:xp:.: externalptr
#> cast: function (to_type, safe = TRUE, ...)
#> clone: function (deep = FALSE)
#> Equals: function (other, ...)
#> field_name: active binding
#> initialize: function (xp)
#> invalidate: function ()
#> pointer: function ()
#> print: function (...)
#> schema: Schema, ArrowObject, R6
#> set_pointer: function (xp)
#> ToString: function ()
#> type: function (schema = self$schema)
#> type_id: function (schema = self$schema)
#> $ hp : chr(0)
#> $ drat: NULL
#> $ wt : list()
#> $ qsec: logi(0)
#> - attr(*, "class")= chr "arrow_dplyr_query"
```
<sup>Created on 2022-06-07 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>
Issue Links
- Blocked: ARROW-16777 [R] printing data in Table/RecordBatch print method (Open)