Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- None
Description
ARROW-16578 noted that looping through an ALTREP character vector created by the arrow R package was very costly. The temporary workaround is to materialize the vector whenever its first element is requested. This is much faster than our initial implementation, but it is probably unnecessary, given that other ALTREP character implementations do not appear to have this issue:
(The timings below were taken before merging ARROW-16578, which reduces the 5-second operation to 0.05 seconds.)
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.

df1 <- tibble::tibble(x = as.character(floor(runif(1000000) * 20)))
write_parquet(df1, "/tmp/test.parquet")
df2 <- read_parquet("/tmp/test.parquet")

system.time(unique(df1$x))
#>    user  system elapsed
#>   0.022   0.001   0.023
system.time(unique(df2$x))
#>    user  system elapsed
#>   4.529   0.680   5.226

# the speed is almost certainly not due to ALTREP itself
# but is probably something to do with our implementation
tf <- tempfile()
readr::write_csv(df1, tf)
df3 <- vroom::vroom(tf, delim = ",", altrep = TRUE)
#> Rows: 1000000 Columns: 1
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (1): x
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

.Internal(inspect(df3$x))
#> @2d2042048 14 REALSXP g0c0 [REF(65535)] vroom_dbl (len=1000000, materialized=F)
system.time(unique(df3$x))
#>    user  system elapsed
#>   0.127   0.001   0.128
.Internal(inspect(df3$x))
#> @2d2042048 14 REALSXP g1c0 [MARK,REF(65535)] vroom_dbl (len=1000000, materialized=F)