Details
-
Bug
-
Status: Resolved
-
Minor
-
Resolution: Fixed
-
2.0.0
Description
With dataset:
Schema X3: timestamp[us] user_id: dictionary<values=string, indices=int32> classification_name: dictionary<values=string, indices=int32> X2: string X1: dictionary<values=string, indices=int32> X4: string X5: dictionary<values=string, indices=int32> X6: dictionary<values=string, indices=int32>
The following succeeds:
chunk <- ds %>% select(user_id) %>% collect() %>% count(user_id) %>% as_tibble() %>% count(user_id, wt=n)
While the following fails:
chunk <- ds %>% select(user_id) %>% arrow::map_batches(~count(., user_id)) %>% as_tibble() %>% count(user_id, wt=x)
With error:
Error: Can't subset columns that don't exist. ✖ Column `.drop` doesn't exist. Traceback: 1. ds %>% select(user_id) %>% arrow::map_batches(~count(., . user_id)) %>% as_tibble() %>% count(user_id, wt = x) 2. count(., user_id, wt = x) 3. group_by(x, ..., .add = TRUE, .drop = .drop) 4. as_tibble(.) 5. arrow::map_batches(., ~count(., user_id)) 6. lapply(scanner$Scan(), function(scan_task) { . lapply(scan_task$Execute(), function(batch) { . FUN(batch, ...) . }) . }) 7. map(.x, .f, ...) 8. .f(.x[[i]], ...) 9. lapply(scan_task$Execute(), function(batch) { . FUN(batch, ...) . }) 10. map(.x, .f, ...) 11. .f(.x[[i]], ...) 12. FUN(batch, ...) 13. count(., user_id) 14. tally(out, wt = !!enquo(wt), sort = sort, name = name) 15. (function() { . old.options <- options(dplyr.summarise.inform = FALSE) . on.exit(options(old.options)) . summarise(x, `:=`(!!name, !!n)) . })() 16. summarise(x, `:=`(!!name, !!n)) 17. summarise.arrow_dplyr_query(x, `:=`(!!name, !!n)) 18. dplyr::select(.data, vars_to_keep) 19. select.arrow_dplyr_query(.data, vars_to_keep) 20. column_select(arrow_dplyr_query(.data), !!!enquos(...)) 21. .FUN(names(.data), !!!enquos(...)) 22. eval_select_impl(NULL, .vars, expr(c(!!!dots)), include = .include, . exclude = .exclude, strict = .strict, name_spec = unique_name_spec, . uniquely_named = TRUE) 23. with_subscript_errors(vars_select_eval(vars, expr, strict, data = x, . name_spec = name_spec, uniquely_named = uniquely_named, allow_rename = allow_rename, . type = type), type = type) 24. tryCatch(instrument_base_errors(expr), vctrs_error_subscript = function(cnd) { . cnd$subscript_action <- subscript_action(type) . cnd$subscript_elt <- "column" . cnd_signal(cnd) . }) 25. tryCatchList(expr, classes, parentenv, handlers) 26. tryCatchOne(expr, names, parentenv, handlers[[1L]]) 27. value[[3L]](cond) 28. cnd_signal(cnd) 29. rlang:::signal_abort(x)
The dataset is 8 parquet files with no hive partitioning.
sessionInfo():
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTSMatrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.solocale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages:
[1] stats graphics grDevices utils datasets methods base other attached packages:
[1] forcats_0.5.0 stringr_1.4.0 dplyr_1.0.2 purrr_0.3.4
[5] readr_1.4.0 tidyr_1.1.2 tibble_3.0.4 ggplot2_3.3.2
[9] tidyverse_1.3.0 dtplyr_1.0.1 data.table_1.13.2loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 lubridate_1.7.9.2 aws.ec2metadata_0.2.0
[4] ps_1.5.0 arrow_2.0.0 assertthat_0.2.1
[7] digest_0.6.27 utf8_1.1.4 aws.signature_0.6.0
[10] mime_0.9 IRdisplay_0.7.0 R6_2.5.0
[13] cellranger_1.1.0 repr_1.1.0 backports_1.2.0
[16] reprex_0.3.0 evaluate_0.14 httr_1.4.2
[19] pillar_1.4.7 rlang_0.4.9 curl_4.3
[22] uuid_0.1-4 readxl_1.3.1 rstudioapi_0.13
[25] bit_4.0.4 munsell_0.5.0 broom_0.7.2
[28] compiler_4.0.3 modelr_0.1.8 pkgconfig_2.0.3
[31] base64enc_0.1-3 htmltools_0.5.0 tidyselect_1.1.0
[34] fansi_0.4.1 crayon_1.3.4 dbplyr_2.0.0
[37] withr_2.3.0 grid_4.0.3 jsonlite_1.7.1
[40] gtable_0.3.0 lifecycle_0.2.0 DBI_1.1.0
[43] magrittr_2.0.1 scales_1.1.1 cli_2.2.0
[46] stringi_1.5.3 fs_1.5.0 xml2_1.3.2
[49] ellipsis_0.3.1 generics_0.1.0 vctrs_0.3.5
[52] IRkernel_1.1.1 tools_4.0.3 bit64_4.0.5
[55] glue_1.4.2 hms_0.5.3 aws.s3_0.3.22
[58] colorspace_2.0-0 rvest_0.3.6 pbdZMQ_0.3-3.1
Attachments
Issue Links
- is related to
-
ARROW-14029 [R] Repair map_batches()
- Resolved
- links to