Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-11415

[R] experimental map_batches cannot find columns

Details

    • Bug
    • Status: Resolved
    • Minor
    • Resolution: Fixed
    • 2.0.0
    • 8.0.0
    • R

    Description

      With dataset:

       

      Schema
      X3: timestamp[us]
      user_id: dictionary<values=string, indices=int32>
      classification_name: dictionary<values=string, indices=int32>
      X2: string
      X1: dictionary<values=string, indices=int32>
      X4: string
      X5: dictionary<values=string, indices=int32>
      X6: dictionary<values=string, indices=int32>
      

      The following succeeds:

      chunk <- ds %>%
          select(user_id) %>%
          collect() %>%
          count(user_id) %>%
          as_tibble() %>%
          count(user_id, wt=n)
      

      While the following fails:

      chunk <- ds %>%
          select(user_id) %>%
          arrow::map_batches(~count(., user_id)) %>%
          as_tibble() %>%
          count(user_id, wt=x)
      

      With error:

      Error: Can't subset columns that don't exist.
      ✖ Column `.drop` doesn't exist.
      Traceback:
      
      1. ds %>% select(user_id) %>% arrow::map_batches(~count(., 
       .     user_id)) %>% as_tibble() %>% count(user_id, wt = x)
      2. count(., user_id, wt = x)
      3. group_by(x, ..., .add = TRUE, .drop = .drop)
      4. as_tibble(.)
      5. arrow::map_batches(., ~count(., user_id))
      6. lapply(scanner$Scan(), function(scan_task) {
       .     lapply(scan_task$Execute(), function(batch) {
       .         FUN(batch, ...)
       .     })
       . })
      7. map(.x, .f, ...)
      8. .f(.x[[i]], ...)
      9. lapply(scan_task$Execute(), function(batch) {
       .     FUN(batch, ...)
       . })
      10. map(.x, .f, ...)
      11. .f(.x[[i]], ...)
      12. FUN(batch, ...)
      13. count(., user_id)
      14. tally(out, wt = !!enquo(wt), sort = sort, name = name)
      15. (function() {
        .     old.options <- options(dplyr.summarise.inform = FALSE)
        .     on.exit(options(old.options))
        .     summarise(x, `:=`(!!name, !!n))
        . })()
      16. summarise(x, `:=`(!!name, !!n))
      17. summarise.arrow_dplyr_query(x, `:=`(!!name, !!n))
      18. dplyr::select(.data, vars_to_keep)
      19. select.arrow_dplyr_query(.data, vars_to_keep)
      20. column_select(arrow_dplyr_query(.data), !!!enquos(...))
      21. .FUN(names(.data), !!!enquos(...))
      22. eval_select_impl(NULL, .vars, expr(c(!!!dots)), include = .include, 
        .     exclude = .exclude, strict = .strict, name_spec = unique_name_spec, 
        .     uniquely_named = TRUE)
      23. with_subscript_errors(vars_select_eval(vars, expr, strict, data = x, 
        .     name_spec = name_spec, uniquely_named = uniquely_named, allow_rename = allow_rename, 
        .     type = type), type = type)
      24. tryCatch(instrument_base_errors(expr), vctrs_error_subscript = function(cnd) {
        .     cnd$subscript_action <- subscript_action(type)
        .     cnd$subscript_elt <- "column"
        .     cnd_signal(cnd)
        . })
      25. tryCatchList(expr, classes, parentenv, handlers)
      26. tryCatchOne(expr, names, parentenv, handlers[[1L]])
      27. value[[3L]](cond)
      28. cnd_signal(cnd)
      29. rlang:::signal_abort(x)
      

      The dataset is 8 parquet files with no hive partitioning.

       

      sessionInfo():

      R version 4.0.3 (2020-10-10)
      Platform: x86_64-pc-linux-gnu (64-bit)
      Running under: Ubuntu 18.04.3 LTSMatrix products: default
      BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
      LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.solocale:
       [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
       [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
       [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
       [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
       [9] LC_ADDRESS=C               LC_TELEPHONE=C            
      [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       attached base packages:
      [1] stats     graphics  grDevices utils     datasets  methods   base     other attached packages:
       [1] forcats_0.5.0     stringr_1.4.0     dplyr_1.0.2       purrr_0.3.4      
       [5] readr_1.4.0       tidyr_1.1.2       tibble_3.0.4      ggplot2_3.3.2    
       [9] tidyverse_1.3.0   dtplyr_1.0.1      data.table_1.13.2loaded via a namespace (and not attached):
       [1] Rcpp_1.0.5            lubridate_1.7.9.2     aws.ec2metadata_0.2.0
       [4] ps_1.5.0              arrow_2.0.0           assertthat_0.2.1     
       [7] digest_0.6.27         utf8_1.1.4            aws.signature_0.6.0  
      [10] mime_0.9              IRdisplay_0.7.0       R6_2.5.0             
      [13] cellranger_1.1.0      repr_1.1.0            backports_1.2.0      
      [16] reprex_0.3.0          evaluate_0.14         httr_1.4.2           
      [19] pillar_1.4.7          rlang_0.4.9           curl_4.3             
      [22] uuid_0.1-4            readxl_1.3.1          rstudioapi_0.13      
      [25] bit_4.0.4             munsell_0.5.0         broom_0.7.2          
      [28] compiler_4.0.3        modelr_0.1.8          pkgconfig_2.0.3      
      [31] base64enc_0.1-3       htmltools_0.5.0       tidyselect_1.1.0     
      [34] fansi_0.4.1           crayon_1.3.4          dbplyr_2.0.0         
      [37] withr_2.3.0           grid_4.0.3            jsonlite_1.7.1       
      [40] gtable_0.3.0          lifecycle_0.2.0       DBI_1.1.0            
      [43] magrittr_2.0.1        scales_1.1.1          cli_2.2.0            
      [46] stringi_1.5.3         fs_1.5.0              xml2_1.3.2           
      [49] ellipsis_0.3.1        generics_0.1.0        vctrs_0.3.5          
      [52] IRkernel_1.1.1        tools_4.0.3           bit64_4.0.5          
      [55] glue_1.4.2            hms_0.5.3             aws.s3_0.3.22        
      [58] colorspace_2.0-0      rvest_0.3.6           pbdZMQ_0.3-3.1
      

       

      Attachments

        Issue Links

          Activity

            People

              wjones127 Will Jones
              wjones127 Will Jones
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 40m
                  40m

                  Slack

                    Issue deployment