Apache Arrow / ARROW-18195

[R][C++] Final value returned by case_when is NA when input has 64 or more values and 1 or more NAs

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 10.0.0
    • Fix Version/s: 11.0.0
    • Component/s: C++, R

    Description

      There appears to be a bug when `dplyr::case_when` is applied to an Arrow table that contains NA values. In the reproducible example below, the output from processing the Arrow table does not match the output from processing the equivalent tibble. If the NAs are removed from the data frame, the outputs match.

      ``` r
      library(dplyr)
      #> 
      #> Attaching package: 'dplyr'
      #> The following objects are masked from 'package:stats':
      #> 
      #>     filter, lag
      #> The following objects are masked from 'package:base':
      #> 
      #>     intersect, setdiff, setequal, union
      library(arrow)
      #> 
      #> Attaching package: 'arrow'
      #> The following object is masked from 'package:utils':
      #> 
      #>     timestamp
      library(assertthat)
      
      play_results = c('single', 'double', 'triple', 'home_run')
      
      nrows = 1000
      
      # Set frac_na to 0 and the mismatch disappears.
      frac_na = 0.05
      
      # Create a test dataframe with NA values
      test_df = tibble(
              play_result = sample(play_results, nrows, replace = TRUE)
          ) %>%
          mutate(
              play_result = ifelse(runif(nrows) < frac_na, NA_character_, play_result)
          )
          
      
      test_arrow = arrow_table(test_df)
      
      process_plays = function(df) {
          df %>%
              mutate(
                  avg = case_when(
                      play_result == 'single' ~ 1,
                      play_result == 'double' ~ 1,
                      play_result == 'triple' ~ 1,
                      play_result == 'home_run' ~ 1,
                      is.na(play_result) ~ NA_real_,
                      TRUE ~ 0
                  )
              ) %>%
              count(play_result, avg) %>%
              arrange(play_result)
      }
      
      # Compare arrow_table result to tibble result
      result_tibble = process_plays(test_df)
      result_arrow = process_plays(test_arrow) %>% collect()
      assertthat::assert_that(identical(result_tibble, result_arrow))
      #> Error: result_tibble not identical to result_arrow
      ```
      
      <sup>Created on 2022-10-29 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
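
      The reprex above relies on random sampling, so the rows involved change between runs. A smaller deterministic variant, sketched below, follows the condition named in the issue title (64 or more values, 1 or more NAs). The names `x`, `df`, `avg_of`, `expected`, and `actual` are illustrative, and placing the single NA in the first position is an assumption: the report does not say which NA positions trigger the problem, so this sketch may need adjusting to reproduce it.

      ```r
      library(dplyr)
      library(arrow)

      # 64 values with exactly one NA (NA position is an assumption)
      x <- c(NA_character_, rep("single", 63))
      df <- tibble(play_result = x)

      # Apply the same case_when to a tibble or an Arrow table
      # and return the resulting column as a plain vector.
      avg_of <- function(data) {
        data %>%
          mutate(
            avg = case_when(
              play_result == "single" ~ 1,
              is.na(play_result) ~ NA_real_,
              TRUE ~ 0
            )
          ) %>%
          collect() %>%
          pull(avg)
      }

      expected <- avg_of(df)               # dplyr on a tibble
      actual <- avg_of(arrow_table(df))    # arrow compute path

      # Inspect the final element of each result
      tail(expected, 1)
      tail(actual, 1)
      ```

      If the behavior matches the issue title, the last element of `actual` would be NA on arrow 10.0.0 while the last element of `expected` is 1. Dropping `x` to 63 values, or removing the NA, gives a quick way to check the stated threshold.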
      

      I have reproduced this issue on both macOS and Ubuntu 20.04.

       

      ```
      r$> sessionInfo()
      R version 4.2.1 (2022-06-23)
      Platform: aarch64-apple-darwin21.5.0 (64-bit)
      Running under: macOS Monterey 12.5.1
      
      Matrix products: default
      BLAS:   /opt/homebrew/Cellar/openblas/0.3.20/lib/libopenblasp-r0.3.20.dylib
      LAPACK: /opt/homebrew/Cellar/r/4.2.1/lib/R/lib/libRlapack.dylib
      
      locale:
      [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
      
      attached base packages:
      [1] stats     graphics  grDevices datasets  utils     methods   base
      
      other attached packages:
      [1] assertthat_0.2.1 arrow_10.0.0     dplyr_1.0.10
      
      loaded via a namespace (and not attached):
       [1] compiler_4.2.1    pillar_1.8.1      highr_0.9         R.methodsS3_1.8.2 R.utils_2.12.0    tools_4.2.1       bit_4.0.4         digest_0.6.29
       [9] evaluate_0.15     lifecycle_1.0.1   tibble_3.1.8      R.cache_0.16.0    pkgconfig_2.0.3   rlang_1.0.5       reprex_2.0.2      DBI_1.1.2
      [17] cli_3.3.0         rstudioapi_0.13   yaml_2.3.5        xfun_0.31         fastmap_1.1.0     withr_2.5.0       styler_1.8.0      knitr_1.39
      [25] generics_0.1.3    fs_1.5.2          vctrs_0.4.1       bit64_4.0.5       tidyselect_1.1.2  glue_1.6.2        R6_2.5.1          processx_3.5.3
      [33] fansi_1.0.3       rmarkdown_2.14    purrr_0.3.4       callr_3.7.0       clipr_0.8.0       magrittr_2.0.3    ellipsis_0.3.2    ps_1.7.0
      [41] htmltools_0.5.3   renv_0.16.0       utf8_1.2.2        R.oo_1.25.0
      ```
      

      Attachments

        1. test_issue.R (1 kB), Lee Mendelowitz

            People

              Assignee: Will Jones (wjones127)
              Reporter: Lee Mendelowitz (LMendy)
                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 40m