  1. Apache Arrow
  2. ARROW-17361

[R] dplyr::summarize fails with division when divisor is a variable

Details

    Description

      Hello,

      I found this odd behaviour when trying to compute an aggregate with dplyr::summarize: when I use a pre-defined variable as the divisor while aggregating, execution fails with an error saying the expression is not an aggregate expression or is not supported in Arrow. When I write the value of the variable directly in the aggregation, it works.

       

      See below:

       

      library(dplyr)
      library(arrow)
      
      small_dataset <- tibble::tibble(
        ## x = rep(c("a", "b"), each = 5),
        y = rep(1:5, 2)
      )
      
      ## convert "small_dataset" into a ...dataset
      tmpdir <- tempfile()
      dir.create(tmpdir)
      write_dataset(small_dataset, tmpdir)
      
      ## works
      open_dataset(tmpdir) %>%
        summarize(value = sum(y) / 10) %>%
        collect()
      
      ## fails
      scale_factor <- 10
      open_dataset(tmpdir) %>%
        summarize(value = sum(y) / scale_factor) %>%
        collect()
      #> Error: Error in summarize_eval(names(exprs)[i],
      #> exprs[[i]], ctx, length(.data$group_by_vars) > :
      #>   Expression sum(y)/scale_factor is not an aggregate
      #>   expression or is not supported in Arrow
      #> Call collect() first to pull data into R.
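
A possible workaround, sketched here as an assumption and not confirmed anywhere in this report: because the literal `10` works, injecting the variable's value with rlang's `!!` operator should hand `summarize()` the same expression as the working case. The data setup repeats the example above; `result` is a name introduced only for illustration.

```r
library(dplyr)
library(arrow)

# Same toy data as in the report
small_dataset <- tibble::tibble(y = rep(1:5, 2))
tmpdir <- tempfile()
dir.create(tmpdir)
write_dataset(small_dataset, tmpdir)

scale_factor <- 10

# `!!` replaces scale_factor with its current value when the expression
# is captured, so the query is built from sum(y) / 10 (assumed workaround)
result <- open_dataset(tmpdir) %>%
  summarize(value = sum(y) / !!scale_factor) %>%
  collect()
```

This only sidesteps the lookup of `scale_factor` inside the aggregation; it does not explain or fix the underlying behaviour.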
         

      I was not sure how to name this issue/bug (if it is one), so if there is a clearer, more descriptive title, you're welcome to adjust it.

       

      Thanks for your work!

       

      Oliver

       

      > arrow_info()
      Arrow package version: 8.0.0
      
      Capabilities:
                     
      dataset    TRUE
      substrait FALSE
      parquet    TRUE
      json       TRUE
      s3         TRUE
      utf8proc   TRUE
      re2        TRUE
      snappy     TRUE
      gzip       TRUE
      brotli     TRUE
      zstd       TRUE
      lz4        TRUE
      lz4_frame  TRUE
      lzo       FALSE
      bz2        TRUE
      jemalloc   TRUE
      mimalloc   TRUE
      
      Memory:
                        
      Allocator jemalloc
      Current   64 bytes
      Max       41.25 Kb
      
      Runtime:
                              
      SIMD Level          avx2
      Detected SIMD Level avx2
      
      Build:
                                 
      C++ Library Version   8.0.0
      C++ Compiler            GNU
      C++ Compiler Version 12.1.0 

            People

              paleolimbot Dewey Dunnington
              zauster Oliver Reiter

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 50m