Apache Arrow / ARROW-13434

[R] group_by() with an unnamed expression


Details

    Description

      With dplyr, when we group_by() with an unnamed expression, a column holding the result of the expression is added to the data frame, named after the expression text (here `int < 4`).

      > example_data %>% 
      +   group_by(int < 4) %>% collect()
      # A tibble: 10 x 8
      # Groups:   int < 4 [3]
           int   dbl  dbl2 lgl   false chr   fct   `int < 4`
         <int> <dbl> <dbl> <lgl> <lgl> <chr> <fct> <lgl>    
       1     1   1.1     5 TRUE  FALSE a     a     TRUE     
       2     2   2.1     5 NA    FALSE b     b     TRUE     
       3     3   3.1     5 TRUE  FALSE c     c     TRUE     
       4    NA   4.1     5 FALSE FALSE d     d     NA       
       5     5   5.1     5 TRUE  FALSE e     NA    FALSE    
       6     6   6.1     5 NA    FALSE NA    NA    FALSE    
       7     7   7.1     5 NA    FALSE g     g     FALSE    
       8     8   8.1     5 FALSE FALSE h     h     FALSE    
       9     9  NA       5 FALSE FALSE i     i     FALSE    
      10    10  10.1     5 NA    FALSE j     j     FALSE    
      

      Arrow doesn't do this, however, because we (currently) only add columns when the expression is named.

      > Table$create(example_data) %>% 
      +   group_by(int < 4) %>% collect()
       Error: Invalid: No match for FieldRef.Name(int < 4) in int: int32
      dbl: double
      dbl2: double
      lgl: bool
      false: bool
      chr: string
      fct: dictionary<values=string, indices=int8, ordered=0> 
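
      The naming behavior described above can be sketched with plain dplyr. This is a minimal illustration, not the Arrow fix: the one-column `example_data` below is a hypothetical stand-in for the real dataset, and `int_lt_4` is an arbitrary name chosen for the example.

```r
library(dplyr)

# Hypothetical stand-in for example_data; only the `int` column matters here.
example_data <- tibble(int = c(1L, 2L, 3L, NA, 5L, 6L, 7L, 8L, 9L, 10L))

# Unnamed expression: dplyr adds a grouping column named after the expression text.
unnamed <- example_data %>% group_by(int < 4)
group_vars(unnamed)

# Named expression: the added grouping column takes the supplied name.
named <- example_data %>% group_by(int_lt_4 = int < 4)
group_vars(named)
```

      Going by the description, the named form is the one Arrow currently handles, so the analogous call on `Table$create(example_data)` would presumably be `group_by(int_lt_4 = int < 4)`; that has not been verified here.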
      

      This isn't a big deal right now, since grouped aggregations aren't (quite) here yet, but once they are supported, people will write code like this.

            People

              jonkeane Jonathan Keane
              Votes: 0
              Watchers: 3

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2h 10m