Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Description
With dplyr, when we group_by with an unnamed expression, a column holding the result of the expression is added to the data frame, named after the expression itself.
> example_data %>%
+   group_by(int < 4) %>% collect()
# A tibble: 10 x 8
# Groups:   int < 4 [3]
     int   dbl  dbl2 lgl   false chr   fct   `int < 4`
   <int> <dbl> <dbl> <lgl> <lgl> <chr> <fct> <lgl>
 1     1   1.1     5 TRUE  FALSE a     a     TRUE
 2     2   2.1     5 NA    FALSE b     b     TRUE
 3     3   3.1     5 TRUE  FALSE c     c     TRUE
 4    NA   4.1     5 FALSE FALSE d     d     NA
 5     5   5.1     5 TRUE  FALSE e     NA    FALSE
 6     6   6.1     5 NA    FALSE NA    NA    FALSE
 7     7   7.1     5 NA    FALSE g     g     FALSE
 8     8   8.1     5 FALSE FALSE h     h     FALSE
 9     9  NA       5 FALSE FALSE i     i     FALSE
10    10  10.1     5 NA    FALSE j     j     FALSE
Arrow doesn't do this, however, because we (currently) only add columns when the expression is named.
> Table$create(example_data) %>%
+   group_by(int < 4) %>% collect()
Error: Invalid: No match for FieldRef.Name(int < 4) in int: int32
dbl: double
dbl2: double
lgl: bool
false: bool
chr: string
fct: dictionary<values=string, indices=int8, ordered=0>
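To make the named versus unnamed distinction concrete, here is a minimal sketch. It assumes dplyr is installed; toy_data is a hypothetical stand-in for example_data (which is not defined in this issue), and int_lt_4 is a made-up column name. The Arrow step only runs if the arrow package is available, and relies on the statement above that named expressions do add a column.

```r
library(dplyr)

# Hypothetical stand-in for example_data; only the int column matters here.
toy_data <- tibble::tibble(
  int = c(1L, 2L, 3L, NA, 5L),
  chr = c("a", "b", "c", "d", "e")
)

# Unnamed expression: dplyr adds a grouping column named after the expression text.
unnamed <- toy_data %>% group_by(int < 4)
names(unnamed)
group_vars(unnamed)

# Named expression: the new grouping column takes the supplied (made-up) name.
named <- toy_data %>% group_by(int_lt_4 = int < 4)
group_vars(named)

# Assumption: per the description, the named form is the one Arrow handles.
if (requireNamespace("arrow", quietly = TRUE)) {
  arrow_named <- arrow::Table$create(toy_data) %>%
    group_by(int_lt_4 = int < 4) %>%
    collect()
  print(names(arrow_named))
}
```

Until unnamed expressions are handled, naming the expression explicitly may serve as a workaround on the Arrow side, at the cost of a column name that differs from what dplyr would generate.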
This isn't a big deal right now since grouped aggregations aren't (quite) here yet, but once they are supported, people will write code like this and expect it to work as it does in dplyr.