Details

Task

Status: Closed

Major

Resolution: Feedback Received

None

None

None
Description
Hi
Hello,
We are running a continuous benchmarking project (https://h2oai.github.io/dbbenchmark). In recent days we added Arrow project.
It uses R's dplyr and ArrowTable as backend. Queries have been written based on arrow R package documentation.
There are 10 grouping queries:
```
 q1: sum v1 by id1
AT %>% select(id1, v1) %>% group_by(id1) %>% collect() %>% summarise(v1=sum(v1, na.rm=TRUE))
 q2: sum v1 by id1:id2
AT %>% select(id1, id2, v1) %>% group_by(id1, id2) %>% collect() %>% summarise(v1=sum(v1, na.rm=TRUE))
 q3: sum v1 mean v3 by id3
AT %>% select(id3, v1, v3) %>% group_by(id3) %>% collect() %>% summarise(v1=sum(v1, na.rm=TRUE), v3=mean(v3, na.rm=TRUE))
 q4: mean v1:v3 by id4
AT %>% select(id4, v1, v2, v3) %>% group_by(id4) %>% collect() %>% summarise_at(.funs=\"mean\", .vars=c(\"v1\",\"v2\",\"v3\"), na.rm=TRUE)  q5: sum v1:v3 by id6
AT %>% select(id6, v1, v2, v3) %>% group_by(id6) %>% collect () %>% summarise_at(.funs=\"sum\", .vars=c(\"v1\",\"v2\",\"v3\"), na.rm=TRUE)
 q6: median v3 sd v3 by id4 id5
AT %>% select(id4, id5, v3) %>% group_by(id4, id5) %>% collect() %>% summarise(median_v3=median(v3, na.rm=TRUE), sd_v3=sd(v3, na.rm=TRUE))
 q7: max v1  min v2 by id3
AT %>% select(id3, v1, v2) %>% group_by(id3) %>% collect() %>% summarise(range_v1_v2=max(v1, na.rm=TRUE)min(v2, na.rm=TRUE))
 q8: largest two v3 by id6
AT %>% select(id6, largest2_v3=v3) %>% filter(!is.na(largest2_v3)) %>% arrange(desc(largest2_v3)) %>% group_by(id6) %>% filter(row_number() <= 2L) %>% compute()
 q9: regression v1 v2 by id2 id4
AT %>% select(id2, id4, v1, v2) %>% group_by(id2, id4) %>% collect() %>% summarise(r2=cor(v1, v2, use=\"na.or.complete\")^2)
 q10: sum v3 count by id1:id6
AT %>% select(id1, id2, id3, id4, id5, id6, v3) %>% group_by(id1, id2, id3, id4, id5, id6) %>% collect() %>% summarise(v3=sum(v3, na.rm=TRUE), count=n())
```
Full benchmark script can be found at https://github.com/h2oai/dbbenchmark/blob/master/arrow/groupbyarrow.R
As per my understanding, all above queries (maybe excluding query 8) will not utilize any arrow computation, as of now. It is because those operations are not yet implemented in arrow, and they are falling back to dplyr implementation.
According to Neal's presentation I watched recently, code written now will over time get improved by improvements in arrow implementation. Continuous benchmark I am working on upgrades software automatically, therefore I would like to use the fact to write code now, and have it faster in future, as arrow implementation progresses. I believe the mentioned queries will not satisfy that, because of `collect()` call in the middle. AFAIU it needs a `compute()` call at the end instead (like now in query 8).
Is there a way to write this code to be optimal now, and also optimal in future. Similarly as presented by Neal in his presentation?