Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-11679

[R] Optimal arrow queries for benchmarking



    • Task
    • Status: Closed
    • Major
    • Resolution: Feedback Received
    • None
    • None
    • Benchmarking, R
    • None



      We are running a continuous benchmarking project (https://h2oai.github.io/db-benchmark). In recent days we added Arrow project.

      It uses R's dplyr and ArrowTable as backend. Queries have been written based on arrow R package documentation.

      There are 10 grouping queries:


      1. q1: sum v1 by id1
        AT %>% select(id1, v1) %>% group_by(id1) %>% collect() %>% summarise(v1=sum(v1, na.rm=TRUE))
      1. q2: sum v1 by id1:id2
        AT %>% select(id1, id2, v1) %>% group_by(id1, id2) %>% collect() %>% summarise(v1=sum(v1, na.rm=TRUE))
      1. q3: sum v1 mean v3 by id3
        AT %>% select(id3, v1, v3) %>% group_by(id3) %>% collect() %>% summarise(v1=sum(v1, na.rm=TRUE), v3=mean(v3, na.rm=TRUE))
      1. q4: mean v1:v3 by id4
        AT %>% select(id4, v1, v2, v3) %>% group_by(id4) %>% collect() %>% summarise_at(.funs=\"mean\", .vars=c(\"v1\",\"v2\",\"v3\"), na.rm=TRUE)
      2. q5: sum v1:v3 by id6
        AT %>% select(id6, v1, v2, v3) %>% group_by(id6) %>% collect () %>% summarise_at(.funs=\"sum\", .vars=c(\"v1\",\"v2\",\"v3\"), na.rm=TRUE)
      1. q6: median v3 sd v3 by id4 id5
        AT %>% select(id4, id5, v3) %>% group_by(id4, id5) %>% collect() %>% summarise(median_v3=median(v3, na.rm=TRUE), sd_v3=sd(v3, na.rm=TRUE))
      1. q7: max v1 - min v2 by id3
        AT %>% select(id3, v1, v2) %>% group_by(id3) %>% collect() %>% summarise(range_v1_v2=max(v1, na.rm=TRUE)-min(v2, na.rm=TRUE))
      1. q8: largest two v3 by id6
        AT %>% select(id6, largest2_v3=v3) %>% filter(!is.na(largest2_v3)) %>% arrange(desc(largest2_v3)) %>% group_by(id6) %>% filter(row_number() <= 2L) %>% compute()
      1. q9: regression v1 v2 by id2 id4
        AT %>% select(id2, id4, v1, v2) %>% group_by(id2, id4) %>% collect() %>% summarise(r2=cor(v1, v2, use=\"na.or.complete\")^2)
      1. q10: sum v3 count by id1:id6
        AT %>% select(id1, id2, id3, id4, id5, id6, v3) %>% group_by(id1, id2, id3, id4, id5, id6) %>% collect() %>% summarise(v3=sum(v3, na.rm=TRUE), count=n())

      Full benchmark script can be found at https://github.com/h2oai/db-benchmark/blob/master/arrow/groupby-arrow.R

      As per my understanding, all above queries (maybe excluding query 8) will not utilize any arrow computation, as of now. It is because those operations are not yet implemented in arrow, and they are falling back to dplyr implementation.

      According to Neal's presentation I watched recently, code written now will over time get improved by improvements in arrow implementation. Continuous benchmark I am working on upgrades software automatically, therefore I would like to use the fact to write code now, and have it faster in future, as arrow implementation progresses. I believe the mentioned queries will not satisfy that, because of `collect()` call in the middle. AFAIU it needs a `compute()` call at the end instead (like now in query 8).

      Is there a way to write this code to be optimal now, and also optimal in future. Similarly as presented by Neal in his presentation?




            Unassigned Unassigned
            jangorecki Jan Gorecki
            0 Vote for this issue
            5 Start watching this issue