Details
Description
After ARROW-15260 I observe large increases in memory use and compute time with basic summarize queries. My use case shows nearly 10x memory and 10x compute time increases in some cases.
Here is a less dramatic reproduction, modeled on my real use case, which shows a 2x time increase:
library(arrow)
library(dplyr)
library(glue)
library(purrr)

dir.create(dir <- "/tmp/iris", showWarnings = FALSE)
for (day in seq_len(100)) {
  dir.create(glue("{dir}/day={day}"), showWarnings = FALSE)
  for (i in seq_len(10)) {
    # build a wide data frame: 20 copies of iris with suffixed column names
    dfs <- map(seq_len(20), function(j) {
      names(iris) <- paste0(names(iris), j)
      iris
    })
    df <- bind_cols(!!!dfs)
    write_parquet(df, glue("{dir}/day={day}/{i}.parquet"))
  }
}

system.time(
  open_dataset("/tmp/iris") %>%
    group_by(day, Species1) %>%
    summarise(N = n(), .groups = "drop") %>%
    collect()
)
Before commit 838687178: 0.33 sec; after: 0.73 sec.
If I restore the schema binding that was removed in that commit, the performance returns.
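A possible mitigation while the regression stands, assuming the slowdown comes from all fields being materialized when the scan projection is unbound, is to restrict the scan to the columns the query actually uses with an explicit `select()` (a hedged sketch, not a confirmed fix):

```r
library(arrow)
library(dplyr)

# Explicitly project only the two columns the aggregation needs,
# so the dataset scan does not have to load all 100 fields.
open_dataset("/tmp/iris") %>%
  select(day, Species1) %>%
  group_by(day, Species1) %>%
  summarise(N = n(), .groups = "drop") %>%
  collect()
```

Whether this avoids the extra memory and time depends on the projection actually being pushed down to the scanner in the affected versions.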
Issue Links
- is fixed by ARROW-17556 [C++] Unbound scan projection expression leads to all fields being loaded (Resolved)