Apache Arrow / ARROW-17559

[R][C++] Regression: big performance hit after removing schema binding

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 9.0.0
    • Fix Version/s: 10.0.0
    • Component/s: C++, R
    • Environment: ubuntu 2020

    Description

      After ARROW-15260 I observe large increases in memory use and compute time with basic summarize queries. In some cases my use case shows almost 10x the memory and 10x the computation time.

      Here is a less dramatic reproduction, modeled on my real use case, which shows roughly a 2x increase in run time:

        library(arrow)
        library(glue)   # glue() builds the partition paths
        library(purrr)  # map()
        dir.create(dir <- "/tmp/iris", showWarnings = FALSE)
        for (day in seq_len(100)) {
          dir.create(glue("{dir}/day={day}"), showWarnings = FALSE)
          for (i in seq_len(10)) {
            # 20 copies of iris with suffixed column names, giving a 100-column table
            dfs <- map(seq_len(20), function(j) {
              names(iris) <- paste0(names(iris), j)
              iris
            })
            df <- dplyr::bind_cols(!!!dfs)
            write_parquet(df, glue("{dir}/day={day}/{i}.parquet"))
          }
        }
      
        library(arrow)
        library(dplyr)  # %>%, group_by(), summarise(), n(), collect()
        system.time(
          open_dataset("/tmp/iris") %>%
            group_by(day, Species1) %>%
            summarise(N = n(), .groups = "drop") %>%
            collect())
      
      

      Before commit 838687178: 0.33 sec; after: 0.73 sec.

      If I put back the schema binding that was removed, I get the performance back.

            People

              Assignee: Vibhatha Lakmal Abeykoon (vibhatha)
              Reporter: Vitalie Spinu (vspinu)
              Votes: 0
              Watchers: 3