Apache Arrow / ARROW-17187

[R] Improve lazy ALTREP implementation for String

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 11.0.0
    • Component/s: R

    Description

      ARROW-16578 noted that looping through an ALTREP character vector created by the arrow R package is very slow. The temporary workaround is to materialize the vector as soon as its first element is requested. That is much faster than our initial implementation, but it is probably unnecessary, given that other ALTREP character implementations do not appear to have this issue:

      (Timings are from before ARROW-16578 was merged; that change reduces the 5-second operation below to 0.05 seconds.)

      library(arrow, warn.conflicts = FALSE)
      #> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
      
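      # one million strings drawn from 20 distinct values
      df1 <- tibble::tibble(x = as.character(floor(runif(1000000) * 20)))
      write_parquet(df1, "/tmp/test.parquet")
      # df2$x is read back as an arrow ALTREP character vector
      df2 <- read_parquet("/tmp/test.parquet")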
      system.time(unique(df1$x))
      #>    user  system elapsed 
      #>   0.022   0.001   0.023
      system.time(unique(df2$x))
      #>    user  system elapsed 
      #>   4.529   0.680   5.226
      
      # the slowness is almost certainly not due to ALTREP itself
      # but probably has something to do with our implementation
      tf <- tempfile()
      readr::write_csv(df1, tf)
      df3 <- vroom::vroom(tf, delim = ",", altrep = TRUE)
      #> Rows: 1000000 Columns: 1
      #> ── Column specification ────────────────────────────────────────────────────────
      #> Delimiter: ","
      #> dbl (1): x
      #> 
      #> ℹ Use `spec()` to retrieve the full column specification for this data.
      #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
      .Internal(inspect(df3$x))
      #> @2d2042048 14 REALSXP g0c0 [REF(65535)] vroom_dbl (len=1000000, materialized=F)
      system.time(unique(df3$x))
      #>    user  system elapsed 
      #>   0.127   0.001   0.128
      .Internal(inspect(df3$x))
      #> @2d2042048 14 REALSXP g1c0 [MARK,REF(65535)] vroom_dbl (len=1000000, materialized=F)
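
For a comparison against an ALTREP *character* vector that needs neither arrow nor vroom, the following is a minimal base R sketch. It assumes R >= 3.5.0, where `as.character()` on an attribute-free integer vector returns a deferred string ALTREP vector; the `.Internal(inspect())` call lets you confirm that on your own build. The variable names are illustrative only, and no timings are claimed here.

```r
# Base R only; no arrow or vroom required.
set.seed(1)
n <- 1e6
ints <- as.integer(floor(runif(n) * 20))

# Assumption: on R >= 3.5.0 this is a deferred string (ALTREP) vector.
x_altrep <- as.character(ints)

# paste0() builds an ordinary character vector holding the same values.
x_plain <- paste0(ints)

# Inspect a small example to check how your R build represents the result.
.Internal(inspect(as.character(1:3)))

# Time unique() on both representations.
t_altrep <- system.time(u_altrep <- unique(x_altrep))
t_plain <- system.time(u_plain <- unique(x_plain))

cat("ALTREP character vector:", t_altrep[["elapsed"]], "s elapsed\n")
cat("plain character vector: ", t_plain[["elapsed"]], "s elapsed\n")

# Both representations must yield the same distinct values in the same order.
stopifnot(identical(u_altrep, u_plain))
```

If the two elapsed times are of the same order of magnitude, that is consistent with the observation above that ALTREP by itself is not the source of the slowdown.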
      

            People

              Assignee: paleolimbot Dewey Dunnington
              Reporter: paleolimbot Dewey Dunnington

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 3h