Apache Arrow / ARROW-11433

[R] Unexpectedly slow results reading csv


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Done
    • Component: R

    Description

      This came up while benchmarking Arrow's CSV reading. As far as I can tell this only impacts R, and only when reading the CSV into an Arrow Table (as_data_frame = FALSE) rather than pulling it into an R data frame. Most interactions with arrow after the CSV is read prevent this behavior from occurring.

      What I'm seeing is that subsequent reads take longer and longer (frequently in a stair-step pattern where every other iteration is slower).

      > system.time({
      +   for (i in 1:10) {
      +     print(system.time(tab <- read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
      +     tab <- NULL
      +   }
      + })
         user  system elapsed 
       24.788  19.485   7.216 
         user  system elapsed 
       24.952  21.786   9.225 
         user  system elapsed 
       25.150  23.039  10.332 
         user  system elapsed 
       25.382  31.012  17.995 
         user  system elapsed 
       25.309  25.140  12.356 
         user  system elapsed 
       25.302  26.975  13.938 
         user  system elapsed 
       25.509  34.390  21.134 
         user  system elapsed 
       25.674  28.195  15.048 
         user  system elapsed 
       25.031  28.094  16.449 
         user  system elapsed 
       25.825  37.165  23.379 
      # total time:
         user  system elapsed 
      256.178 299.671 175.119 
      

      Interestingly, something as unrelated as calling arrow:::default_memory_pool(), which only fetches the default memory pool, alleviates this behavior. Other interactions totally unrelated to the table do the same (e.g. empty_tab <- Table$create(data.frame())), as does proactively invalidating the table with tab$invalidate().

      > system.time({
      +   for (i in 1:10) {
      +     print(system.time(tab <- read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
      +     pool <- arrow:::default_memory_pool()
      +     tab <- NULL
      +   }
      + })
         user  system elapsed 
       25.257  19.475   6.785 
         user  system elapsed 
       25.271  19.838   6.821 
         user  system elapsed 
       25.288  20.103   6.861 
         user  system elapsed 
       25.188  20.290   7.217 
         user  system elapsed 
       25.283  20.043   6.832 
         user  system elapsed 
       25.194  19.947   6.906 
         user  system elapsed 
       25.278  19.993   6.834 
         user  system elapsed 
       25.355  20.018   6.833 
         user  system elapsed 
       24.986  19.869   6.865 
         user  system elapsed 
       25.130  19.878   6.798 
      # total time:
         user  system elapsed 
      255.381 210.598  83.109 ​
      > 
      

      I've tested this against Arrow 3.0.0, 2.0.0, and 1.0.0 and all experience the same behavior.

      I checked against pyarrow and do not see the same behavior:

      from pyarrow import csv
      import time
      
      for i in range(1, 10):
          start = time.time()
          table = csv.read_csv("r/source_data/nyctaxi_2010-01.csv")
          print(time.time() - start)
          del table
      

      results:

      7.586184978485107
      7.542470932006836
      7.92852783203125
      7.647372007369995
      7.742412805557251
      8.101378917694092
      7.7359960079193115
      7.843957901000977
      7.6457719802856445
      
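
      Since pyarrow does not show the slowdown, one way to check that the Python bindings release table memory promptly after `del` is to watch the default memory pool's allocation counter. Below is a minimal self-contained sketch of that check; the synthetic CSV file and the `held`/`released` variable names are illustrative and not from the original report:

      import os
      import tempfile
      import pyarrow as pa
      from pyarrow import csv
      
      # Write a small synthetic CSV so the sketch is self-contained.
      path = os.path.join(tempfile.mkdtemp(), "tiny.csv")
      with open(path, "w") as f:
          f.write("a,b\n")
          f.writelines(f"{i},{i * 2}\n" for i in range(100_000))
      
      baseline = pa.total_allocated_bytes()
      table = csv.read_csv(path)
      # Bytes held by the default memory pool while the table is alive.
      held = pa.total_allocated_bytes() - baseline
      del table
      # After deletion the counter should fall back toward the baseline.
      released = pa.total_allocated_bytes() - baseline
      print(held > 0, released < held)

      If the R bindings behaved the same way, the allocation counter would drop as soon as tab is replaced; the stair-step timings above suggest the release is being deferred instead.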

      Attachments

      Activity

      People

        Assignee: Unassigned
        Reporter: Jonathan Keane (jonkeane)