Details
Bug
Status: Closed
Minor
Resolution: Done
Description
This came up while benchmarking Arrow's CSV reading. As far as I can tell, this only affects R, and only when reading the CSV into Arrow without pulling it into R. Most Arrow interactions after the CSV is read appear to prevent the behavior.
What I'm seeing is that on subsequent reads, the time to read gets longer and longer (frequently in a stair-step pattern where every other iteration takes longer).
> system.time({
+   for (i in 1:10) {
+     print(system.time(tab <- read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
+     tab <- NULL
+   }
+ })
   user  system elapsed
 24.788  19.485   7.216
   user  system elapsed
 24.952  21.786   9.225
   user  system elapsed
 25.150  23.039  10.332
   user  system elapsed
 25.382  31.012  17.995
   user  system elapsed
 25.309  25.140  12.356
   user  system elapsed
 25.302  26.975  13.938
   user  system elapsed
 25.509  34.390  21.134
   user  system elapsed
 25.674  28.195  15.048
   user  system elapsed
 25.031  28.094  16.449
   user  system elapsed
 25.825  37.165  23.379
# total time:
   user  system elapsed
256.178 299.671 175.119
Interestingly, doing something as unrelated as calling arrow:::default_memory_pool(), which only gets the default memory pool, alleviates this behavior. Other interactions totally unrelated to the table have the same effect (e.g. empty_tab <- Table$create(data.frame())), as does proactively invalidating the table with tab$invalidate().
> system.time({
+   for (i in 1:10) {
+     print(system.time(tab <- read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
+     pool <- arrow:::default_memory_pool()
+     tab <- NULL
+   }
+ })
   user  system elapsed
 25.257  19.475   6.785
   user  system elapsed
 25.271  19.838   6.821
   user  system elapsed
 25.288  20.103   6.861
   user  system elapsed
 25.188  20.290   7.217
   user  system elapsed
 25.283  20.043   6.832
   user  system elapsed
 25.194  19.947   6.906
   user  system elapsed
 25.278  19.993   6.834
   user  system elapsed
 25.355  20.018   6.833
   user  system elapsed
 24.986  19.869   6.865
   user  system elapsed
 25.130  19.878   6.798
# total time:
   user  system elapsed
255.381 210.598  83.109
I've tested this against Arrow 3.0.0, 2.0.0, and 1.0.0 and all experience the same behavior.
I checked against pyarrow and do not see the same behavior:
from pyarrow import csv
import time

for i in range(1, 10):
    start = time.time()
    table = csv.read_csv("r/source_data/nyctaxi_2010-01.csv")
    print(time.time() - start)
    del table
results:
7.586184978485107
7.542470932006836
7.92852783203125
7.647372007369995
7.742412805557251
8.101378917694092
7.7359960079193115
7.843957901000977
7.6457719802856445
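To make the comparison between runs easier to read than a column of raw timings, the repeated-read loop could be wrapped in a small helper that also reports whether reads slow down over time. This is only a sketch, not part of the original report: the names time_repeated and summarize are hypothetical, and the helper takes any zero-argument callable so it does not depend on pyarrow itself.

```python
import statistics
import time


def time_repeated(read_fn, iterations=10):
    """Call read_fn `iterations` times, dropping each result, and return the
    elapsed wall-clock seconds for every call.

    read_fn is a hypothetical zero-argument callable supplied by the caller.
    """
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        result = read_fn()
        timings.append(time.perf_counter() - start)
        del result  # drop the reference before the next read, as in the loop above
    return timings


def summarize(timings):
    """Return min, median, max, and the ratio of the last timing to the first.

    A last_over_first ratio well above 1 suggests reads are getting slower.
    """
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
        "last_over_first": timings[-1] / timings[0],
    }
```

With pyarrow installed, this could be used as summarize(time_repeated(lambda: csv.read_csv("r/source_data/nyctaxi_2010-01.csv"))). For the pyarrow timings listed above, the last-to-first ratio is close to 1, whereas the first R run goes from 7.216 to 23.379 seconds elapsed.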