[ARROW-16578] [R] unique() and is.na() on a column of a tibble is much slower after writing to and reading from a parquet file - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 7.0.0, 8.0.0
Fix Version/s: 9.0.0
Component/s: R
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/31935

Description

Calling unique() on a character column of a tibble is much slower after the tibble has been written to a Parquet file and read back.

Here is a reprex.

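library(arrow)  # provides write_parquet() and read_parquet()

# 1,000,000 strings drawn from 20 distinct values
df1 <- tibble::tibble(x = as.character(floor(runif(1000000) * 20)))
write_parquet(df1, "/tmp/test.parquet")
df2 <- read_parquet("/tmp/test.parquet")

# Original in-memory column
system.time(unique(df1$x))
# Result on my late 2020 MacBook Pro with M1 processor:
#    user  system elapsed
#   0.020   0.000   0.021

# Same column after the Parquet round trip
system.time(unique(df2$x))
#    user  system elapsed
#   5.230   0.419   5.649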
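
The issue title also names is.na(), which the reprex above does not time. The following is a minimal sketch of how both calls could be compared in one run. It is not taken from the report: the helper `elapsed_of()` is hypothetical, a temporary file replaces /tmp/test.parquet, and a plain data.frame stands in for the tibble so that only arrow is required. No timings are claimed here, since they depend on the arrow version and the machine.

```r
# Hypothetical helper (not part of arrow): elapsed seconds for one call of f(x).
elapsed_of <- function(f, x) {
  force(x)  # evaluate the argument first so that reading the file is not timed
  system.time(f(x))[["elapsed"]]
}

# Same data shape as the reprex: 1,000,000 strings, 20 distinct values.
df1 <- data.frame(
  x = as.character(floor(runif(1000000) * 20)),
  stringsAsFactors = FALSE
)

if (requireNamespace("arrow", quietly = TRUE)) {
  path <- tempfile(fileext = ".parquet")
  arrow::write_parquet(df1, path)

  # Re-read the file before each measurement so that every call
  # starts from a freshly read column.
  timings <- c(
    unique_in_memory    = elapsed_of(unique, df1$x),
    unique_from_parquet = elapsed_of(unique, arrow::read_parquet(path)$x),
    is_na_in_memory     = elapsed_of(is.na, df1$x),
    is_na_from_parquet  = elapsed_of(is.na, arrow::read_parquet(path)$x)
  )
  print(timings)
}
```

Re-reading the file for each measurement is a deliberate choice in this sketch: it avoids assuming anything about whether an earlier call on the same column changes the cost of a later one.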

Issue Links

is related to

ARROW-16188 [R] Fix excess "Handling string data with embedded nuls" warning in tests

Open

links to

GitHub Pull Request #13415

People

Assignee: Hideaki Hayashi

Reporter: Hideaki Hayashi

Votes: 0

Watchers: 4

Dates

Created: 14/May/22 00:46

Updated: 11/Jan/23 11:45

Resolved: 22/Jul/22 17:04

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 1.5h