[ARROW-11067] [C++] CSV reader returns nulls for some strings on macOS - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.0.0
Component/s: C++, R
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/26981

Description

A sample file is attached, showing 10 rows each of strings that consistently fail (false_na = TRUE) and strings that consistently succeed (false_na = FALSE). The strings are in the column `json_string`. In case it is relevant, they are GeoJSON strings between 33,229 and 202,515 characters long.

When I read this sample file with other R CSV readers (readr and data.table shown), the file is imported correctly and there are no NAs in the json_string column.

When I read it with arrow::read_csv_arrow, 50% of the values in the json_string column come back as NA. Setting as_data_frame to TRUE or FALSE does not change the behavior, so this may not be limited to the R interface, but I can't help debug much further upstream.

aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
bbb <- data.table::fread("demo_data.csv")
ccc <- readr::read_csv("demo_data.csv")
mean(is.na(aaa1$json_string)) # 0.5
mean(is.na(aaa2$column(1))) # Scalar 0.5
mean(is.na(bbb$json_string)) # 0
mean(is.na(ccc$json_string)) # 0

arrow 2.0 (latest CRAN)
readr 1.4.0
data.table 1.13.2
R version 4.0.1 (2020-06-06)
MacOS Catalina 10.15.7 / x86_64-apple-darwin17.0
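
The comparison above reports only the share of NAs. To see which rows are affected and how long the lost strings are, one option is a small helper like the sketch below. `find_false_na` is a hypothetical name, not part of arrow, and the sketch assumes both readers return the rows in the same order. It only locates the false NAs and makes no claim about their cause.

```r
# Hypothetical helper (not part of arrow): list the rows where one reader
# returned NA but a reference reader returned a value.
find_false_na <- function(candidate, reference) {
  stopifnot(length(candidate) == length(reference))
  false_na <- is.na(candidate) & !is.na(reference)
  data.frame(
    row = which(false_na),
    nchar_reference = nchar(reference[false_na]),
    stringsAsFactors = FALSE
  )
}

# Synthetic check: the second value is "lost" by the candidate reader.
reference <- c("short", strrep("x", 40000), "other")
candidate <- c("short", NA, "other")
result <- find_false_na(candidate, reference)
print(result)

# Applied to the attached file, only if the file and both packages are present.
if (file.exists("demo_data.csv") &&
    requireNamespace("arrow", quietly = TRUE) &&
    requireNamespace("readr", quietly = TRUE)) {
  from_arrow <- arrow::read_csv_arrow("demo_data.csv")
  from_readr <- readr::read_csv("demo_data.csv")
  print(find_false_na(from_arrow$json_string, from_readr$json_string))
}
```

Listing the character counts next to the row numbers makes it easier to check whether the failures correlate with string length.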

Attachments

arrow_explanation.png, 30/Dec/20 01:15, 119 kB, John Sheffield
arrow_failure_cases.csv, 29/Dec/20 23:39, 29 kB, John Sheffield
arrow_failure_cases.csv, 29/Dec/20 23:37, 29 kB, John Sheffield
arrowbug1.png, 29/Dec/20 23:39, 245 kB, John Sheffield
arrowbug1.png, 29/Dec/20 23:30, 593 kB, John Sheffield
demo_data.csv, 29/Dec/20 16:08, 594 kB, John Sheffield

Issue Links

links to

GitHub Pull Request #9100

People

Assignee: Weston Pace

Reporter: John Sheffield

Votes: 0

Watchers: 6

Dates

Created: 29/Dec/20 16:25

Updated: 11/Jan/23 08:17

Resolved: 05/Jan/21 21:11

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 1h 40m