[ARROW-14644] [C++] open_dataset doesn't ignore BOM in csv file - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 6.0.0
Fix Version/s: 7.0.0
Component/s: C++
Labels:
- pull-request-available
Environment: macOS Mojave, R 4.1.1
External issue URL: https://github.com/apache/arrow/issues/30187
Language:
- C++

Description

DragosMG: I believe this is a bug that should be fixed in the C++ code as there isn't an option we could leverage on the R side.

I have a draft PR with a failing test, but it's identical to Andy's reproducible example below.

Original description below:
======================
When a CSV file starts with a byte order mark (BOM), arrow::open_dataset() reads the file but populates the first column with NA values. A similar issue appears to have been raised and fixed in ARROW-5413: https://issues.apache.org/jira/browse/ARROW-5413. read_csv_arrow() handles the BOM correctly.

Reproducible Example:

library(arrow)
library(dplyr)

writeLines('\xef\xbb\xbfa,b\n1,2\n', con = "testfile.csv")

read_csv_arrow("testfile.csv") # works
#> # A tibble: 1 × 2
#>       a     b
#>   <int> <int>
#> 1     1     2

open_dataset("testfile.csv", format = "csv") |> 
  collect()
#> # A tibble: 1 × 2
#>       a     b
#>   <int> <int>
#> 1    NA     2
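
For illustration only, here is a minimal standalone C++ sketch of what "ignoring the BOM" means for the header row. It is not the code from the linked pull requests and does not use Arrow's API; `StripUtf8Bom` and `HeaderColumns` are hypothetical helper names, and the header splitter is deliberately naive (no quoting or escaping).

```cpp
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (not Arrow's API): drop a leading UTF-8 BOM, if present.
std::string_view StripUtf8Bom(std::string_view data) {
  constexpr std::string_view kBom = "\xEF\xBB\xBF";
  if (data.substr(0, kBom.size()) == kBom) {
    data.remove_prefix(kBom.size());
  }
  return data;
}

// Hypothetical helper: split the first line of a CSV buffer into column names.
// Deliberately naive: commas inside quoted fields are not handled.
std::vector<std::string> HeaderColumns(std::string_view data) {
  std::string_view line = data.substr(0, data.find('\n'));
  std::vector<std::string> names;
  while (true) {
    const auto comma = line.find(',');
    names.emplace_back(line.substr(0, comma));
    if (comma == std::string_view::npos) break;
    line.remove_prefix(comma + 1);
  }
  return names;
}
```

Applied to the bytes of testfile.csv above, `HeaderColumns` without stripping yields a first column name made of the three BOM bytes followed by `a`, which does not compare equal to `a`; after `StripUtf8Bom` the names are `a` and `b`. A mismatched first column name is one plausible reading of the NA column in the output above, though this report does not confirm the mechanism.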

Issue Links

causes: ARROW-15041 [R] Flaky BOM removal test (Resolved)

links to:
- GitHub Pull Request #11871
- GitHub Pull Request #11892

People

Assignee: Will Jones
Reporter: Will Jones
Votes: 1
Watchers: 5

Dates

Created: 09/Nov/21 17:41
Updated: 11/Jan/23 08:41
Resolved: 08/Dec/21 20:48

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 3h 50m