Apache Arrow / ARROW-14644

[C++] open_dataset doesn't ignore BOM in csv file

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.0.0
    • Fix Version/s: 7.0.0
    • Component/s: C++
    • Environment: macOS Mojave, R 4.1.1

    Description

      DragosMG: I believe this is a bug that should be fixed in the C++ code as there isn't an option we could leverage on the R side.

      I have a draft PR with a failing test, but the test is identical to Andy's reproducible example below.

      Original description below:
      ======================
      When a CSV file starts with a byte order mark (BOM), arrow::open_dataset() reads the file but populates the first column with NA values. It appears a similar issue was raised and fixed here: https://issues.apache.org/jira/browse/ARROW-5413. read_csv_arrow() deals with the BOM correctly.

      Reproducible Example:

      library(arrow)
      library(dplyr)
      
      writeLines('\xef\xbb\xbfa,b\n1,2\n', con = "testfile.csv")
      
      read_csv_arrow("testfile.csv") # works
      #> # A tibble: 1 × 2
      #>       a     b
      #>   <int> <int>
      #> 1     1     2
      
      open_dataset("testfile.csv", format = "csv") |> 
        collect()
      #> # A tibble: 1 × 2
      #>       a     b
      #>   <int> <int>
      #> 1    NA     2

            People

              wjones127 Will Jones
              wjones127 Will Jones
              Votes: 1
              Watchers: 5

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 3h 50m