Apache Arrow / ARROW-18219

[R] read_csv_arrow fails when a string contains a backslash-escaped quote mark followed by a comma


Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 10.0.0
    • Fix Version/s: None
    • Component/s: R
    • Labels: None

    Description

      `read_csv_arrow()` incorrectly parses CSV files when a string value contains a comma that appears after a backslash-escaped quote mark. The problem was originally noted by Thomas Klebel: https://scicomm.xyz/@tklebel/109270436511066953

      The following example reproduces the error:

      x <- tempfile()
      readr::write_lines(
      '
      id,text
      1,"some text on \\"BLAH\\" and X, and Y also"
      ', x)
      
      cat(system(paste('cat', x), intern = TRUE), sep = "\n")
      #> 
      #> id,text
      #> 1,"some text on \"BLAH\" and X, and Y also"
      arrow::read_csv_arrow(x, escape_backslash = TRUE)
      #> Error:
      #> ! Invalid: CSV parse error: Expected 2 columns, got 3: 1,"some text on \"BLAH\" and X, and Y also"
      
      #> Backtrace:
      #> ▆
      #> 1. └─arrow (local) `<fn>`(file = x, escape_backslash = TRUE, delim = ",")
      #> 2. └─base::tryCatch(...) at r/R/csv.R:217:2
      #> 3. └─base (local) tryCatchList(expr, classes, parentenv, handlers)
      #> 4. └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
      #> 5. └─value[[3L]](cond)
      #> 6. └─arrow:::augment_io_error_msg(e, call, schema = schema) at r/R/csv.R:222:6
      #> 7. └─rlang::abort(msg, call = call) at r/R/util.R:251:2
      

      <sup>Created on 2022-11-02 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
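
For comparison, a parser that honours backslash escapes would be expected to split the failing row into exactly two fields. The sketch below uses Python's standard-library `csv` module rather than Arrow's own parser, so it only illustrates the expected result and says nothing about where the Arrow bug lies. The variable names are illustrative.

```python
import csv
import io

# The row that read_csv_arrow() rejects, exactly as it appears in the file.
failing_row = r'1,"some text on \"BLAH\" and X, and Y also"'

# Treat backslash as the escape character and disable quote doubling,
# which is roughly what escape_backslash = TRUE asks for (an assumption
# about intent, not a statement about Arrow's implementation).
reader = csv.reader(io.StringIO(failing_row), escapechar="\\", doublequote=False)
fields = next(reader)

print(len(fields))
print(fields)
```

Here the second field keeps the embedded comma, so the row has two columns rather than three.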

      This version includes four rows that might be expected to error but do not:

      x <- tempfile()
      readr::write_lines(
      '
      id,text
      2,"some text on X and Y"
      3,"some text on X, and Y"
      4,"some text on \\"BLAH\\"
      5,"some text on X and Y, and \\"BLAH\\" also"
      ', x)
      
      cat(system(paste('cat', x), intern = TRUE), sep = "\n")
      #> 
      #> id,text
      #> 2,"some text on X and Y"
      #> 3,"some text on X, and Y"
      #> 4,"some text on \"BLAH\"
      #> 5,"some text on X and Y, and \"BLAH\" also"
      arrow::read_csv_arrow(x, escape_backslash = TRUE)
      #> # A tibble: 4 × 2
      #> id text 
      #> <int> <chr> 
      #> 1 2 "some text on X and Y" 
      #> 2 3 "some text on X, and Y" 
      #> 3 4 "some text on \\BLAH\\\"" 
      #> 4 5 "some text on X and Y, and \\BLAH\\\" also\""
      


      I'm not sure whether the problem is R-specific. I've partially reproduced the error using reticulate and pyarrow as follows, but note that pyarrow errors at a different point: it appears to fail on the row where the comma precedes the backslash-escaped quote mark:

      x <- tempfile()
      readr::write_lines(
      '
      id,text
      1,"some text on X and Y"
      2,"some text on X, and Y"
      3,"some text on \\"BLAH\\"
      4,"some text on X and Y, and \\"BLAH\\" also"
      5,"some text on \\"BLAH\\" and X, and Y also"
      ', x)
      
      cat(system(paste('cat', x), intern = TRUE), sep = "\n")
      #> 
      #> id,text
      #> 1,"some text on X and Y"
      #> 2,"some text on X, and Y"
      #> 3,"some text on \"BLAH\"
      #> 4,"some text on X and Y, and \"BLAH\" also"
      #> 5,"some text on \"BLAH\" and X, and Y also"
      
      csv <- reticulate::import("pyarrow.csv")
      opt <- csv$ParseOptions(escape_char = '\\')
      csv$read_csv(x, parse_options = opt)
      #> Error in py_call_impl(callable, dots$args, dots$keywords): pyarrow.lib.ArrowInvalid: CSV parse error: Expected 2 columns, got 3: 3,"some text on \"BLAH\"
      #> 4,"some text on X and Y, and \"BLAH\" also"
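
One thing that may be worth checking before comparing the two failures: in the input above, row 3 seems to have no closing unescaped quote, which could explain why the pyarrow error message spans rows 3 and 4. The sketch below is a simple hand-rolled check that counts quotes not protected by a backslash escape and reports rows with an odd count. The helper `unescaped_quote_count` is hypothetical and not part of any library.

```python
def unescaped_quote_count(line: str, escape: str = "\\", quote: str = '"') -> int:
    """Count quote characters that are not protected by the escape character."""
    count = 0
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == escape:
            i += 2  # skip the escape and the character it protects
            continue
        if ch == quote:
            count += 1
        i += 1
    return count


# Data rows as printed by cat() in the pyarrow example.
rows = [
    r'1,"some text on X and Y"',
    r'2,"some text on X, and Y"',
    r'3,"some text on \"BLAH\"',
    r'4,"some text on X and Y, and \"BLAH\" also"',
    r'5,"some text on \"BLAH\" and X, and Y also"',
]

# A row with an odd number of unescaped quotes leaves a quoted field open.
unbalanced = [row.split(",", 1)[0] for row in rows if unescaped_quote_count(row) % 2]
print(unbalanced)
```

If row 3 is indeed unbalanced, adding the closing quote to that row and rerunning the pyarrow call would show whether pyarrow fails on row 5 in the same way as `read_csv_arrow()`.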
      


          People

            Assignee: Unassigned
            Reporter: Danielle Navarro (djnavarro)