Apache Arrow / ARROW-16682

[Python] CSV reader: allow parsing without encoding errors


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 8.0.0
    • Fix Version/s: None
    • Component/s: Python
    • Labels: None

    Description

      When reading arbitrary CSV files, it is not always possible to infer or guess the correct encoding. The Arrow CSV reader currently fails if any byte cannot be decoded with the specified encoding (see example below).

      With pandas.read_csv(), I can often get a result that is 99.9% correct by passing it a text stream decoded in Python with errors="replace" (or "ignore" etc.).

      PyArrow's csv.read_csv(), on the other hand, accepts neither an already decoded text stream (TypeError: binary file expected, got text file) nor a parameter that configures what to do with decoding errors. As a result, the parser simply fails.

      The simplest solution would probably be to expose Python's error handling in pyarrow.csv.ReadOptions (e.g. encoding_errors: "strict" | "ignore" | "replace" ...).
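
To illustrate what such an option might select between, here is a small sketch that applies three of Python's built-in codec error handlers to the header line of the toy example further down. It uses only the standard library; `encoding_errors` remains the hypothetical option name proposed above, and nothing here calls a pyarrow API.

```python
# The header line from the toy example in this issue, encoded as UTF-8.
raw = "col_😀_1, col2".encode("utf-8")

# Decode as ASCII with each of the non-raising error handlers.
results = {
    errors: raw.decode("ascii", errors=errors)
    for errors in ("ignore", "replace", "backslashreplace")
}

# "strict" (Python's default) raises instead of returning a string.
try:
    raw.decode("ascii", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

for errors, decoded in results.items():
    print(f"{errors:>16}: {decoded}")
print(f"{'strict':>16}: raised UnicodeDecodeError = {strict_failed}")
```

With "ignore" the four emoji bytes are dropped, with "replace" each becomes U+FFFD, and with "backslashreplace" each is kept as an escape sequence, so the choice affects how much of the original data survives.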

      It would also be useful to document the behaviour of the CSV reader, e.g. that it only accepts binary streams, and how encoding errors are handled. In particular, it is unclear what "Columns that cannot decode using this encoding can still be read as Binary" means, since the parser currently fails if any byte cannot be decoded.

      Toy example:

       

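      import io
      import pyarrow as pa
      import pyarrow.csv  # makes the pa.csv submodule available

      txt = """
      col_😀_1, col2
      0,a
      1,b
      """
      buffer = io.BytesIO(txt.encode("utf-8"))
      # Decoding the UTF-8 bytes as ASCII fails on the emoji in the header.
      pa.csv.read_csv(buffer, pa.csv.ReadOptions(encoding="ascii"))
      # UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 5: ordinal not in range(128)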

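      whereas the same bytes can be read with pandas:

      import pandas as pd

      buffer = io.BytesIO(txt.encode("utf-8"))
      # Undecodable bytes are replaced with U+FFFD instead of raising.
      text = io.TextIOWrapper(buffer, encoding="ascii", errors="replace")
      pd.read_csv(text)
      # Output:
      #    col_����_1  col2
      # 0           0     a
      # 1           1     b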
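
A possible interim workaround, sketched under the assumption that the whole file fits in memory: decode the bytes leniently in Python and re-encode them as UTF-8, so that a binary-only reader receives a stream with no undecodable bytes. The helper name `transcode_to_utf8` is hypothetical and not part of pyarrow.

```python
import io


def transcode_to_utf8(raw: bytes, encoding: str, errors: str = "replace") -> io.BytesIO:
    """Decode `raw` leniently, then re-encode it as UTF-8 in a binary buffer.

    Hypothetical helper, not part of pyarrow: it applies Python's codec error
    handling up front so a binary-only reader never sees undecodable bytes.
    Handlers that produce lone surrogates (e.g. "surrogateescape") are not
    suitable here, because the result could not be re-encoded as UTF-8.
    """
    text = raw.decode(encoding, errors=errors)
    return io.BytesIO(text.encode("utf-8"))


txt = """
col_😀_1, col2
0,a
1,b
"""
buffer = transcode_to_utf8(txt.encode("utf-8"), encoding="ascii", errors="replace")
cleaned = buffer.getvalue().decode("utf-8")  # valid UTF-8 by construction
print(cleaned)
```

The returned buffer is a binary stream, so it could presumably be passed to pa.csv.read_csv(buffer) with the default UTF-8 encoding. That step is left out of the sketch and has not been verified here. The approach also costs a full in-memory copy of the file, which is part of why a built-in option would be preferable.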
      

       

          People

            Assignee: Unassigned
            Reporter: Thomas Buhrmann (buhrmann)
            Votes: 0
            Watchers: 3
