  1. Apache Arrow
  2. ARROW-4883

[Python] read_csv() returns garbage if given file object in text mode

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.12.1
    • Fix Version/s: 0.15.0
    • Component/s: Python
    • Labels:
    • Environment:
      Python: 3.7.2, 2.7.15
      PyArrow: 0.12.1
      OS: MacOS 10.13.6 (High Sierra)

      Description

      Summary:

      Python 3:

      • read_csv returns mojibake if given file objects opened in text mode. It behaves as expected in binary mode.
      • Files encoded in anything other than valid UTF-8 raise a UnicodeDecodeError.

      Python 2:

      read_csv only handles ASCII files. If given a UTF-8 file containing characters above U+007F, it raises a UnicodeDecodeError.

      To reproduce:

      1) Create a CSV like this

      Header
      123.45
      

      2) Then run this code on Python 3:

      >>> import pyarrow.csv as pa_csv
      >>> pa_csv.read_csv(open('test.csv', 'r'))
      pyarrow.Table
      䧢: string
      

      Notice that the file object is opened in text mode. Changing the encoding doesn't help:

      >>> pa_csv.read_csv(open('test.csv', 'r', encoding='utf-8'))
      pyarrow.Table
      䧢: string
      
      >>> pa_csv.read_csv(open('test.csv', 'r', encoding='ascii'))
      pyarrow.Table
      䧢: string
      
      >>> pa_csv.read_csv(open('test.csv', 'r', encoding='iso-8859-1'))
      pyarrow.Table
      䧢: string
      

      If I open the file in binary mode it works:

      >>> pa_csv.read_csv(open('test.csv', 'rb'))
      pyarrow.Table
      Header: double
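
Since binary mode works for UTF-8 input, one possible workaround for files in other encodings is to transcode them to UTF-8 bytes before calling read_csv. The sketch below uses only the standard library; `open_as_utf8_bytes` is a hypothetical helper name, not part of PyArrow, and it reads the whole file into memory, so it assumes the file is reasonably small.

```python
import io


def open_as_utf8_bytes(path, source_encoding):
    """Read `path` as text in `source_encoding`; return a binary UTF-8 buffer.

    Hypothetical helper for illustration only (not a PyArrow API).
    """
    # newline="" keeps the original line endings untouched.
    with open(path, "r", encoding=source_encoding, newline="") as text_file:
        text = text_file.read()
    # io.BytesIO behaves like a file opened in binary mode.
    return io.BytesIO(text.encode("utf-8"))


# Assumed usage, given that read_csv accepts binary file objects as shown above:
#   table = pa_csv.read_csv(open_as_utf8_bytes("test.csv", "utf-16"))
```

The returned buffer contains no byte-order mark, because decoding with the "utf-16" codec consumes it.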
      

      I tried this with a file encoded in UTF-16, and displaying the resulting table raised a UnicodeDecodeError:

      Traceback (most recent call last):
        File "<redacted>/.pyenv/versions/3.7.2/lib/python3.7/site-packages/ptpython/repl.py", line 84, in _process_text
          self._execute(line)
        File "<redacted>/.pyenv/versions/3.7.2/lib/python3.7/site-packages/ptpython/repl.py", line 139, in _execute
          result_str = '%s\n' % repr(result).decode('utf-8')
        File "pyarrow/table.pxi", line 960, in pyarrow.lib.Table.__repr__
        File "pyarrow/types.pxi", line 903, in pyarrow.lib.Schema.__str__
        File "<redacted>/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/compat.py", line 143, in frombytes
          return o.decode('utf8')
      UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
      
      'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
      

      Presumably this is because the code always assumes the file is in UTF-8.
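
The byte 0xff in the traceback is consistent with a UTF-16 byte-order mark being decoded as UTF-8. The following standalone sketch reproduces the same error message without PyArrow; it assumes a little-endian file with a BOM, since the byte order of the original file is not stated.

```python
import codecs

# Build UTF-16 bytes with an explicit little-endian BOM (an assumption;
# the original file's byte order is not given in the report).
utf16_bytes = codecs.BOM_UTF16_LE + "Header\n123.45\n".encode("utf-16-le")
first_byte = utf16_bytes[0]  # the BOM starts with 0xff

try:
    utf16_bytes.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    # 0xff can never start a valid UTF-8 sequence.
    decode_error = exc

print(hex(first_byte))
print(decode_error)
```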

      Python 2 behavior

      Python 2 behaves differently: it uses the ASCII codec by default, so when handed a file encoded in UTF-8, read_csv() returns without an error. The failure only appears when you try to access the table:

      >>> t = pa_csv.read_csv(open('/Users/diegoargueta/Desktop/test.csv', 'r'))
      
      >>> list(t)
      Traceback (most recent call last):
        File "/<redacted>/.pyenv/versions/2.7.15/envs/gds/lib/python2.7/site-packages/ptpython/repl.py", line 84, in _process_text
          self._execute(line)
        File "<redacted>/.pyenv/versions/2.7.15/envs/gds/lib/python2.7/site-packages/ptpython/repl.py", line 139, in _execute
          result_str = '%s\n' % repr(result).decode('utf-8')
        File "pyarrow/table.pxi", line 387, in pyarrow.lib.Column.__repr__
          result.write('\n{}'.format(str(self.data)))
      UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 11: ordinal not in range(128)
      
      'ascii' codec can't decode byte 0xe4 in position 11: ordinal not in range(128)
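
The byte 0xe4 in this traceback is what you would expect from a multi-byte UTF-8 character being decoded as ASCII. The sketch below shows the same failure in Python 3 with an explicit decode; the character "中" is an arbitrary example, since the contents of the original file are not shown, and the implicit decode that Python 2 performs is only imitated here.

```python
# "中" (U+4E2D) is an arbitrary stand-in for the non-ASCII data in the file.
sample = "中".encode("utf-8")  # three bytes, the first of which is 0xe4

try:
    # Python 2 decodes byte strings as ASCII implicitly when they are mixed
    # with unicode strings; here the decode is spelled out explicitly.
    sample.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

print(ascii_error)
```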
      

      Expectation

      We should be able to hand read_csv() a file in text mode so that the CSV file can be in any text encoding.
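
One way this expectation could be met is an adapter that re-encodes a text-mode file as a UTF-8 byte stream on the fly, so the reader only ever sees UTF-8 bytes regardless of the file's encoding on disk. This is a sketch of the idea, not how PyArrow implements anything: `Utf8Reencoder` and `chunk_chars` are hypothetical names, and whether read_csv accepts such an object has not been verified here.

```python
import io


class Utf8Reencoder(io.RawIOBase):
    """Hypothetical adapter: expose a text-mode file as UTF-8 bytes."""

    def __init__(self, text_file, chunk_chars=8192):
        self._text_file = text_file      # any object with a text-mode read()
        self._chunk_chars = chunk_chars  # characters pulled per refill
        self._pending = b""              # encoded bytes not yet handed out

    def readable(self):
        return True

    def readinto(self, buffer):
        # Refill from the text file once the pending bytes are used up.
        if not self._pending:
            chunk = self._text_file.read(self._chunk_chars)
            if not chunk:
                return 0  # end of file
            self._pending = chunk.encode("utf-8")
        n = min(len(buffer), len(self._pending))
        buffer[:n] = self._pending[:n]
        self._pending = self._pending[n:]
        return n


# Example: the bytes come out as UTF-8 whatever the source encoding was.
with io.StringIO("Header\n123.45\n") as text_file:
    print(Utf8Reencoder(text_file).read())
```

The text file is decoded by Python using whatever encoding it was opened with, so the adapter itself never needs to know that encoding.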

            People

            • Assignee:
              Unassigned
              Reporter:
              yiannisliodakis Diego Argueta