Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version: 0.12.1
Environment:
Python: 3.7.2, 2.7.15
PyArrow: 0.12.1
OS: MacOS 10.13.6 (High Sierra)
Description
Summary:
Python 3:
- read_csv returns mojibake if given file objects opened in text mode. It behaves as expected in binary mode.
- Files encoded in anything other than valid UTF-8 will cause a crash.
Python 2:
- read_csv only handles ASCII files. If given a file in UTF-8 with characters over U+007F, it crashes.
To reproduce:
1) Create a CSV like this:
Header
123.45
2) Then run this code on Python 3:
>>> import pyarrow.csv as pa_csv
>>> pa_csv.read_csv(open('test.csv', 'r'))
pyarrow.Table
䧢: string
Notice the file object is opened in text mode. Changing the encoding doesn't help:
>>> pa_csv.read_csv(open('test.csv', 'r', encoding='utf-8'))
pyarrow.Table
䧢: string
>>> pa_csv.read_csv(open('test.csv', 'r', encoding='ascii'))
pyarrow.Table
䧢: string
>>> pa_csv.read_csv(open('test.csv', 'r', encoding='iso-8859-1'))
pyarrow.Table
䧢: string
If I open the file in binary mode it works:
>>> pa_csv.read_csv(open('test.csv', 'rb'))
pyarrow.Table
Header: double
I tried this with a file encoded in UTF-16 and it raised a UnicodeDecodeError:
Traceback (most recent call last):
  File "<redacted>/.pyenv/versions/3.7.2/lib/python3.7/site-packages/ptpython/repl.py", line 84, in _process_text
    self._execute(line)
  File "<redacted>/.pyenv/versions/3.7.2/lib/python3.7/site-packages/ptpython/repl.py", line 139, in _execute
    result_str = '%s\n' % repr(result).decode('utf-8')
  File "pyarrow/table.pxi", line 960, in pyarrow.lib.Table.__repr__
  File "pyarrow/types.pxi", line 903, in pyarrow.lib.Schema.__str__
  File "<redacted>/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/compat.py", line 143, in frombytes
    return o.decode('utf8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
Presumably this is because the code always assumes the file is in UTF-8.
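The byte 0xff at position 0 is consistent with a UTF-16 byte order mark being decoded as UTF-8. A minimal sketch that shows the same kind of failure without PyArrow, assuming the file was little-endian UTF-16 with a BOM (the sample bytes and the try_utf8 helper are made up for illustration):

```python
import codecs

# Hypothetical sample: the two-line CSV from this report, encoded as
# little-endian UTF-16 with a leading byte order mark (b'\xff\xfe').
utf16_bytes = codecs.BOM_UTF16_LE + "Header\n123.45\n".encode("utf-16-le")


def try_utf8(data):
    """Return the decoded text, or the UnicodeDecodeError if UTF-8 decoding fails."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc


result = try_utf8(utf16_bytes)
print(repr(result))
```

The first byte of the BOM is not a valid UTF-8 start byte, so decoding stops immediately, which matches the position reported in the traceback.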
Python 2 behavior
Python 2 behaves differently: it uses the ASCII codec by default, so when handed a file encoded in UTF-8, read_csv returns without an error. The error only appears when you try to access the table:
>>> t = pa_csv.read_csv(open('/Users/diegoargueta/Desktop/test.csv', 'r'))
>>> list(t)
Traceback (most recent call last):
  File "/<redacted>/.pyenv/versions/2.7.15/envs/gds/lib/python2.7/site-packages/ptpython/repl.py", line 84, in _process_text
    self._execute(line)
  File "<redacted>/.pyenv/versions/2.7.15/envs/gds/lib/python2.7/site-packages/ptpython/repl.py", line 139, in _execute
    result_str = '%s\n' % repr(result).decode('utf-8')
  File "pyarrow/table.pxi", line 387, in pyarrow.lib.Column.__repr__
    result.write('\n{}'.format(str(self.data)))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 11: ordinal not in range(128)

'ascii' codec can't decode byte 0xe4 in position 11: ordinal not in range(128)
Expectation
We should be able to hand read_csv() a file in text mode so that the CSV file can be in any text encoding.
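Until that works, one possible workaround on Python 3 is to transcode the file to UTF-8 bytes before handing it to read_csv, since binary input is the case that behaves correctly above. This is only a sketch: to_utf8_buffer and read_csv_any_encoding are hypothetical helper names, not part of PyArrow, and the whole file is held in memory.

```python
import io


def to_utf8_buffer(path, encoding):
    """Read `path` as text in `encoding` and return the content as UTF-8 bytes.

    Hypothetical helper, not a PyArrow API. newline="" keeps the original
    line endings instead of translating them.
    """
    with open(path, "r", encoding=encoding, newline="") as f:
        text = f.read()
    return io.BytesIO(text.encode("utf-8"))


def read_csv_any_encoding(path, encoding):
    """Parse a CSV stored in any text encoding by feeding read_csv UTF-8 bytes."""
    # Imported lazily so the transcoding helper can be used without PyArrow.
    import pyarrow.csv as pa_csv

    return pa_csv.read_csv(to_utf8_buffer(path, encoding))
```

For example, read_csv_any_encoding('test.csv', 'utf-16') should behave like the binary-mode call on a UTF-8 copy of the file. For large files a streaming transcoder would be a better fit than reading everything at once.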