Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.12.1
Fix Version/s: 0.15.0
Component/s: Python
Labels:
- csv
Environment:
Python: 3.7.2, 2.7.15
PyArrow: 0.12.1
OS: MacOS 10.13.6 (High Sierra)

External issue URL:
https://github.com/apache/arrow/issues/21393

Description

Summary:

Python 3:

read_csv returns mojibake when given a file object opened in text mode. It behaves as expected in binary mode.
Files in any encoding other than valid UTF-8 cause a crash (UnicodeDecodeError).

Python 2:

read_csv only handles ASCII files. If given a file in UTF-8 with characters over U+007F, it crashes.

To reproduce:

1) Create a CSV like this

Header
123.45

2) Then run this code on Python 3:

>>> import pyarrow.csv as pa_csv
>>> pa_csv.read_csv(open('test.csv', 'r'))
pyarrow.Table
䧢: string

Note that the file object is opened in text mode. Passing an explicit encoding doesn't help:

>>> pa_csv.read_csv(open('test.csv', 'r', encoding='utf-8'))
pyarrow.Table
䧢: string

>>> pa_csv.read_csv(open('test.csv', 'r', encoding='ascii'))
pyarrow.Table
䧢: string

>>> pa_csv.read_csv(open('test.csv', 'r', encoding='iso-8859-1'))
pyarrow.Table
䧢: string

If I open the file in binary mode it works:

>>> pa_csv.read_csv(open('test.csv', 'rb'))
pyarrow.Table
Header: double

I tried this with a file encoded in UTF-16 and it raised a UnicodeDecodeError:

Traceback (most recent call last):
  File "<redacted>/.pyenv/versions/3.7.2/lib/python3.7/site-packages/ptpython/repl.py", line 84, in _process_text
    self._execute(line)
  File "<redacted>/.pyenv/versions/3.7.2/lib/python3.7/site-packages/ptpython/repl.py", line 139, in _execute
    result_str = '%s\n' % repr(result).decode('utf-8')
  File "pyarrow/table.pxi", line 960, in pyarrow.lib.Table.__repr__
  File "pyarrow/types.pxi", line 903, in pyarrow.lib.Schema.__str__
  File "<redacted>/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/compat.py", line 143, in frombytes
    return o.decode('utf8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Presumably this is because the code always assumes the file is in UTF-8.
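
The 0xff at position 0 in the traceback is consistent with a UTF-16 byte order mark being decoded as UTF-8. The following is a minimal sketch using only the standard library. It assumes the test file was written as little-endian UTF-16 with a BOM (the report does not say which byte order was used), and `try_utf8` is a hypothetical helper introduced only for this illustration:

```python
import codecs

# Assumption: the UTF-16 test file starts with a little-endian byte order mark.
raw = codecs.BOM_UTF16_LE + "Header\n123.45\n".encode("utf-16-le")


def try_utf8(data):
    """Return None if data is valid UTF-8, otherwise the UnicodeDecodeError."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc
    return None


err = try_utf8(raw)
print(raw[:2])                                # b'\xff\xfe' (the BOM)
print(err.start, hex(err.object[err.start]))  # 0 0xff
```

The first byte of the BOM is never valid in UTF-8, so any code path that unconditionally decodes the raw bytes as UTF-8 fails on the very first byte, as in the traceback above.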

Python 2 behavior

Python 2 behaves differently: it uses the ASCII codec by default, so when handed a file encoded in UTF-8, read_csv returns without an error. The failure only surfaces when the table is accessed:

>>> t = pa_csv.read_csv(open('/Users/diegoargueta/Desktop/test.csv', 'r'))

>>> list(t)
Traceback (most recent call last):
  File "/<redacted>/.pyenv/versions/2.7.15/envs/gds/lib/python2.7/site-packages/ptpython/repl.py", line 84, in _process_text
    self._execute(line)
  File "<redacted>/.pyenv/versions/2.7.15/envs/gds/lib/python2.7/site-packages/ptpython/repl.py", line 139, in _execute
    result_str = '%s\n' % repr(result).decode('utf-8')
  File "pyarrow/table.pxi", line 387, in pyarrow.lib.Column.__repr__
    result.write('\n{}'.format(str(self.data)))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 11: ordinal not in range(128)

'ascii' codec can't decode byte 0xe4 in position 11: ordinal not in range(128)

Expectation

We should be able to hand read_csv() a file in text mode so that the CSV file can be in any text encoding.
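
Until text-mode file objects are handled, one possible workaround is to transcode the file to UTF-8 bytes first, since the report shows that binary UTF-8 input parses correctly. This is a minimal sketch, not PyArrow's own approach: `to_utf8_buffer` is a hypothetical helper, and it reads the whole file into memory, which may not suit large files.

```python
import io


def to_utf8_buffer(path, encoding):
    """Decode the file at `path` from `encoding` and return a binary
    buffer holding the same text re-encoded as UTF-8."""
    # newline="" leaves line endings untouched, so quoted fields that
    # contain embedded newlines survive the round trip unchanged.
    with open(path, "r", encoding=encoding, newline="") as f:
        text = f.read()
    return io.BytesIO(text.encode("utf-8"))
```

The returned buffer is a binary file-like object, so it should be usable wherever the report passes a file opened with 'rb', for example `pa_csv.read_csv(to_utf8_buffer('test.csv', 'utf-16'))`.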

People

Assignee: Unassigned

Reporter: Diego Argueta

Votes: 0

Watchers: 3

Dates

Created: 15/Mar/19 01:08

Updated: 11/Jan/23 07:36

Resolved: 18/Sep/19 16:38