[ARROW-374] Python: clarify unicode vs. binary in API - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 0.1.0
Fix Version/s: 0.2.0
Component/s: Python
Labels: None

External issue URL:
https://github.com/apache/arrow/issues/15942

Description

pyarrow supports Arrow's String type, which Arrow represents internally as BINARY with a UTF8 annotation.

In Python 2, the pyarrow API accepts both unicode and binary strings (str), where the latter are assumed to be UTF-8 encoded. I find this approach problematic because:

- There is an implicit assumption that a binary str contains valid UTF-8 data. This assumption can be wrong, however, and it is not clear what the consequences of passing such "invalid data" to the API are.
- The UTF-8 assumption is not clearly documented or otherwise visible from the API.
- If pyarrow wants to support pure binary data in the future, a natural choice would be to use str as the Python 2 type. However, this would conflict with the current interpretation of binary str as BINARY+UTF8.
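
To illustrate the first point, here is a minimal sketch in plain Python 3 (no pyarrow involved) showing that a byte string is not necessarily valid UTF-8. The helper name `is_valid_utf8` is made up for this example and is not part of any pyarrow API.

```python
# The same text encoded two ways. Only one of the results is valid UTF-8.
utf8_bytes = "café".encode("utf-8")      # b'caf\xc3\xa9'
latin1_bytes = "café".encode("latin-1")  # b'caf\xe9'


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict decoding raises on malformed input
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(utf8_bytes))
print(is_valid_utf8(latin1_bytes))
```

An API that silently treats every byte string as UTF-8 would have to decide what to do with the second value, and that decision is currently invisible to the caller.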

Proposed solution
I propose changing the API so that it only accepts and returns unicode strings, i.e. Python 2's unicode and Python 3's str. Passing a Python 2 str should raise an exception, as should passing Python 3's bytes.
If raw BINARY is also supported at some point in the future, it should use Python 3's bytes and Python 2's str.

As a convenience for API users, the API may also allow passing UTF-8 encoded binary data as Arrow's String, but that should be an explicit, opt-in choice, so that API users are aware of the encoding assumptions being made.
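
The proposed behaviour could look roughly like the following sketch, written for Python 3 only. The function `coerce_to_text` and its `assume_utf8` flag are hypothetical names chosen for illustration. They do not describe pyarrow's actual API or the fix that was eventually implemented.

```python
def coerce_to_text(value, assume_utf8=False):
    """Return `value` as a unicode string, or raise (hypothetical helper).

    str is accepted as-is. bytes is rejected unless the caller opts in
    with assume_utf8=True, in which case it is decoded strictly as UTF-8.
    """
    if isinstance(value, str):
        return value
    if isinstance(value, bytes):
        if not assume_utf8:
            raise TypeError(
                "expected a unicode string, got bytes; "
                "pass assume_utf8=True to decode the bytes as UTF-8"
            )
        # Strict decoding: malformed input raises UnicodeDecodeError
        # instead of being passed through silently.
        return value.decode("utf-8")
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)
```

With this shape, `coerce_to_text("abc")` succeeds, `coerce_to_text(b"abc")` raises `TypeError`, and `coerce_to_text(b"abc", assume_utf8=True)` decodes the input, so the encoding assumption only applies when the caller asks for it.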

Issue Links

is related to

ARROW-434 Segfaults and encoding issues in Python Parquet reads

Resolved

People

Assignee: Wes McKinney

Reporter: Jochen Ott

Votes: 0

Watchers: 4

Dates

Created: 09/Nov/16 07:27

Updated: 11/Jan/23 07:08

Resolved: 21/Dec/16 08:32