pyarrow supports arrow's String type, which arrow represents internally as BINARY with a UTF8 annotation.
In python 2, the pyarrow API accepts both unicode and binary strings (str), where the latter are assumed to be utf-8 encoded. I find this approach problematic, because:
- there is an implicit assumption that a binary str contains valid utf-8 data. This assumption can be wrong, however, and it's not clear what the consequences of passing such "invalid data" to the API are.
- the utf-8 assumption is not clearly documented or otherwise visible from the API
- if pyarrow wants to support pure binary data in the future, a natural choice would be to use str as the python2 type. However, this would conflict with the current interpretation of binary str as BINARY+UTF8
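To illustrate the first point, here is a minimal sketch (plain python3, no pyarrow involved) showing that a binary string is not necessarily valid utf-8. The helper name `is_valid_utf8` is made up for this example and is not part of any API:

```python
def is_valid_utf8(data):
    """Return True if the given bytes object decodes cleanly as utf-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Properly encoded text passes the check.
good = u"caf\u00e9".encode("utf-8")

# The byte 0xff never occurs in well-formed utf-8, so this is
# "invalid data" in the sense described above.
bad = b"caf\xff"

print(is_valid_utf8(good))
print(is_valid_utf8(bad))
```

An API that silently treats every binary str as utf-8 has to decide what to do with a value like `bad`, and that decision is currently invisible to the caller.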
I propose to change the API so that it only accepts or returns unicode strings, i.e. python2's unicode and python3's str. Passing a python2 str should raise an exception, same for python3's bytes.
If raw BINARY is also supported at some point in the future, it should use python3's bytes and python2's str.
As a convenience feature for API users, the API may also allow passing utf-8 encoded binary data as arrow's String, but that should be an explicit, opt-in choice, so that API users are aware of the (encoding) assumptions made.
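A rough sketch of the proposed behaviour, written for python3 only (on python2 the text type would be unicode instead of str). The function `coerce_to_text` and its `assume_utf8` flag are hypothetical names used for illustration; they are not existing pyarrow API:

```python
def coerce_to_text(value, assume_utf8=False):
    """Hypothetical input check for values destined for arrow's String type.

    Text is accepted as-is. Binary data is rejected unless the caller
    explicitly opts in to the utf-8 assumption.
    """
    if isinstance(value, str):
        return value
    if isinstance(value, bytes):
        if not assume_utf8:
            raise TypeError(
                "expected a unicode string, got bytes; "
                "pass assume_utf8=True to decode the data as utf-8"
            )
        # With the opt-in, invalid data fails loudly with
        # UnicodeDecodeError instead of being stored unchecked.
        return value.decode("utf-8")
    raise TypeError("expected a unicode string, got %s" % type(value).__name__)


# Default: only text is accepted.
print(coerce_to_text(u"abc"))

# Explicit opt-in: the caller states the encoding assumption.
print(coerce_to_text(b"abc", assume_utf8=True))
```

With this shape, the encoding assumption shows up at the call site, and bytes stays free to mean raw BINARY later.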