Apache Arrow / ARROW-374

Python: clarify unicode vs. binary in API

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.1.0
    • Fix Version/s: 0.2.0
    • Component/s: Python
    • Labels: None

    Description

      pyarrow supports Arrow's String type, which Arrow represents internally as BINARY with a UTF8 annotation.

      In Python 2, the pyarrow API accepts both unicode strings and binary strings (str), where the latter are assumed to be UTF-8 encoded. I find this approach problematic because:

      • there is an implicit assumption that a binary str contains valid UTF-8 data. This assumption can be wrong, and it is not clear what the consequences of passing such "invalid data" to the API are.
      • the UTF-8 assumption is not clearly documented or otherwise visible from the API
      • if pyarrow wants to support pure binary data in the future, a natural choice would be to use str as the Python 2 type. However, this would conflict with the current interpretation of a binary str as BINARY+UTF8
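
      The first point can be illustrated with a small sketch in plain Python 3 (no pyarrow involved). It shows that a byte string carries no encoding information and may fail to decode as UTF-8. The helper name is_valid_utf8 is hypothetical and only serves the illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same text encoded in two different ways; both results are plain bytes.
utf8_bytes = "café".encode("utf-8")      # b'caf\xc3\xa9'
latin1_bytes = "café".encode("latin-1")  # b'caf\xe9'

print(is_valid_utf8(utf8_bytes))    # these bytes are valid UTF-8
print(is_valid_utf8(latin1_bytes))  # these bytes are not valid UTF-8
```

      An API that silently treats every byte string as UTF-8 has to decide what happens in the second case, and that decision is exactly what the current API leaves unstated.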

      Proposed solution
      I propose to change the API so that it only accepts and returns unicode strings, i.e. Python 2's unicode and Python 3's str. Passing a Python 2 str should raise an exception, and the same goes for Python 3's bytes.
      If raw BINARY is also supported at some point in the future, use Python 3's bytes and Python 2's str for it.
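
      A minimal sketch of the proposed strict behaviour, written for Python 3 where str is the unicode type and bytes is the binary type. The function check_text is hypothetical and is not part of pyarrow; it only shows the kind of type check the proposal implies.

```python
def check_text(value):
    """Hypothetical helper: accept unicode text only, reject binary data."""
    if isinstance(value, str):
        return value
    if isinstance(value, (bytes, bytearray)):
        # Binary input is rejected instead of being silently treated as UTF-8.
        raise TypeError("expected unicode text, got binary data; decode it first")
    raise TypeError("expected unicode text, got " + type(value).__name__)


print(check_text("hello"))

try:
    check_text(b"hello")
except TypeError as exc:
    print("rejected:", exc)
```

      With a check like this, the caller has to decode binary data explicitly, so the choice of encoding is visible in the calling code.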

      As a convenience feature, the API may also allow passing UTF-8 encoded binary data as Arrow's String, but that should be an explicit, opt-in choice, so that API users are aware of the encoding assumptions being made.
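
      One way such an opt-in could look, again as a Python 3 sketch. The function to_text and its assume_utf8 parameter are hypothetical names chosen for this illustration, not an existing pyarrow interface.

```python
def to_text(value, assume_utf8=False):
    """Hypothetical helper: bytes are accepted only when assume_utf8=True."""
    if isinstance(value, str):
        return value
    if isinstance(value, bytes):
        if not assume_utf8:
            raise TypeError("binary data passed without assume_utf8=True")
        # Strict decoding: invalid UTF-8 raises UnicodeDecodeError
        # instead of being passed through unchecked.
        return value.decode("utf-8")
    raise TypeError("expected str or bytes, got " + type(value).__name__)


# The caller states the encoding assumption explicitly.
print(to_text(b"caf\xc3\xa9", assume_utf8=True))
```

      Decoding strictly at the API boundary also answers the question of what happens with invalid data: the call fails with an error instead of storing bytes that are not valid UTF-8.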

            People

              Assignee: wesm Wes McKinney
              Reporter: jott Jochen Ott
              Votes: 0
              Watchers: 4
