Apache Arrow / ARROW-4131

[Python] Coerce mixed columns to String

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 0.15.0
    • Component/s: Python
    • Labels:
      None

      Description

      Continuing https://github.com/apache/arrow/issues/3280 

       

      ===

       

      I'm seeing variants of this elsewhere (e.g., wesm/feather#349).

      Not all pandas DataFrames convert to Arrow tables, and when conversion fails, it does not fail in a way that is conducive to automation.

      Sample:

      {{mixed_df = pd.DataFrame({'mixed': [1, 'b']})}}
      {{pa.Table.from_pandas(mixed_df)}}
      {{=> ArrowInvalid: ('Could not convert b with type str: tried to convert to double', 'Conversion failed for column mixed with type object')}}
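To illustrate why this is awkward to automate, here is a minimal sketch of what a caller has to do with the error above to recover the offending column name. It assumes only the message layout shown in the sample; the helper name `failed_column` and the regular expression are hypothetical, and the wording of the message may differ between pyarrow versions, which is exactly the fragility being described.

```python
import re

# Assumption: the second element of the error args looks like the sample above,
# i.e. "Conversion failed for column <name> with type <dtype>".
_COLUMN_RE = re.compile(
    r"Conversion failed for column (?P<name>.+) with type (?P<dtype>\S+)$"
)


def failed_column(error_args):
    """Return the column name found in the error args, or None if none matches.

    Hypothetical helper: it scrapes free-form message text, so it breaks
    as soon as the wording of the message changes.
    """
    for arg in error_args:
        match = _COLUMN_RE.search(str(arg))
        if match:
            return match.group("name")
    return None


# The args tuple copied from the sample error in this issue.
sample_args = (
    "Could not convert b with type str: tried to convert to double",
    "Conversion failed for column mixed with type object",
)

print(failed_column(sample_args))
```

A named exception that carries the column names as an attribute would make this kind of string matching unnecessary.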

      I would have expected behaviors more like the following:

      • Coerce mixed columns to string by default, with a default-off option to disallow string coercion
      • Alternatively, provide a default-off option on from_pandas to auto-coerce
      • Name the exception so it is clear that this is a column coercion failure, and include the column name(s), making this predictable and clearly handleable by both library writers & users

      I lean towards:

      • Auto-coerce by default (`coerce_mixed_columns_to_strings=True`), which improves life for early users
      • Allow the less common but more advanced library implementors to override this to `False`
      • When it is `False`, raise a predictable, machine-readable exception, `MixedColumnException(mixed_columns=['a', 'b', ...], msg="....")`
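The behavior proposed above can be approximated today in user code. The following is a rough sketch only, not pyarrow's API: `MixedColumnException`, `find_mixed_columns`, and `prepare_for_arrow` are hypothetical names, and the flag is implemented here as a parameter of a wrapper function rather than of `from_pandas`. It treats a column as mixed when its non-null values have more than one Python type.

```python
class MixedColumnException(ValueError):
    """Hypothetical exception from this proposal; it does not exist in pyarrow."""

    def __init__(self, mixed_columns, msg=""):
        self.mixed_columns = list(mixed_columns)
        super().__init__(msg or f"Mixed-type columns: {self.mixed_columns}")


def find_mixed_columns(columns):
    """Given a mapping of column name -> list of values, return the names of
    columns whose non-null values have more than one Python type."""
    mixed = []
    for name, values in columns.items():
        # Skip None and NaN (NaN is the only common value where v != v).
        kinds = {type(v) for v in values if v is not None and v == v}
        if len(kinds) > 1:
            mixed.append(name)
    return mixed


def prepare_for_arrow(df, coerce_mixed_columns_to_strings=True):
    """Return a pandas DataFrame intended to be passed to pa.Table.from_pandas.

    Assumption: df is a pandas DataFrame; only object-dtype columns are checked.
    """
    object_cols = {c: df[c].tolist() for c in df.columns if df[c].dtype == object}
    mixed = find_mixed_columns(object_cols)
    if not mixed:
        return df
    if not coerce_mixed_columns_to_strings:
        raise MixedColumnException(mixed)
    out = df.copy()
    for c in mixed:
        # Note: astype(str) also turns nulls into the strings 'None' / 'nan'.
        out[c] = out[c].astype(str)
    return out
```

With the sample frame from this issue, `pa.Table.from_pandas(prepare_for_arrow(mixed_df))` would receive a column holding the strings '1' and 'b', while `prepare_for_arrow(mixed_df, coerce_mixed_columns_to_strings=False)` would raise an exception whose `mixed_columns` attribute is `['mixed']`. Building this into `from_pandas` itself would avoid the extra pass over the data that this wrapper makes.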

            People

            • Assignee: Unassigned
            • Reporter: lmeyerov (Leo Meyerovich)