Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
Continuing https://github.com/apache/arrow/issues/3280
===
I'm seeing variants of this elsewhere (e.g., wesm/feather#349): not all pandas tables coerce to Arrow tables, and when coercion fails, it does not fail in a way that is conducive to automation.
Sample:
{{mixed_df = pd.DataFrame({'mixed': [1, 'b']})}}
{{pa.Table.from_pandas(mixed_df)}}
{{=> ArrowInvalid: ('Could not convert b with type str: tried to convert to double', 'Conversion failed for column mixed with type object')}}
I would have expected behavior along the lines of one or more of the following:
- Coerce to string by default, with a default-off option to disallow string coercion
- Provide a default-off option on from_pandas to auto-coerce
- Name the exception so it is clear that this is a column coercion failure, and include the column name(s), so the failure is predictable and clearly handleable by both library writers and users
I lean towards:
- Default to auto-coercion, which improves life for early users: `coerce_mixed_columns_to_strings=True`
- For the less frequent but more advanced library implementors, allow overriding it to `False`
- In that case, raise a predictable, machine-readable exception: `MixedColumnException(mixed_columns=['a', 'b', ...], msg="....")`
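To make the proposal concrete, here is a minimal sketch of the behavior I have in mind, written in plain Python so it can be tried without pandas or pyarrow installed. Nothing here is an existing Arrow API: `MixedColumnException`, `find_mixed_columns`, and `prepare_columns` are hypothetical names, and only `coerce_mixed_columns_to_strings` is taken from the proposal above. The sketch assumes columns arrive as a mapping of column name to a list of values, that missing values are `None`, and that "mixed" means more than one Python type among the non-null values, which is stricter than what Arrow actually rejects.

```python
class MixedColumnException(ValueError):
    """Hypothetical exception that names every column holding mixed types."""

    def __init__(self, mixed_columns, msg=None):
        self.mixed_columns = list(mixed_columns)
        super().__init__(msg or f"Columns with mixed types: {self.mixed_columns}")


def find_mixed_columns(columns):
    """Return names of columns whose non-null values have more than one type.

    Assumption: this is a simplified stand-in for Arrow's real type inference.
    """
    mixed = []
    for name, values in columns.items():
        types = {type(v) for v in values if v is not None}
        if len(types) > 1:
            mixed.append(name)
    return mixed


def prepare_columns(columns, coerce_mixed_columns_to_strings=True):
    """Stringify mixed columns, or raise MixedColumnException if coercion is off."""
    mixed = find_mixed_columns(columns)
    if mixed and not coerce_mixed_columns_to_strings:
        raise MixedColumnException(mixed_columns=mixed)
    prepared = {}
    for name, values in columns.items():
        if name in mixed:
            # Keep nulls as nulls; convert everything else to its string form.
            prepared[name] = [None if v is None else str(v) for v in values]
        else:
            prepared[name] = list(values)
    return prepared


# Default: auto-coerce, so early users get a usable table.
coerced = prepare_columns({'mixed': [1, 'b'], 'ok': [1, 2]})

# Library implementors opt out and get a machine-readable failure instead.
try:
    prepare_columns({'mixed': [1, 'b']}, coerce_mixed_columns_to_strings=False)
except MixedColumnException as exc:
    failed_columns = exc.mixed_columns
```

With pandas, the input could presumably be built with `mixed_df.to_dict('list')` and the result wrapped back in `pd.DataFrame(...)` before calling `pa.Table.from_pandas`. Note that pandas represents missing values as float NaN rather than `None`, so a real implementation would need to handle that case separately.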