[ARROW-2799] [Python] Add safe option to Table.from_pandas to avoid unsafe casts - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.9.0
Fix Version/s: 0.11.0
Component/s: Python
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/19180

Description

Ported over from https://github.com/apache/arrow/issues/2217

```python
In [8]: import numpy as np
...: import pandas as pd
...: import pyarrow as arw

In [9]: df = pd.DataFrame({'A': list('abc'), 'B': np.arange(3)})
...: df
Out[9]:
A B
0 a 0
1 b 1
2 c 2

In [10]: schema = arw.schema([
...: arw.field('A', arw.string()),
...: arw.field('B', arw.int32()),
...: ])

In [11]: tbl = arw.Table.from_pandas(df, preserve_index=False, schema=schema)
...: tbl
Out[11]:
pyarrow.Table
A: string
B: int32
metadata
--------
{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
            b' "A", "field_name": "A", "pandas_type": "unicode", "numpy_type":'
            b' "object", "metadata": null}, {"name": "B", "field_name": "B", "'
            b'pandas_type": "int32", "numpy_type": "int32", "metadata": null}]'
            b', "pandas_version": "0.23.1"}'}

In [12]: tbl.to_pandas().equals(df)
Out[12]: True
```
So if the `schema` matches the pandas datatypes, all is well: we can round-trip the DataFrame.

Now, say we have some bad data such that column 'B' is now of type float64. The datatypes of the DataFrame no longer match the explicitly supplied `schema` object, but rather than raising a `TypeError`, the data is silently truncated. The round-tripped DataFrame doesn't match our input DataFrame, and not even a warning is raised!
```python
In [13]: df['B'].iloc[0] = 1.23
...: df
Out[13]:
A B
0 a 1.23
1 b 1.00
2 c 2.00

In [14]: # I would expect/want this to raise a TypeError since the schema doesn't match the pandas datatypes
...: tbl = arw.Table.from_pandas(df, preserve_index=False, schema=schema)
...: tbl
Out[14]:
pyarrow.Table
A: string
B: int32
metadata
--------
{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
            b' "A", "field_name": "A", "pandas_type": "unicode", "numpy_type":'
            b' "object", "metadata": null}, {"name": "B", "field_name": "B", "'
            b'pandas_type": "int32", "numpy_type": "float64", "metadata": null'
            b'}], "pandas_version": "0.23.1"}'}

In [15]: tbl.to_pandas() # <-- SILENT TRUNCATION!!!
Out[15]:
A B
0 a 1
1 b 1
2 c 2

```
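To make the failure mode concrete, here is a minimal, library-free sketch of the kind of check a "safe" float-to-int32 conversion could perform. The helper name `checked_float_to_int32` is hypothetical and is not part of pyarrow or pandas; it only illustrates the idea of refusing a lossy cast instead of truncating.

```python
import math

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def checked_float_to_int32(values):
    """Hypothetical helper: convert floats to ints, refusing lossy conversions."""
    out = []
    for v in values:
        # NaN and infinities have no integer representation.
        if not math.isfinite(v):
            raise ValueError(f"float value {v!r} cannot be represented as int32")
        # A fractional part would be dropped by the cast.
        if v != math.trunc(v):
            raise ValueError(f"float value {v!r} would be truncated")
        # Values outside the int32 range would overflow.
        if not INT32_MIN <= v <= INT32_MAX:
            raise ValueError(f"float value {v!r} is out of range for int32")
        out.append(int(v))
    return out


# Integral floats convert cleanly.
clean = checked_float_to_int32([0.0, 1.0, 2.0])

# The column from In [13] is rejected instead of being truncated.
try:
    checked_float_to_int32([1.23, 1.0, 2.0])
    rejected = False
except ValueError:
    rejected = True
```

Applied to column 'B' from `In [13]`, a check like this would fail on `1.23` rather than quietly producing `1`.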

To be clear, I would really like `Table.from_pandas` to raise a `TypeError` if the DataFrame types don't match an explicitly supplied schema and would hope this current behaviour would be considered a bug.

Issue Links

depends upon
- ARROW-1949 [Python/C++] Add option to Array.from_pandas and pyarrow.array to perform unsafe casts (Resolved)
- ARROW-3158 [C++] Handle float truncation during casting (Resolved)

links to
- GitHub Pull Request #2504

People

Assignee: Krisztian Szucs

Reporter: Dave Hirschfeld

Votes: 0

Watchers: 5

Dates

Created: 06/Jul/18 10:16

Updated: 11/Jan/23 07:23

Resolved: 08/Sep/18 16:08

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

1h 50m