[ARROW-3650] [Python] Mixed column indexes are read back as strings - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.11.1
Fix Version/s: 0.14.0
Component/s: Python
Labels:
- parquet
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/19957

Description

Consider the following example:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# One string label and one datetime label in the column index
df = pd.DataFrame(1, index=[pd.to_datetime('2018/01/01')], columns=['a string', pd.to_datetime('2018/01/02')])

# Round-trip the frame through a Parquet file
table = pa.Table.from_pandas(df)
pq.write_table(table, 'test.parquet')

ref_df = pq.read_pandas('test.parquet').to_pandas()

print(df.columns)
# Index(['a string', 2018-01-02 00:00:00], dtype='object')
print(ref_df.columns)
# Index(['a string', '2018-01-02 00:00:00'], dtype='object')

The serialized data frame has a column index containing both a string and a datetime label (this arose when resetting the index of a formerly datetime-only column).
When the file is read back, the datetime label has been converted into a string.

When looking at the schema I find {{"pandas_type": "mixed", "numpy_type": "object"}} before serializing and {{"pandas_type": "unicode", "numpy_type": "object"}} after reading back. So the schema was aware of the mixed type but did not store the actual types.

The same happens with other types like numbers as well. One can produce interesting situations:

pd.DataFrame(1, index=[pd.to_datetime('2018/01/01')], columns=['1', 1]) can be written but fails to be read back, because the column index is no longer unique: '1' shows up twice.
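The collision can be reproduced without Parquet at all. This is a small sketch that assumes the conversion of labels is equivalent to calling str() on each one, which is what the output in the example above suggests; it is not taken from the pyarrow source.

```python
import pandas as pd

df = pd.DataFrame(1, index=[pd.to_datetime('2018/01/01')], columns=['1', 1])

# The original labels are distinct objects: the string '1' and the integer 1.
original_unique = df.columns.is_unique

# Assumed conversion: every label is turned into its string form.
stringified = pd.Index([str(label) for label in df.columns])

print(original_unique)
print(stringified.tolist())
print(stringified.is_unique)
```

Under that assumption the two labels become indistinguishable once stringified, which matches the duplicate '1' reported on read.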

If this is not a bug but expected behavior, maybe the user should be warned that information is lost, for example with a NotImplemented exception?
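Until the library itself warns, a check on the user side is one way to catch the problem before writing. The sketch below is only an illustration: find_non_string_labels and require_string_columns are hypothetical helper names, not part of pandas or pyarrow, and raising TypeError is an arbitrary choice.

```python
import pandas as pd


def find_non_string_labels(df):
    """Return the column labels of df that are not plain strings."""
    return [label for label in df.columns if not isinstance(label, str)]


def require_string_columns(df):
    """Raise TypeError if any column label would lose its type on write."""
    bad = find_non_string_labels(df)
    if bad:
        raise TypeError(
            "column labels that are not strings would not round-trip: "
            + repr(bad)
        )
    return df


mixed = pd.DataFrame(1, index=[0], columns=['a string', pd.to_datetime('2018/01/02')])
clean = pd.DataFrame(1, index=[0], columns=['a', 'b'])

print(find_non_string_labels(mixed))
print(find_non_string_labels(clean))
```

Calling require_string_columns(df) just before pa.Table.from_pandas(df) would turn the silent conversion into an explicit error in user code.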

Issue Links

links to

GitHub Pull Request #4244

People

Assignee: Joris Van den Bossche

Reporter: Armin Berres

Votes: 0

Watchers: 3

Dates

Created: 30/Oct/18 08:44

Updated: 11/Jan/23 07:28

Resolved: 11/Jun/19 17:59

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 2h 10m