[ARROW-9134] [Python] Parquet partitioning degrades Int32 to float64 - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: None
Fix Version/s: 1.0.0
Component/s: None
Labels: None

External issue URL: https://github.com/apache/arrow/issues/25245

Description

As you can see below, as soon as I partition the Parquet dataset, my Int32 column is read back as float64. This seems like a bug to me: partitioning shouldn't change the data type, and I lose all the advantages of the nullable integer type.

import pandas as pd # 1.0.4
import pyarrow as pa # 0.17.1
import pyarrow.parquet as pq

# Column 'a' holds a missing value, so it needs the nullable Int32 dtype.
x = pd.DataFrame({'a': [1, 2, None, 1], 'b': ['x'] * 4})
x['a'] = x['a'].astype('Int32')
tbl = pa.Table.from_pandas(x)

# Write the same table twice: once unpartitioned, once partitioned on 'b'.
pq.write_to_dataset(tbl, 'ok')
pq.write_to_dataset(tbl, 'busted', partition_cols=['b'])

print(pd.read_parquet('ok').dtypes['a'])  # Int32
print(pd.read_parquet('busted').dtypes['a'])  # float64

Cross-posted on Stack Overflow:

https://stackoverflow.com/questions/62356730/parquet-partitioning-degrades-int32-to-float64
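
For anyone stuck on an affected version, one possible workaround is to cast the degraded column back to the nullable dtype after reading. The sketch below is not the upstream fix (the issue lists 1.0.0 as the fix version); it only illustrates the cast. The helper name `restore_nullable_ints` is hypothetical, and the sketch assumes the float64 column contains only whole numbers and NaN. To keep it self-contained, the degraded frame is built by hand rather than read from the 'busted' dataset.

```python
import pandas as pd


def restore_nullable_ints(df, columns, dtype="Int32"):
    """Cast the given float64 columns back to a nullable integer dtype.

    Hypothetical helper; assumes each column holds only whole numbers or NaN.
    """
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype(dtype)
    return out


# Stand-in for the frame returned by pd.read_parquet('busted'):
# column 'a' has come back as float64 with NaN for the missing value.
degraded = pd.DataFrame({"a": [1.0, 2.0, None, 1.0], "b": ["x"] * 4})

restored = restore_nullable_ints(degraded, ["a"])
print(restored.dtypes["a"])
```

The cast has to be repeated on every read and relies on knowing in advance which columns were nullable integers, so it is a stopgap rather than a substitute for a version where the dtype survives the round trip.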

Issue Links

is duplicated by: ARROW-8251 [Python] pandas.ExtensionDtype does not survive round trip with write_to_dataset (Resolved)

People

Assignee: Joris Van den Bossche

Reporter: Nicholas Palko

Votes: 0

Watchers: 6

Dates

Created: 15/Jun/20 13:01

Updated: 11/Jan/23 08:04

Resolved: 08/Jul/20 08:17