Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Duplicate
Description
As you can see below, as soon as I partition the parquet dataset, my Int32 column is read back as float64. This seems like a bug to me: partitioning shouldn't change the datatype, and I lose all the advantages of the nullable Int32.
import pandas as pd  # 1.0.4
import pyarrow as pa  # 0.17.1
import pyarrow.parquet as pq

x = pd.DataFrame({'a': [1, 2, None, 1], 'b': ['x'] * 4})
x.a = x.a.astype('Int32')
tbl = pa.Table.from_pandas(x)

pq.write_to_dataset(tbl, 'ok')
pq.write_to_dataset(tbl, 'busted', partition_cols=['b'])

print(pd.read_parquet('ok').dtypes['a'])      # Int32
print(pd.read_parquet('busted').dtypes['a'])  # float64
(cross-posted on stackoverflow)
https://stackoverflow.com/questions/62356730/parquet-partitioning-degrades-int32-to-float64
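A possible workaround until the round trip is fixed (a sketch not taken from the report; it assumes the degraded column only contains whole-number floats and NaN, which pandas can cast back to the nullable 'Int32' dtype):

```python
import pandas as pd

# Stand-in for the degraded result of pd.read_parquet('busted'):
# the Int32 column comes back as float64 with NaN for the missing value.
df = pd.DataFrame({'a': [1.0, 2.0, None, 1.0], 'b': ['x'] * 4})

# Re-cast to the nullable integer dtype; pandas converts NaN to pd.NA.
fixed = df.astype({'a': 'Int32'})
print(fixed.dtypes['a'])  # Int32
```

This only restores the dtype after the fact; the parquet files themselves still store the column as float64.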
Issue Links
- is duplicated by ARROW-8251 [Python] pandas.ExtensionDtype does not survive round trip with write_to_dataset (Resolved)