Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version: 0.16.0
- Environment: pandas 1.0.1, parquet 0.16
Description
write_to_dataset with pandas columns using a pandas.ExtensionDtype (nullable int or string) produces a Parquet file which, when read back in, has different dtypes than the original DataFrame:

```python
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

parquet_dataset = 'partquet_dataset/'
parquet_file = 'test.parquet'

df = pd.DataFrame([{'str_col': 'abc', 'int_col': 1, 'part': 1},
                   {'str_col': np.nan, 'int_col': np.nan, 'part': 1}])
df['str_col'] = df['str_col'].astype(pd.StringDtype())
df['int_col'] = df['int_col'].astype(pd.Int64Dtype())

table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, root_path=parquet_dataset, partition_cols=['part'])
pq.write_table(table, where=parquet_file)
```
write_table handles the schema correctly, and the pandas.ExtensionDtypes survive the round trip:

```python
>>> pq.read_table(parquet_file).to_pandas().dtypes
str_col    string
int_col     Int64
part        int64
```
However, write_to_dataset reverts back to object/float:
```python
>>> pq.read_table(parquet_dataset).to_pandas().dtypes
str_col      object
int_col     float64
part       category
```
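Until the underlying bug is fixed, one workaround (a sketch, assuming the intended dtypes are known up front; the sample data here is made up to mirror the degraded result) is to cast the affected columns back to their nullable extension dtypes after reading:

```python
import numpy as np
import pandas as pd

# Simulate what the partitioned round trip currently returns:
# object for the string column, float64 for the nullable int column.
restored = pd.DataFrame({'str_col': ['abc', np.nan],
                         'int_col': [1.0, np.nan]})

# Cast back to the intended pandas extension dtypes;
# NaN becomes pd.NA in both columns.
restored = restored.astype({'str_col': pd.StringDtype(),
                            'int_col': pd.Int64Dtype()})
```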
I have also tried writing common metadata at the top-level directory of the partitioned dataset and then passing that metadata to read_table, but the results are the same as without metadata:

```python
pq.write_metadata(table.schema, parquet_dataset + '_common_metadata', version='2.0')
meta = pq.read_metadata(parquet_dataset + '_common_metadata')
pq.read_table(parquet_dataset, metadata=meta).to_pandas().dtypes
```
This also affects pandas to_parquet when partition_cols is specified:
```python
>>> df.to_parquet(path=parquet_dataset, partition_cols=['part'])
>>> pd.read_parquet(parquet_dataset).dtypes
str_col      object
int_col     float64
part       category
```
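When the original dtypes are not known in advance, pandas' convert_dtypes() (available since pandas 1.0) can infer nullable extension dtypes from the degraded columns. A minimal sketch, assuming the usual inference rules apply (the sample data is made up):

```python
import numpy as np
import pandas as pd

# Columns as they come back from the partitioned dataset:
# object and float64 instead of string and Int64.
df = pd.DataFrame({'str_col': ['abc', np.nan],
                   'int_col': [1.0, np.nan]})

# convert_dtypes infers string / Int64 where the values permit,
# turning NaN into pd.NA along the way.
converted = df.convert_dtypes()
```

Note that this is inference, not restoration: a float column with genuinely fractional values would stay a float, so it is not a full substitute for carrying the original schema.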
Issue Links
- duplicates: ARROW-9134 [Python] Parquet partitioning degrades Int32 to float64 (Closed)