[ARROW-8251] [Python] pandas.ExtensionDtype does not survive round trip with write_to_dataset - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.16.0
Fix Version/s: 1.0.0
Component/s: Python
Labels:
- pull-request-available
Environment:
pandas 1.0.1
parquet 0.16

External issue URL:
https://github.com/apache/arrow/issues/24447

Description

write_to_dataset, when given a table whose pandas columns use a pandas.ExtensionDtype (nullable Int64 or string), produces a Parquet dataset that, when read back, has different dtypes than the original DataFrame:

import numpy as np  # needed for np.nan below
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
parquet_dataset = 'partquet_dataset/'
parquet_file = 'test.parquet'

df = pd.DataFrame([{'str_col':'abc','int_col':1,'part':1}, {'str_col':np.nan,'int_col':np.nan,'part':1}])
df['str_col'] = df['str_col'].astype(pd.StringDtype())
df['int_col'] = df['int_col'].astype(pd.Int64Dtype())

table = pa.Table.from_pandas(df)

# write the same table as a partitioned dataset and as a single file
pq.write_to_dataset(table, root_path=parquet_dataset, partition_cols=['part'])
pq.write_table(table, where=parquet_file)

write_table handles the schema correctly, and the pandas.ExtensionDtype columns survive the round trip:

pq.read_table(parquet_file).to_pandas().dtypes 
str_col string 
int_col Int64 
part int64

However, write_to_dataset reverts back to object/float:

pq.read_table(parquet_dataset).to_pandas().dtypes 
str_col object 
int_col float64 
part category
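The difference between the two outputs above can be made explicit with a small comparison helper. This is only an illustrative sketch: `dtype_changes` is a hypothetical name, not part of pandas or pyarrow, and the dictionaries below are copied by hand from the dtypes shown in this report (the original DataFrame is assumed to have the same dtypes as the `write_table` result).

```python
def dtype_changes(before, after):
    """Return {column: (dtype_before, dtype_after)} for columns whose dtype differs.

    Both arguments may be any mapping of column name -> dtype (a plain dict,
    or a pandas `df.dtypes` Series); dtypes are compared by their string form.
    """
    before = {col: str(dt) for col, dt in before.items()}
    after = {col: str(dt) for col, dt in after.items()}
    return {
        col: (before[col], after.get(col))
        for col in before
        if before[col] != after.get(col)
    }


# dtypes transcribed from the outputs in this report (assumed, not re-measured)
original = {"str_col": "string", "int_col": "Int64", "part": "int64"}
from_file = {"str_col": "string", "int_col": "Int64", "part": "int64"}
from_dataset = {"str_col": "object", "int_col": "float64", "part": "category"}

print(dtype_changes(original, from_file))      # no changes for write_table
print(dtype_changes(original, from_dataset))   # every column changed
```

With real data, the same helper could be called as `dtype_changes(df.dtypes, pq.read_table(parquet_dataset).to_pandas().dtypes)`.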

I have also tried writing common metadata at the top-level directory of the partitioned dataset and then passing that metadata to read_table, but the resulting dtypes are the same as without metadata:

pq.write_metadata(table.schema, parquet_dataset+'_common_metadata', version='2.0')
meta = pq.read_metadata(parquet_dataset+'_common_metadata')
pq.read_table(parquet_dataset,metadata=meta).to_pandas().dtypes

This also affects pandas to_parquet when partition_cols is specified:

df.to_parquet(path=parquet_dataset, partition_cols=['part'])
pd.read_parquet(parquet_dataset).dtypes
str_col object
int_col float64
part category
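On affected versions, one possible stopgap is to cast the columns back after reading. The sketch below is not the fix applied for this issue; it only illustrates the idea with pandas alone. `restore_dtypes` and `ORIGINAL_DTYPES` are hypothetical names, and the `degraded` frame is built by hand to mimic the dtypes reported above rather than being read from a real dataset.

```python
import numpy as np
import pandas as pd

# Intended dtypes, matching the DataFrame built in this report (assumption:
# the caller knows them, since the dataset itself no longer carries them).
ORIGINAL_DTYPES = {
    "str_col": pd.StringDtype(),
    "int_col": pd.Int64Dtype(),
    "part": "int64",
}


def restore_dtypes(df, dtypes):
    """Cast the columns present in `df` back to the dtypes given in `dtypes`."""
    present = {col: dt for col, dt in dtypes.items() if col in df.columns}
    return df.astype(present)


# Hand-built stand-in for what reading the partitioned dataset returned:
# object / float64 / category instead of string / Int64 / int64.
degraded = pd.DataFrame({
    "str_col": pd.Series(["abc", None], dtype=object),
    "int_col": pd.Series([1.0, np.nan], dtype="float64"),
    "part": pd.Categorical([1, 1]),
})

restored = restore_dtypes(degraded, ORIGINAL_DTYPES)
print(restored.dtypes)
```

This relies on the float column holding only whole numbers and missing values; a float column with fractional values cannot be cast to Int64 this way.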

Issue Links

duplicates

ARROW-9134 [Python] Parquet partitioning degrades Int32 to float64

Closed

links to

GitHub Pull Request #7054

People

Assignee: Joris Van den Bossche

Reporter: Ged Steponavicius

Votes: 0

Watchers: 4

Dates

Created: 28/Mar/20 10:19

Updated: 11/Jan/23 07:59

Resolved: 29/Apr/20 02:43

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 20m