Apache Arrow / ARROW-8251

[Python] pandas.ExtensionDtype does not survive round trip with write_to_dataset

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.16.0
    • Fix Version/s: 1.0.0
    • Component/s: Python
    • Environment: pandas 1.0.1, parquet 0.16

    Description

      When a DataFrame has columns that use a pandas.ExtensionDtype (nullable int or string), write_to_dataset produces Parquet files that, when read back, have different dtypes than the original df.

      import numpy as np
      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq

      parquet_dataset = 'partquet_dataset/'
      parquet_file = 'test.parquet'

      df = pd.DataFrame([{'str_col':'abc','int_col':1,'part':1}, {'str_col':np.nan,'int_col':np.nan,'part':1}])
      df['str_col'] = df['str_col'].astype(pd.StringDtype())
      df['int_col'] = df['int_col'].astype(pd.Int64Dtype())

      table = pa.Table.from_pandas(df)

      pq.write_to_dataset(table, root_path=parquet_dataset, partition_cols=['part'])
      pq.write_table(table, where=parquet_file)

      write_table handles the schema correctly, and the pandas.ExtensionDtype columns survive the round trip:

      pq.read_table(parquet_file).to_pandas().dtypes 
      str_col string 
      int_col Int64 
      part int64 

      However, after write_to_dataset the columns revert to object/float64, and the partition column becomes category:

      pq.read_table(parquet_dataset).to_pandas().dtypes 
      str_col object 
      int_col float64 
      part category 

      I have also tried writing common metadata at the top-level directory of the partitioned dataset and then passing that metadata to read_table, but the results are the same as without metadata:

      pq.write_metadata(table.schema, parquet_dataset+'_common_metadata', version='2.0')
      meta = pq.read_metadata(parquet_dataset+'_common_metadata')
      pq.read_table(parquet_dataset, metadata=meta).to_pandas().dtypes

      This also affects pandas to_parquet when partition_cols is specified:

      df.to_parquet(path=parquet_dataset, partition_cols=['part'])
      pd.read_parquet(parquet_dataset).dtypes
      str_col object 
      int_col float64 
      part category 
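
      The dtype comparison above can also be made programmatically. The following is only a minimal sketch: the helper name dtype_changes is hypothetical (not part of pandas or pyarrow), and the two frames are hand-built stand-ins that mirror the dtypes reported above rather than frames actually read back from Parquet.

```python
import pandas as pd


def dtype_changes(original: pd.DataFrame, roundtripped: pd.DataFrame) -> dict:
    """Return {column: (dtype before, dtype after)} for columns whose dtype differs.

    Hypothetical helper for illustration; compares dtype names as strings.
    """
    changes = {}
    for col in original.columns:
        before = str(original[col].dtype)
        # A column that is missing after the round trip is reported as None.
        after = str(roundtripped[col].dtype) if col in roundtripped.columns else None
        if before != after:
            changes[col] = (before, after)
    return changes


# Stand-in for the original frame with extension dtypes (no Parquet I/O here).
df = pd.DataFrame({
    "str_col": pd.array(["abc", None], dtype="string"),
    "int_col": pd.array([1, None], dtype="Int64"),
    "part": [1, 1],
})

# Stand-in for the frame as reported after write_to_dataset and read_table.
degraded = pd.DataFrame({
    "str_col": pd.Series(["abc", None], dtype="object"),
    "int_col": pd.Series([1.0, float("nan")], dtype="float64"),
    "part": pd.Series([1, 1], dtype="category"),
})

changes = dtype_changes(df, degraded)
print(changes)
```

      In a real check, the second argument would be the result of pq.read_table(parquet_dataset).to_pandas(); an empty dict would then mean every column kept its dtype.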

       

            People

              Assignee: Joris Van den Bossche (jorisvandenbossche)
              Reporter: Ged Steponavicius (Ged.Steponavicius)

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 20m