Apache Arrow / ARROW-6114

[Python] Datatypes not preserved for partition fields in roundtrip to partitioned parquet dataset

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.14.1
    • Fix Version/s: None
    • Component/s: Python
    • Environment: Python 3.7.3, pyarrow 0.14.1

    Description

      Data types of partition columns are not preserved when a pandas DataFrame is written as a partitioned Parquet dataset with pyarrow and then read back. When the same DataFrame is written without partitioning, all data types are preserved.

      Case 1: Saving a partitioned dataset - Data Types are NOT preserved

      # Save a pandas DataFrame locally as a partitioned Parquet dataset using pyarrow
      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq

      df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
      path = 'test'
      partition_cols = ['age']
      print('Datatypes before saving the dataset')
      print(df.dtypes)
      table = pa.Table.from_pandas(df)
      pq.write_to_dataset(table, path, partition_cols=partition_cols, preserve_index=False)

      # Load the partitioned Parquet dataset back from local disk
      df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
      print('\nDatatypes after loading the dataset')
      print(df.dtypes)
      

      Output:

      Datatypes before saving the dataset
      age int64
      name object
      dtype: object
      
      Datatypes after loading the dataset
      name object
      age category
      dtype: object
      
      The output above shows that age is int64 in the original pandas DataFrame but comes back as category after the dataset is written to local disk and loaded again.
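
      A possible explanation, offered as a sketch rather than a description of pyarrow's actual reader: in a hive-style partitioned dataset the partition values live in directory names such as age=77, so they are plain strings on disk and the reader has to reconstruct a type for them. The helper names and the example file name below are hypothetical and only illustrate where the original int64 information can get lost.

```python
from pathlib import PurePosixPath


def parse_partition_keys(file_path):
    """Return {column: raw string value} from hive-style 'key=value' path segments."""
    keys = {}
    for segment in PurePosixPath(file_path).parts:
        if "=" in segment:
            name, _, value = segment.partition("=")
            keys[name] = value
    return keys


def infer_value(raw):
    """Hypothetical inference step: try int, otherwise keep the string."""
    try:
        return int(raw)
    except ValueError:
        return raw


# Hypothetical path of the shape produced for partition_cols=['age'];
# the real file name is generated by pyarrow and will differ.
example_path = "test/age=77/part-0.parquet"
raw_keys = parse_partition_keys(example_path)
typed_keys = {name: infer_value(value) for name, value in raw_keys.items()}

print(raw_keys)    # {'age': '77'}
print(typed_keys)  # {'age': 77}
```

      Whatever type the reader settles on is a guess derived from the directory names, which is consistent with age no longer being reported as int64 in Case 1.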

      Case 2: Non-partitioned dataset - Data types are preserved

      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq

      print('Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow')
      df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
      path = 'test_without_partition'
      print('Datatypes before saving the dataset')
      print(df.dtypes)
      table = pa.Table.from_pandas(df)
      # No partition_cols here, so every column is stored inside the Parquet file
      pq.write_to_dataset(table, path, preserve_index=False)

      # Load the non-partitioned Parquet dataset back from local disk
      df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
      print('\nDatatypes after loading the dataset')
      print(df.dtypes)
      
      

      Output:

      Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow
      Datatypes before saving the dataset
      age int64
      name object
      dtype: object
      
      Datatypes after loading the dataset
      age int64
      name object
      dtype: object
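
      Until partition field types round-trip, one possible workaround is to cast the partition columns back after loading. This is a minimal sketch using only pandas: restore_partition_dtypes is a hypothetical helper, and the loaded frame is a hand-built stand-in for the Case 1 result, not actual pyarrow output.

```python
import pandas as pd


def restore_partition_dtypes(df, dtypes):
    """Return a copy of df with the given columns cast to the given dtypes."""
    restored = df.copy()
    for column, dtype in dtypes.items():
        restored[column] = restored[column].astype(dtype)
    return restored


# Stand-in for the frame loaded in Case 1, where 'age' came back as category.
loaded = pd.DataFrame({
    "name": ["agan", "bbobby", "test"],
    "age": pd.Categorical([77, 32, 234]),
})

# The caller has to supply the original dtypes, since the dataset did not keep them.
fixed = restore_partition_dtypes(loaded, {"age": "int64"})
print(fixed.dtypes)
```

      This relies on the caller knowing the original dtypes, so it works around the problem rather than fixing the round trip.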
      

      Versions

      • Python 3.7.3
      • pyarrow 0.14.1

            People

              Assignee: Unassigned
              Reporter: Naga (bnriiitb)
              Votes: 0
              Watchers: 6