Apache Arrow / ARROW-6114

[Python] Datatypes are not preserved when a pandas dataframe is partitioned and saved as a parquet file using pyarrow

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.14.1
    • Fix Version/s: None
    • Component/s: Python
    • Labels:
    • Environment:
      Python 3.7.3
      pyarrow 0.14.1

      Description

      Data types are not preserved when a pandas DataFrame is written as a partitioned Parquet dataset with pyarrow and then read back. The same DataFrame written without partitioning round-trips with its data types intact.

      Case 1: Saving a partitioned dataset - Data Types are NOT preserved

      # Save a pandas DataFrame locally as a partitioned Parquet dataset using pyarrow
      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq

      df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
      path = 'test'
      partition_cols = ['age']
      print('Datatypes before saving the dataset')
      print(df.dtypes)
      table = pa.Table.from_pandas(df)
      pq.write_to_dataset(table, path, partition_cols=partition_cols, preserve_index=False)

      # Load the partitioned Parquet dataset back from local disk
      df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
      print('\nDatatypes after loading the dataset')
      print(df.dtypes)
      

      Output:

      Datatypes before saving the dataset
      age int64
      name object
      dtype: object
      
      Datatypes after loading the dataset
      name object
      age category
      dtype: object
      
      The output above shows that age is int64 in the original pandas DataFrame but comes back as category after the dataset is saved locally and loaded again. The column order also changes: age moves to the last position.
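      Until this is resolved, one possible workaround is to cast the affected columns back after loading. The following is only a minimal sketch: `restore_dtypes` and `expected` are hypothetical names, not part of pyarrow or pandas, and the `loaded` frame is built by hand to mimic the Case 1 output instead of being read from disk.

```python
import pandas as pd


def restore_dtypes(df, expected):
    """Cast each column back to the dtype recorded before writing.

    `expected` maps column name -> dtype string, in the original column order.
    """
    out = df.copy()
    for column, dtype in expected.items():
        out[column] = out[column].astype(dtype)
    # Selecting by the keys of `expected` also restores the original column order.
    return out[list(expected)]


# Hand-built stand-in for the frame returned in Case 1 (assumption: the
# partition column arrives last, as a categorical with integer categories).
loaded = pd.DataFrame({
    'name': ['agan', 'bbobby', 'test'],
    'age': pd.Categorical([77, 32, 234]),
})

# Dtypes captured from the original DataFrame before it was written.
expected = {'age': 'int64', 'name': 'object'}

fixed = restore_dtypes(loaded, expected)
print(fixed.dtypes)
```

      In practice `expected` could be captured from the original frame before writing, for example with `df.dtypes.astype(str).to_dict()`.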

      Case 2: Non-partitioned dataset - Data types are preserved

      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq

      print('Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow')
      df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
      path = 'test_without_partition'
      print('Datatypes before saving the dataset')
      print(df.dtypes)
      table = pa.Table.from_pandas(df)
      pq.write_to_dataset(table, path, preserve_index=False)

      # Load the non-partitioned Parquet dataset back from local disk
      df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
      print('\nDatatypes after loading the dataset')
      print(df.dtypes)
      
      

      Output:

      Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow
      Datatypes before saving the dataset
      age int64
      name object
      dtype: object
      
      Datatypes after loading the dataset
      age int64
      name object
      dtype: object
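      A likely reason the two cases differ is that a partitioned write stores the partition column's values in `column=value` directory names and leaves that column out of the data files, so the reader has to rebuild the column from path text. The sketch below illustrates only that idea. `parse_partition_keys` is a hypothetical helper, not pyarrow's implementation, and the file name `part-0.parquet` is made up for the example.

```python
from pathlib import PurePosixPath


def parse_partition_keys(path):
    """Return (column, raw_value) pairs found in 'column=value' directory segments."""
    keys = []
    # Every path component except the last one (the file name) is a directory.
    for segment in PurePosixPath(path).parts[:-1]:
        if '=' in segment:
            column, _, raw_value = segment.partition('=')
            keys.append((column, raw_value))
    return keys


# Partitioned layout (Case 1): the value of `age` survives only as path text.
print(parse_partition_keys('test/age=77/part-0.parquet'))

# Non-partitioned layout (Case 2): no partition keys, so every column and its
# type come from the Parquet file itself.
print(parse_partition_keys('test_without_partition/part-0.parquet'))
```

      Because the raw value is a string such as '77', the original int64 type has to be inferred or supplied on read, which is consistent with age coming back as a different type in Case 1.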
      

      Versions

      • Python 3.7.3
      • pyarrow 0.14.1

    People

    • Assignee: Unassigned
    • Reporter: bnriiitb Naga