Apache Arrow / ARROW-9134

[Python] Parquet partitioning degrades Int32 to float64

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • None
    • 1.0.0
    • None
    • None

    Description

      As shown below, as soon as I partition the Parquet dataset, my Int32 column is read back as float64. This looks like a bug to me: partitioning shouldn't change the data type, and I lose all the advantages of the nullable integer type.

       

      import pandas as pd # 1.0.4
      import pyarrow as pa # 0.17.1
      import pyarrow.parquet as pq
      
      x = pd.DataFrame({'a':[1, 2, None, 1], 'b':['x']*4})
      x.a = x.a.astype('Int32')
      tbl = pa.Table.from_pandas(x)
      pq.write_to_dataset(tbl, 'ok')
      pq.write_to_dataset(tbl, 'busted', partition_cols=['b'])
      
      print(pd.read_parquet('ok').dtypes['a'])  # Int32
      print(pd.read_parquet('busted').dtypes['a'])  # float64
      

       

      (cross-posted on stackoverflow) 

      https://stackoverflow.com/questions/62356730/parquet-partitioning-degrades-int32-to-float64
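
      A possible stopgap, not confirmed anywhere in this issue: once the column has come back as float64, it can be cast to the nullable dtype again on the pandas side. The sketch below is an assumption-laden illustration. It uses a hand-built DataFrame as a stand-in for the result of reading the partitioned dataset and does not touch pyarrow, so it does nothing about the underlying behaviour.

```python
import numpy as np
import pandas as pd

# Stand-in (assumption) for the frame read back from the partitioned
# dataset: the nullable integers arrive as float64, with NaN for the
# missing value.
degraded = pd.DataFrame({'a': [1.0, 2.0, np.nan, 1.0], 'b': ['x'] * 4})

# Cast the column back to the nullable integer dtype; NaN becomes <NA>.
restored = degraded.astype({'a': 'Int32'})

print(restored.dtypes['a'])  # Int32
```

      The cast only succeeds when every non-missing value is a whole number, which holds for a column that was an integer column before it was written.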

       

            People

              Assignee: jorisvandenbossche Joris Van den Bossche
              Reporter: npalko Nicholas Palko