Apache Arrow / ARROW-3650

[Python] Mixed column indexes are read back as strings


Details

    Description

      Consider the following example:

      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq

      # Column index mixes a string label with a datetime label
      df = pd.DataFrame(1, index=[pd.to_datetime('2018/01/01')], columns=['a string', pd.to_datetime('2018/01/02')])

      # Round-trip the frame through Parquet
      table = pa.Table.from_pandas(df)
      pq.write_table(table, 'test.parquet')

      ref_df = pq.read_pandas('test.parquet').to_pandas()

      print(df.columns)
      # Index(['a string', 2018-01-02 00:00:00], dtype='object')
      print(ref_df.columns)
      # Index(['a string', '2018-01-02 00:00:00'], dtype='object')
      

      The serialized data frame has a column index containing both a string and a datetime (this arose when resetting the index of a formerly datetime-only column).
      When the data frame is read back, the datetime label has been converted to a string.

      When looking at the schema I find {{"pandas_type": "mixed", "numpy_type": "object"}} before serializing and {{"pandas_type": "unicode", "numpy_type": "object"}} after reading back. So the schema was aware of the mixed type but did not store the actual types.

      The same happens with other types like numbers as well. One can produce interesting situations:

      pd.DataFrame(1, index=[pd.to_datetime('2018/01/01')], columns=['1', 1]) can be written, but reading it back fails because the column index is no longer unique: '1' appears twice.

      If this is expected behavior rather than a bug, perhaps the user should be warned that information is lost, for example by raising a NotImplementedError?
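Until such a warning exists, a caller could run a check of their own before writing. The sketch below is not part of pyarrow; `find_label_problems` and `check_column_labels` are hypothetical helper names, and the choice of exception types is an assumption. It flags column labels of mixed Python types and labels whose string forms collide, the two situations described above.

```python
from collections import Counter
from datetime import datetime


def find_label_problems(labels):
    """Return (type_names, collisions) for an iterable of column labels.

    type_names: sorted names of the Python types found among the labels.
    collisions: string forms shared by more than one label.
    """
    labels = list(labels)
    type_names = sorted({type(label).__name__ for label in labels})
    counts = Counter(str(label) for label in labels)
    collisions = sorted(text for text, count in counts.items() if count > 1)
    return type_names, collisions


def check_column_labels(labels):
    """Hypothetical guard: raise if stringifying the labels would lose information."""
    type_names, collisions = find_label_problems(labels)
    if collisions:
        raise ValueError(f"labels collide once converted to strings: {collisions}")
    if len(type_names) > 1:
        raise NotImplementedError(f"mixed column label types: {type_names}")


# Plain lists stand in for df.columns so the sketch needs no third-party packages.
print(find_label_problems(["a string", datetime(2018, 1, 2)]))
print(find_label_problems(["1", 1]))
```

With a real data frame, the same check could be called as `check_column_labels(df.columns)` just before `pa.Table.from_pandas(df)`.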

            People

              jorisvandenbossche Joris Van den Bossche
              aberres Armin Berres

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2h 10m