  1. Apache Arrow
  2. ARROW-1940

[Python] Extra metadata gets added after multiple conversions between pd.DataFrame and pa.Table

Details

    Description

      We have a unit test that verifies that loading a dataframe from a .parq file and saving it back with no changes produces the same result as the original file. It started failing with pyarrow 0.8.0.

      After digging into it, I discovered that after the first conversion from pd.DataFrame to pa.Table, the table contains the following metadata (among other things):

      "column_indexes": [{"metadata": null, "field_name": null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]
      

      However, after converting the table to a pd.DataFrame and back into a pa.Table a second time, the entry gains an encoding field and its pandas_type changes from bytes to unicode:

      "column_indexes": [{"metadata": {"encoding": "UTF-8"}, "field_name": null, "name": null, "numpy_type": "object", "pandas_type": "unicode"}]
      

      See the attached file for a test case.
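
      To make the difference between the two column_indexes entries quoted above explicit, here is a minimal sketch that parses both with the standard library and lists the fields that changed. The helper name changed_fields is hypothetical and is not part of pyarrow or of the attached fail.py.

```python
import json

# The two "column_indexes" entries quoted in the description, copied verbatim.
first = json.loads(
    '{"metadata": null, "field_name": null, "name": null, '
    '"numpy_type": "object", "pandas_type": "bytes"}'
)
second = json.loads(
    '{"metadata": {"encoding": "UTF-8"}, "field_name": null, "name": null, '
    '"numpy_type": "object", "pandas_type": "unicode"}'
)


def changed_fields(before, after):
    """Return {key: (before_value, after_value)} for keys whose values differ."""
    keys = sorted(set(before) | set(after))
    return {
        key: (before.get(key), after.get(key))
        for key in keys
        if before.get(key) != after.get(key)
    }


diff = changed_fields(first, second)
for key, (old, new) in diff.items():
    print(f"{key}: {old!r} -> {new!r}")
# metadata: None -> {'encoding': 'UTF-8'}
# pandas_type: 'bytes' -> 'unicode'
```

      Only metadata and pandas_type differ; field_name, name, and numpy_type are the same in both entries.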

      Specifically, a dataframe->table->dataframe->table conversion produces a different result from a single dataframe->table conversion, which I think is unexpected.
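
      The conversion chain described above can be sketched roughly as follows. This is not the attached fail.py: the DataFrame contents and both function names are assumptions made for illustration, and the result depends on the installed pyarrow and pandas versions (the report concerns pyarrow 0.8.0). It relies only on pa.Table.from_pandas, Table.to_pandas, and the pandas entry in the schema metadata.

```python
import json


def column_indexes(schema_metadata):
    """Extract the "column_indexes" list from a schema's pandas metadata.

    `schema_metadata` is a bytes-to-bytes mapping such as the value of
    `table.schema.metadata`; returns None if no pandas metadata is present.
    """
    if not schema_metadata or b"pandas" not in schema_metadata:
        return None
    pandas_meta = json.loads(schema_metadata[b"pandas"].decode("utf-8"))
    return pandas_meta.get("column_indexes")


def roundtrip_column_indexes():
    """Return column_indexes after one and after two DataFrame -> Table conversions."""
    # Imported lazily so the helper above can be used without pandas/pyarrow.
    import pandas as pd
    import pyarrow as pa

    df = pd.DataFrame({"a": [1, 2, 3]})  # hypothetical data, not taken from fail.py
    first = pa.Table.from_pandas(df)                    # dataframe -> table
    second = pa.Table.from_pandas(first.to_pandas())    # -> dataframe -> table
    return (
        column_indexes(first.schema.metadata),
        column_indexes(second.schema.metadata),
    )
```

      Calling roundtrip_column_indexes() and comparing the two returned values shows whether the second conversion changed the column_indexes entry in a given environment.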

      Attachments

        1. fail.py
          0.5 kB
          Dima Ryazanov

            People

              cpcloud Phillip Cloud
              dimaryaz Dima Ryazanov