Apache Arrow / ARROW-1940

[Python] Extra metadata gets added after multiple conversions between pd.DataFrame and pa.Table


      Description

      We have a unit test that verifies that loading a dataframe from a .parq file and saving it back with no changes produces the same result as the original file. It started failing with pyarrow 0.8.0.
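
      A rough sketch of such a test (with hypothetical file names, not the actual test code) might look like this:

      import pandas as pd

      # Hypothetical file names; the real test compares a re-saved copy
      # against the original Parquet file.
      df = pd.read_parquet("original.parq")
      df.to_parquet("resaved.parq")

      # Expectation: saving the unmodified dataframe reproduces the original
      # file; with pyarrow 0.8.0 the embedded pandas metadata differs, so a
      # comparison like this fails.
      with open("original.parq", "rb") as f1, open("resaved.parq", "rb") as f2:
          assert f1.read() == f2.read()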

      After digging into it, I discovered that after the first conversion from pd.DataFrame to pa.Table, the table contains the following metadata (among other things):

      "column_indexes": [{"metadata": null, "field_name": null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]
      

      However, after converting it to pd.DataFrame and back into a pa.Table for the second time, the metadata gets an encoding field:

      "column_indexes": [{"metadata": {"encoding": "UTF-8"}, "field_name": null, "name": null, "numpy_type": "object", "pandas_type": "unicode"}]
      

      See the attached file for a test case.

      So specifically, it appears that a dataframe -> table -> dataframe -> table round trip produces a different result from a single dataframe -> table conversion, which I think is unexpected.
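
      The round trip can be reproduced along these lines (a minimal sketch, not the attached fail.py; the sample dataframe is arbitrary):

      import pandas as pd
      import pyarrow as pa

      # Arbitrary sample data; the metadata difference does not depend on it.
      df = pd.DataFrame({"a": [1, 2, 3]})

      # First conversion: dataframe -> table
      table1 = pa.Table.from_pandas(df)

      # Round trip: dataframe -> table -> dataframe -> table
      table2 = pa.Table.from_pandas(table1.to_pandas())

      # With pyarrow 0.8.0 the embedded pandas metadata differs: the
      # column_indexes entry of table2 gains {"encoding": "UTF-8"}.
      print(table1.schema.metadata[b"pandas"])
      print(table2.schema.metadata[b"pandas"])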

        Attachments

        fail.py (0.5 kB), attached by Dima Ryazanov



               People

               Assignee: Phillip Cloud (cpcloud)
               Reporter: Dima Ryazanov (dimaryaz)
               Votes: 0
               Watchers: 4
