Apache Arrow / ARROW-9302

[Python] Specifying columns in a dataset drops the index (pandas) metadata.

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: Python

    Description

      I'm not sure if this is a missing feature, or just undocumented, or perhaps not even something I should expect to work.

      Let's start with a multi-index dataframe.

      >>> import pyarrow as pa
      >>> import pyarrow.dataset as ds
      >>> import pyarrow.parquet as pq
      >>>
      >>> df
                     data  id                      when
      letter number
      a      1        0.0  a1 2020-05-05 08:30:01+00:00
      b      2        1.1  b2 2020-05-05 08:30:01+00:00
             3        1.2  b3 2020-05-05 08:30:01+00:00
      c      4        2.1  c4 2020-05-05 08:30:01+00:00
             5        2.2  c5 2020-05-05 08:30:01+00:00
             6        2.3  c6 2020-05-05 08:30:01+00:00
      
      >>> tbl = pa.Table.from_pandas(df)
      >>> tbl
      pyarrow.Table
      data: double
      id: string
      when: timestamp[ns, tz=+00:00]
      letter: string
      number: int64
      >>> tbl.schema
      data: double
      id: string
      when: timestamp[ns, tz=+00:00]
      letter: string
      number: int64
      -- schema metadata --
      pandas: '{"index_columns": ["letter", "number"], "column_indexes": [{"nam' + 783
      

      This of course works as expected, so let's write the table to disk, and read it with a dataset.

      >>> pq.write_table(tbl, "/tmp/df.parquet")
      >>> data = ds.dataset("/tmp/df.parquet")
      >>> data.to_table(filter=ds.field("letter") == "c").to_pandas()
                     data  id                      when
      letter number
      c      4        2.1  c4 2020-05-05 08:30:01+00:00
             5        2.2  c5 2020-05-05 08:30:01+00:00
             6        2.3  c6 2020-05-05 08:30:01+00:00
      

      The filter also works as expected, and the dataframe is reconstructed properly. Let's do it again, but this time with a column selection.

      >>> data.to_table(filter=ds.field("letter") == "c", columns=["data", "id"]).to_pandas()
         data  id
      0   2.1  c4
      1   2.2  c5
      2   2.3  c6
      

      Hmm, not quite what I was thinking, but excluding the indices from the columns seems like a dumb move on my part, so let's try again, and this time include all columns to be safe.

      >>> tbl = data.to_table(filter=ds.field("letter") == "c", columns=["letter", "number", "data", "id", "when"])
      >>> tbl.to_pandas()
        letter  number  data  id                      when
      0      c       4   2.1  c4 2020-05-05 08:30:01+00:00
      1      c       5   2.2  c5 2020-05-05 08:30:01+00:00
      2      c       6   2.3  c6 2020-05-05 08:30:01+00:00
      >>> tbl
      pyarrow.Table
      letter: string
      number: int64
      data: double
      id: string
      when: timestamp[us, tz=UTC]
      >>> tbl.schema
      letter: string
        -- field metadata --
        PARQUET:field_id: '4'
      number: int64
        -- field metadata --
        PARQUET:field_id: '5'
      data: double
        -- field metadata --
        PARQUET:field_id: '1'
      id: string
        -- field metadata --
        PARQUET:field_id: '2'
      when: timestamp[us, tz=UTC]
        -- field metadata --
        PARQUET:field_id: '3'
      

      It seems that whenever I pass columns, even when I list every column, the schema-level pandas metadata is lost along the way (only the field-level PARQUET:field_id entries remain), so to_pandas can't rebuild the original index.

      Here are my relevant versions:

      • arrow-cpp: 0.17.1
      • pyarrow: 0.17.1
      • parquet-cpp: 1.5.1
      • python: 3.7.6
      • thrift-cpp: 0.13.0

          People

            Assignee: Unassigned
            Reporter: Troy Zimmerman (tazimmerman)