Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
I'm not sure if this is a missing feature, or just undocumented, or perhaps not even something I should expect to work.
Let's start with a multi-index dataframe.
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pyarrow.parquet as pq
>>>
>>> df
               data  id                      when
letter number
a      1        0.0  a1 2020-05-05 08:30:01+00:00
b      2        1.1  b2 2020-05-05 08:30:01+00:00
       3        1.2  b3 2020-05-05 08:30:01+00:00
c      4        2.1  c4 2020-05-05 08:30:01+00:00
       5        2.2  c5 2020-05-05 08:30:01+00:00
       6        2.3  c6 2020-05-05 08:30:01+00:00
>>> tbl = pa.Table.from_pandas(df)
>>> tbl
pyarrow.Table
data: double
id: string
when: timestamp[ns, tz=+00:00]
letter: string
number: int64
>>> tbl.schema
data: double
id: string
when: timestamp[ns, tz=+00:00]
letter: string
number: int64
-- schema metadata --
pandas: '{"index_columns": ["letter", "number"], "column_indexes": [{"nam' + 783
This of course works as expected, so let's write the table to disk, and read it with a dataset.
>>> pq.write_table(tbl, "/tmp/df.parquet")
>>> data = ds.dataset("/tmp/df.parquet")
>>> data.to_table(filter=ds.field("letter") == "c").to_pandas()
               data  id                      when
letter number
c      4        2.1  c4 2020-05-05 08:30:01+00:00
       5        2.2  c5 2020-05-05 08:30:01+00:00
       6        2.3  c6 2020-05-05 08:30:01+00:00
The filter also works as expected, and the dataframe is reconstructed properly. Let's do it again, but this time with a column selection.
>>> data.to_table(filter=ds.field("letter") == "c", columns=["data", "id"]).to_pandas()
   data  id
0   2.1  c4
1   2.2  c5
2   2.3  c6
Hmm, not quite what I was expecting, but excluding the index columns from the selection was a dumb move on my part, so let's try again, this time including all columns to be safe.
>>> tbl = data.to_table(filter=ds.field("letter") == "c", columns=["letter", "number", "data", "id", "when"])
>>> tbl.to_pandas()
  letter  number  data  id                      when
0      c       4   2.1  c4 2020-05-05 08:30:01+00:00
1      c       5   2.2  c5 2020-05-05 08:30:01+00:00
2      c       6   2.3  c6 2020-05-05 08:30:01+00:00
>>> tbl
pyarrow.Table
letter: string
number: int64
data: double
id: string
when: timestamp[us, tz=UTC]
>>> tbl.schema
letter: string
  -- field metadata --
  PARQUET:field_id: '4'
number: int64
  -- field metadata --
  PARQUET:field_id: '5'
data: double
  -- field metadata --
  PARQUET:field_id: '1'
id: string
  -- field metadata --
  PARQUET:field_id: '2'
when: timestamp[us, tz=UTC]
  -- field metadata --
  PARQUET:field_id: '3'
It seems that when I specify any columns at all (even all of them), the pandas schema metadata is lost along the way, so to_pandas no longer reconstructs the dataframe to match the original; note also that the when column comes back as timestamp[us, tz=UTC] instead of timestamp[ns, tz=+00:00].
Here are the relevant versions:
- arrow-cpp: 0.17.1
- pyarrow: 0.17.1
- parquet-cpp: 1.5.1
- python: 3.7.6
- thrift-cpp: 0.13.0