Details
- Bug
- Status: Resolved
- Minor
- Resolution: Fixed
- 1.0.1
- arrow 1.0.1, parquet 1.5.1, pandas 1.1.0, pyarrow 1.0.1
Description
When I read one column of a multi-index, that column is duplicated as a value column in the resulting Pandas data frame.
>>> import numpy as np
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pyarrow.parquet as pq
>>> tbl = pa.table({"first": list(range(5)), "second": list(range(5)), "value": np.arange(5)})
>>> df = tbl.to_pandas().set_index(["first", "second"])
>>> print(df)
              value
first second
0     0           0
1     1           1
2     2           2
3     3           3
4     4           4
>>> pq.write_table(pa.Table.from_pandas(df), "/tmp/test.parquet")
>>> data = ds.dataset("/tmp/test.parquet")
This works as expected, as does selecting all or no columns.
>>> print(data.to_table(columns=["first", "second", "value"]).to_pandas())
              value
first second
0     0           0
1     1           1
2     2           2
3     3           3
4     4           4
This does not work as expected, as the first column is both an index and a value.
>>> print(data.to_table(columns=["first", "value"]).to_pandas())
       first  value
first
0          0      0
1          1      1
2          2      2
3          3      3
4          4      4
This is easy to work around by specifying the full multi-index in to_table, but does this behavior make sense?