[ARROW-10122] [Python] Selecting one column of multi-index results in a duplicated value column. - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.0.1
Fix Version/s: 3.0.0
Component/s: Python
Labels:
- pull-request-available
Environment:
arrow 1.0.1
parquet 1.5.1
pandas 1.1.0
pyarrow 1.0.1

External issue URL:
https://github.com/apache/arrow/issues/26134

Description

When I read only one column of a multi-index, that column appears both as the index and as a duplicated value column in the resulting pandas DataFrame.

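>>> import numpy as np
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pyarrow.parquet as pq
>>> table = pa.table({"first": list(range(5)), "second": list(range(5)), "value": np.arange(5)})
>>> df = table.to_pandas().set_index(["first", "second"])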
>>> print(df)
              value
first second
0     0           0
1     1           1
2     2           2
3     3           3
4     4           4
>>> pq.write_table(pa.Table.from_pandas(df), "/tmp/test.parquet")
>>> data = ds.dataset("/tmp/test.parquet")

This works as expected, as does selecting all or no columns.

>>> print(data.to_table(columns=["first", "second", "value"]).to_pandas())
              value
first second
0     0           0
1     1           1
2     2           2
3     3           3
4     4           4

This does not work as expected: the "first" column ends up as both the index and a value column.

>>> print(data.to_table(columns=["first", "value"]).to_pandas())
       first  value
first
0          0      0
1          1      1
2          2      2
3          3      3
4          4      4
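
One way to detect this situation in a script is to check whether any index level name also appears among the column names. The sketch below uses plain pandas. The helper index_names_also_in_columns is hypothetical, and the frame is built by hand to mimic the output above rather than produced by pyarrow.

```python
import pandas as pd


def index_names_also_in_columns(df):
    """Return the index level names that also appear as column names."""
    return [name for name in df.index.names if name is not None and name in df.columns]


# Hand-built frame with the same shape as the unexpected result above.
duplicated = pd.DataFrame(
    {"first": range(5), "value": range(5)},
    index=pd.Index(range(5), name="first"),
)
# The same frame without the duplicated value column.
clean = duplicated.drop(columns="first")

print(index_names_also_in_columns(duplicated))
print(index_names_also_in_columns(clean))
```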

This is easy to work around by specifying the full multi-index in to_table, but does this behavior make sense?

Issue Links

links to GitHub Pull Request #8469

People

Assignee: Joris Van den Bossche
Reporter: Troy Zimmerman
Votes: 0
Watchers: 3

Dates

Created: 28/Sep/20 18:33
Updated: 11/Jan/23 08:11
Resolved: 19/Nov/20 08:41

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 40m