  1. Apache Arrow
  2. ARROW-10122

[Python] Selecting one column of multi-index results in a duplicated value column.

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.0.1
    • Fix Version/s: 3.0.0
    • Component/s: Python
    • Environment: arrow 1.0.1
      parquet 1.5.1
      pandas 1.1.0
      pyarrow 1.0.1

    Description

      When I read only one column of a multi-index, that column is duplicated as a value column in the resulting pandas DataFrame.

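      >>> import numpy as np
      >>> import pyarrow as pa
      >>> import pyarrow.dataset as ds
      >>> import pyarrow.parquet as pq
      >>> table = pa.table({"first": list(range(5)), "second": list(range(5)), "value": np.arange(5)})
      >>> df = table.to_pandas().set_index(["first", "second"])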
      >>> print(df)
                    value
      first second
      0     0           0
      1     1           1
      2     2           2
      3     3           3
      4     4           4
      >>> pq.write_table(pa.Table.from_pandas(df), "/tmp/test.parquet")
      >>> data = ds.dataset("/tmp/test.parquet")
      

      This works as expected, as does selecting all or no columns.

      >>> print(data.to_table(columns=["first", "second", "value"]).to_pandas())
                    value
      first second
      0     0           0
      1     1           1
      2     2           2
      3     3           3
      4     4           4
      

      This does not work as expected: the column "first" ends up as both the index and a value column.

      >>> print(data.to_table(columns=["first", "value"]).to_pandas())
             first  value
      first
      0          0      0
      1          1      1
      2          2      2
      3          3      3
      4          4      4

      This is easy to work around by specifying the full multi-index in to_table, but does this behavior make sense?

            People

              Assignee: jorisvandenbossche Joris Van den Bossche
              Reporter: tazimmerman Troy Zimmerman

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 40m