  1. Apache Arrow
  2. ARROW-5138

[Python/C++] Row group retrieval doesn't restore index properly

    Details

      Description

      When retrieving row groups, the index is no longer restored to its original values; it is always set to a RangeIndex starting at zero. Version 0.12.1 restored an int64 index with the correct index values.

      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq
      print(pa.__version__)
      df = pd.DataFrame(
          {"a": [1, 2, 3, 4]}
      )
      print("total DF")
      print(df.index)
      table = pa.Table.from_pandas(df)
      buf = pa.BufferOutputStream()
      pq.write_table(table, buf, chunk_size=2)
      reader = pa.BufferReader(buf.getvalue().to_pybytes())
      parquet_file = pq.ParquetFile(reader)
      rg = parquet_file.read_row_group(1)
      
      df_restored = rg.to_pandas()
      print("Row group")
      print(df_restored.index)
      

      Previous behavior

      0.12.1
      total DF
      RangeIndex(start=0, stop=4, step=1)
      Row group
      Int64Index([2, 3], dtype='int64')
      

      Behavior now

      0.13.0
      total DF
      RangeIndex(start=0, stop=4, step=1)
      Row group
      RangeIndex(start=0, stop=2, step=1)
      

              People

              • Assignee:
                wesm Wes McKinney
                Reporter:
                fjetter Florian Jetter

                  Time Tracking

                  Estimated:
                  Not Specified
                  Remaining:
                  0h
                  Logged:
                  40m