Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- Affects Version: 0.13.0
Description
When retrieving row groups, the index is no longer properly restored to its initial values; it is always set to a RangeIndex starting at zero. Version 0.12.1 restored an Int64Index with the correct index values.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

print(pa.__version__)
df = pd.DataFrame({"a": [1, 2, 3, 4]})
print("total DF")
print(df.index)

table = pa.Table.from_pandas(df)
buf = pa.BufferOutputStream()
pq.write_table(table, buf, chunk_size=2)

reader = pa.BufferReader(buf.getvalue().to_pybytes())
parquet_file = pq.ParquetFile(reader)
rg = parquet_file.read_row_group(1)
df_restored = rg.to_pandas()
print("Row group")
print(df_restored.index)
Previous behavior
0.12.1
total DF
RangeIndex(start=0, stop=4, step=1)
Row group
Int64Index([2, 3], dtype='int64')
Behavior now
0.13.0
total DF
RangeIndex(start=0, stop=4, step=1)
Row group
RangeIndex(start=0, stop=2, step=1)
Issue Links
- is related to: ARROW-5427 [Python] RangeIndex serialization change implications (Resolved)