[ARROW-5427] [Python] RangeIndex serialization change implications - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 0.13.0
Fix Version/s: 0.14.0
Component/s: Python
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/21880

Description

In 0.13, the conversion of a pandas DataFrame's RangeIndex changed: it is no longer serialized as an actual column in the Arrow table, but is only recorded in the table's pandas metadata (ARROW-1639).

This change led to a couple of issues:

- It can be hard to predict in pandas whether a DataFrame has a RangeIndex or not, which means that the resulting schema in Arrow can be somewhat unexpected. See ARROW-5104: an empty DataFrame has a RangeIndex or not depending on how it was created.
- The metadata is not always enough (or is not updated) to reconstruct the index when the table has been modified or subsetted.
  - For example, ARROW-5138: retrieving a single row group from a Parquet file doesn't restore the index properly (since the RangeIndex metadata describes the full table, not this subset).
  - Another example is ARROW-5139: an empty column selection no longer restores the index.

I think we should decide whether we want to try to fix those (or give an option to avoid them), or close them as "won't fix".

One idea I had that could potentially alleviate some of those issues:

- Make it possible for the user to still force actual serialization of the index, always, even if it is a RangeIndex.
- To avoid introducing a new option, we could reuse the preserve_index keyword: change the default to None (which means the current behaviour), and change True to mean "always serialize" (although this is not fully backwards compatible with 0.13.0 for users who explicitly specified the keyword).

I am not sure this is worth the added complexity (although I personally like providing the option where the index is simply always serialized as columns, without surprises). But ideally we decide on it for 0.14, to either fix or close the mentioned issues.

Issue Links

relates to

- ARROW-5139 [Python/C++] Empty column selection no longer restores index (Open)
- ARROW-5730 [Python][CI] Selectively skip test cases in the dask integration test (Resolved)
- ARROW-5138 [Python/C++] Row group retrieval doesn't restore index properly (Resolved)
- ARROW-5104 [Python/C++] Schema for empty tables include index column as integer (Closed)

links to

GitHub Pull Request #4651

People

Assignee: Joris Van den Bossche

Reporter: Joris Van den Bossche

Votes: 0

Watchers: 5

Dates

Created: 27/May/19 19:13

Updated: 11/Jan/23 07:40

Resolved: 27/Jun/19 07:15

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

1h 10m