[ARROW-7782] [Python] Losing index information when using write_to_dataset with partition_cols - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.0.0
Component/s: Python
Labels: None
Environment: pyarrow==0.15.1
Flags: Important
External issue URL: https://github.com/apache/arrow/issues/24016

Description

The DataFrame index is not saved when using pyarrow.parquet.write_to_dataset() with the partition_cols argument. The following minimal example reproduces the issue:

 
from pathlib import Path

import pandas as pd
from pyarrow import Table
from pyarrow.parquet import write_to_dataset, read_table

path = Path('/home/user/trials')
file_name = 'local_database.parquet'

# DataFrame with a named index ('idx') that should survive the round trip
df = pd.DataFrame({"A": [1, 2, 3], "B": ['a', 'a', 'b']},
                  index=pd.Index(['a', 'b', 'c'], name='idx'))

# Write the table as a dataset partitioned on column 'B'
table = Table.from_pandas(df)
write_to_dataset(table, str(path / file_name), partition_cols=['B'])

# Read the dataset back; the 'idx' index is lost in the result
df_read = read_table(str(path / file_name))
df_read.to_pandas()

The issue is rather important for pandas and dask users.

People

Assignee: Joris Van den Bossche
Reporter: Ludwik Bielczynski
Votes: 0
Watchers: 4

Dates

Created: 06/Feb/20 15:07
Updated: 11/Jan/23 07:55
Resolved: 29/Apr/20 02:43