Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
0.13.0
Description
When reading a partitioned dataset whose partition column contains string values with underscores, pyarrow appears to drop the underscores from the resulting values.
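For example, if I write and then read a dataset as follows:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq  # needed for write_to_dataset / ParquetDataset

df = pd.DataFrame({
    "year_week": ["2019_2", "2019_3"],
    "value": [1, 2]
})
table = pa.Table.from_pandas(df.head())
# Write one directory per distinct value of the partition column
pq.write_to_dataset(table, 'test', partition_cols=["year_week"])
# Read the partitioned dataset back
table2 = pq.ParquetDataset('test').read()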
The resulting 'year_week' column in table2 has lost the underscores:
table2[1] # Gives:
<Column name='year_week' type=DictionaryType(dictionary<values=int64, indices=int32, ordered=0>)>
[
  -- dictionary:
    [
      20192,
      20193
    ]
  -- indices:
    [
      0
    ],
  -- dictionary:
    [
      20192,
      20193
    ]
  -- indices:
    [
      1
    ]
]
Is this intentional behaviour or is this a bug in arrow?
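One possible explanation, which this report does not confirm, is that the partition values are being type-inferred by trying `int()` on each directory-name value. Since Python 3.6, `int()` accepts underscores as digit separators, so `"2019_2"` parses as `20192`, which would match the int64 dictionary shown above. The sketch below illustrates only that Python behaviour. Both `infer_partition_value_naive` and `infer_partition_value_strict` are hypothetical helpers written for this illustration and are not pyarrow functions.

```python
import re


def infer_partition_value_naive(raw):
    """Hypothetical inference: try int() first, fall back to the raw string."""
    try:
        # int() accepts "_" between digits (PEP 515), so "2019_2" -> 20192
        return int(raw)
    except ValueError:
        return raw


# Only an optional sign followed by ASCII digits counts as an integer here
_STRICT_INT = re.compile(r"[+-]?[0-9]+")


def infer_partition_value_strict(raw):
    """Hypothetical stricter inference: underscores keep the value a string."""
    if _STRICT_INT.fullmatch(raw):
        return int(raw)
    return raw


values = ["2019_2", "2019_3"]
naive = [infer_partition_value_naive(v) for v in values]
strict = [infer_partition_value_strict(v) for v in values]

print(naive)   # [20192, 20193]
print(strict)  # ['2019_2', '2019_3']
```

If this is indeed the cause, any int-first inference would need a stricter check like the second helper to keep such values as strings.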
Issue Links
- depends upon
  ARROW-8039 [Python][Dataset] Support using dataset API in pyarrow.parquet with a minimal ParquetDataset shim (Resolved)
- relates to
  ARROW-6114 [Python] Datatypes not preserved for partition fields in roundtrip to partitioned parquet dataset (Open)