[ARROW-5666] [Python] Underscores in partition (string) values are dropped when reading dataset - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.13.0
Fix Version/s: 1.0.0
Component/s: Python
Labels:
- dataset-parquet-read
- parquet

External issue URL:
https://github.com/apache/arrow/issues/22099

Description

When reading a partitioned dataset whose partition column contains string values with underscores, pyarrow appears to drop the underscores from the resulting values.

For example, if I write and then read a dataset as follows:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    "year_week": ["2019_2", "2019_3"],
    "value": [1, 2]
})

table = pa.Table.from_pandas(df.head())
# Write a dataset partitioned on the string column
pq.write_to_dataset(table, 'test', partition_cols=["year_week"])

# Read the partitioned dataset back
table2 = pq.ParquetDataset('test').read()

The resulting 'year_week' column in table2 has lost the underscores, and its values come back as int64 rather than strings:

table2[1] # Gives:

<Column name='year_week' type=DictionaryType(dictionary<values=int64, indices=int32, ordered=0>)>
[

  -- dictionary:
    [
      20192,
      20193
    ]
  -- indices:
    [
      0
    ],

  -- dictionary:
    [
      20192,
      20193
    ]
  -- indices:
    [
      1
    ]
]

Is this intentional behaviour, or is this a bug in Arrow?
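
The output above is consistent with the partition values being type-inferred from the directory names (e.g. year_week=2019_2) using Python-style integer parsing, since int() accepts single underscores between digits (PEP 515, Python 3.6+). The report does not confirm that this is the actual cause; the sketch below only illustrates how such an inference step would produce 20192. The helpers infer_partition_value and parse_partition_dir are hypothetical names for illustration, not pyarrow functions.

```python
# Hypothetical helpers illustrating value-based type inference for a
# hive-style partition directory such as "year_week=2019_2".
# These are NOT pyarrow APIs; they only mimic an assumed "try int first" strategy.

def infer_partition_value(raw: str):
    """Return an int if the text parses as one, otherwise the raw string."""
    try:
        # int() ignores single underscores between digits (PEP 515),
        # so "2019_2" parses as 20192 instead of raising ValueError.
        return int(raw)
    except ValueError:
        return raw


def parse_partition_dir(dirname: str):
    """Split a 'key=value' directory name and infer the value's type."""
    key, _, raw = dirname.partition("=")
    return key, infer_partition_value(raw)


print(parse_partition_dir("year_week=2019_2"))   # ('year_week', 20192)
print(parse_partition_dir("year_week=2019-W2"))  # ('year_week', '2019-W2')
```

Under this assumption, a value such as "2019-W2" would survive as a string because it is not a valid integer literal, whereas "2019_2" would not.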

Issue Links

depends upon: ARROW-8039 [Python][Dataset] Support using dataset API in pyarrow.parquet with a minimal ParquetDataset shim (Resolved)
relates to: ARROW-6114 [Python] Datatypes not preserved for partition fields in roundtrip to partitioned parquet dataset (Open)

People

Assignee: Joris Van den Bossche
Reporter: Julian de Ruiter
Votes: 0
Watchers: 5

Dates

Created: 20/Jun/19 11:48
Updated: 11/Jan/23 07:41
Resolved: 04/May/20 23:43