[ARROW-9573] [Python] Parquet doesn't load when partitioned column starts with '_' - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.0.0
Fix Version/s: 1.0.1, 2.0.0
Component/s: Python
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/25639

Description

When loading a Parquet dataset whose partition column name starts with an underscore ('_'), nothing is loaded, and no exception is raised. Loading this dataset worked for me in pyarrow 0.17.1, but it no longer works in pyarrow 1.0.0.

On the other hand, a Parquet dataset with a partition column starting with '_' can still be loaded by passing the `use_legacy_dataset` option. Also, when the column that starts with an underscore is not a partition column, loading works as expected.

>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import pandas as pd
>>> df1 = pd.DataFrame(data={'_COL_1': [1, 2], 'COL_2': [3, 4], 'COL_3': [5, 6]})
>>> table1 = pa.Table.from_pandas(df1)
>>> pq.write_to_dataset(table1, partition_cols=['_COL_1', 'COL_2'], root_path='test_parquet1')
>>> df_pq1 = pq.read_table('test_parquet1')
>>> df_pq1
pyarrow.Table
>>> len(df_pq1)
0
>>> df_pq1_legacy = pq.read_table('test_parquet1', use_legacy_dataset=True)
>>> df_pq1_legacy
pyarrow.Table
COL_3: int64
_COL_1: dictionary<values=int64, indices=int32, ordered=0>
COL_2: dictionary<values=int64, indices=int32, ordered=0>
>>> len(df_pq1_legacy)
2
>>> df2 = pd.DataFrame(data={'COL_1': [1, 2], 'COL_2': [3, 4], '_COL_3': [5, 6]})
>>> table2 = pa.Table.from_pandas(df2)
>>> pq.write_to_dataset(table2, partition_cols=['COL_1', 'COL_2'], root_path='test_parquet2')
>>> df_pq2 = pq.read_table('test_parquet2')
>>> df_pq2
pyarrow.Table
_COL_3: int64
COL_1: int32
COL_2: int32
>>> len(df_pq2)
2
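
The reproduction above writes hive-style directories such as `_COL_1=1/COL_2=3` under the dataset root. As a rough diagnostic, and assuming the empty result is tied to partition directory names that begin with an underscore (the report shows the symptom but does not state the cause), a dataset root can be scanned for such directories before deciding whether to pass `use_legacy_dataset=True`. This is only a sketch: the helper names `underscore_partition_dirs` and `make_layout` are hypothetical, and the code uses only the standard library rather than any pyarrow API.

```python
import os
import tempfile


def underscore_partition_dirs(root_path, prefixes=("_", ".")):
    """Return sorted relative paths of hive-style partition directories
    (named ``key=value``) under root_path whose name starts with a prefix.

    The default prefixes are an assumption made for this sketch.
    """
    flagged = []
    for dirpath, dirnames, _filenames in os.walk(root_path):
        for name in dirnames:
            # Only consider directories that look like ``key=value``.
            if "=" in name and name.startswith(prefixes):
                rel = os.path.relpath(os.path.join(dirpath, name), root_path)
                flagged.append(rel.replace(os.sep, "/"))
    return sorted(flagged)


def make_layout(root_path, partition_dirs):
    """Create empty directories that mimic a partitioned dataset layout."""
    for rel in partition_dirs:
        os.makedirs(os.path.join(root_path, *rel.split("/")), exist_ok=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Same directory shape as 'test_parquet1' in the report, without data files.
        make_layout(tmp, ["_COL_1=1/COL_2=3", "_COL_1=2/COL_2=4"])
        print(underscore_partition_dirs(tmp))
```

For a layout shaped like `test_parquet1` the helper reports the two `_COL_1=...` directories, while a layout shaped like `test_parquet2` (partitioned on `COL_1` and `COL_2`) yields an empty list.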

Issue Links

links to

GitHub Pull Request #7900

People

Assignee: Ben Kietzman

Reporter: Tonnam Balankura

Votes: 1

Watchers: 4

Dates

Created: 27/Jul/20 19:00

Updated: 11/Jan/23 08:07

Resolved: 06/Aug/20 16:11

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 0.5h