[ARROW-15318] [C++][Python] Regression reading partition keys of large batches. - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 7.0.0
Fix Version/s: 7.0.0
Component/s: C++, Python
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/30806

Description

In a partitioned dataset with chunks larger than the default batch size of 1Mi (2 ** 20) rows, reading only the partition keys hangs and consumes unbounded memory. The bug first appeared in nightly build `7.0.0.dev468`.

In [1]: import pyarrow as pa, pyarrow.parquet as pq, numpy as np

In [2]: pa.__version__
Out[2]: '7.0.0.dev468'

In [3]: table = pa.table({'key': pa.repeat(0, 2 ** 20 + 1), 'value': np.arange(2 ** 20 + 1)})

In [4]: pq.write_to_dataset(table[:2 ** 20], 'one', partition_cols=['key'])

In [5]: pq.write_to_dataset(table[:2 ** 20 + 1], 'two', partition_cols=['key'])

In [6]: pq.read_table('one', columns=['key'])['key'].num_chunks
Out[6]: 1

In [7]: pq.read_table('two', columns=['key', 'value'])['key'].num_chunks
Out[7]: 2

In [8]: pq.read_table('two', columns=['key'])['key'].num_chunks
zsh: killed     ipython # hangs; killed
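
Because the failing call hangs and eventually takes the interactive session down with it, it may be easier to run the last step in a separate interpreter with a time limit. The following is a minimal sketch of that idea, not part of the original report: the helper name `run_with_timeout`, the `REPRO` snippet, and the 60-second default are illustrative assumptions, and the snippet assumes the `two` dataset written in the session above already exists in the working directory.

```python
import subprocess
import sys

# Assumption: mirrors step In [8] of the session above and expects the
# 'two' dataset to have been written beforehand.
REPRO = """
import pyarrow.parquet as pq
print(pq.read_table('two', columns=['key'])['key'].num_chunks)
"""


def run_with_timeout(snippet, timeout=60.0):
    """Run `snippet` in a fresh interpreter and return (status, stdout).

    status is 'ok', 'error' (non-zero exit code), or 'timeout'.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", snippet],
            capture_output=True,
            text=True,
            timeout=timeout,  # the child is killed once this expires
        )
    except subprocess.TimeoutExpired:
        return "timeout", ""
    status = "ok" if proc.returncode == 0 else "error"
    return status, proc.stdout.strip()


# Hypothetical usage: a 'timeout' status would be consistent with the hang
# described above, while ('ok', '2') would match the chunk count in Out[7].
# print(run_with_timeout(REPRO))
```

Note that the timeout only bounds wall-clock time; it does not cap the memory the child process can allocate before it is killed.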

Issue Links

links to

GitHub Pull Request #12147

People

Assignee: Weston Pace

Reporter: A. Coady

Votes: 0

Watchers: 3

Dates

Created: 13/Jan/22 05:09

Updated: 11/Jan/23 11:36

Resolved: 13/Jan/22 20:45