Details
- Type: Bug
- Status: Resolved
- Priority: Blocker
- Resolution: Fixed
- Fix Version: 7.0.0
Description
In a partitioned dataset whose chunks are larger than the default 2^20-row (1 Mi) batch size, reading only the partition key column hangs and consumes unbounded memory. The bug first appeared in nightly build `7.0.0.dev468`.
In [1]: import pyarrow as pa, pyarrow.parquet as pq, numpy as np

In [2]: pa.__version__
Out[2]: '7.0.0.dev468'

In [3]: table = pa.table({'key': pa.repeat(0, 2 ** 20 + 1), 'value': np.arange(2 ** 20 + 1)})

In [4]: pq.write_to_dataset(table[:2 ** 20], 'one', partition_cols=['key'])

In [5]: pq.write_to_dataset(table[:2 ** 20 + 1], 'two', partition_cols=['key'])

In [6]: pq.read_table('one', columns=['key'])['key'].num_chunks
Out[6]: 1

In [7]: pq.read_table('two', columns=['key', 'value'])['key'].num_chunks
Out[7]: 2

In [8]: pq.read_table('two', columns=['key'])['key'].num_chunks
# hangs, consuming memory until killed
zsh: killed     ipython
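The transcript suggests a possible interim workaround: the keys-only projection (In [8]) hangs, but projecting a data column alongside the key (In [7]) completes, so the key column can be selected after the read. A minimal sketch under that assumption; including `value` and dropping it afterwards is a workaround choice, not part of the fix:

import pyarrow.parquet as pq

# Workaround sketch: reading 'key' together with a data column completes
# (see In [7] above), so project both and select 'key' from the result.
table = pq.read_table('two', columns=['key', 'value'])
keys = table.column('key')   # ChunkedArray of partition keys
print(keys.num_chunks)       # 2, matching Out[7]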