Apache Arrow / ARROW-15318

[C++][Python] Regression reading partition keys of large batches.


    Description

      In a partitioned dataset whose fragments are larger than the default 1Mi (2 ** 20 rows) batch size, reading only the partition key columns hangs and consumes unbounded memory. The regression first appeared in nightly build `7.0.0.dev468`.

      In [1]: import pyarrow as pa, pyarrow.parquet as pq, numpy as np
      
      In [2]: pa.__version__
      Out[2]: '7.0.0.dev468'
      
      In [3]: table = pa.table({'key': pa.repeat(0, 2 ** 20 + 1), 'value': np.arange(2 ** 20 + 1)})
      
      In [4]: pq.write_to_dataset(table[:2 ** 20], 'one', partition_cols=['key'])
      
      In [5]: pq.write_to_dataset(table[:2 ** 20 + 1], 'two', partition_cols=['key'])
      
      In [6]: pq.read_table('one', columns=['key'])['key'].num_chunks
      Out[6]: 1
      
      In [7]: pq.read_table('two', columns=['key', 'value'])['key'].num_chunks
      Out[7]: 2
      
      In [8]: pq.read_table('two', columns=['key'])['key'].num_chunks
      zsh: killed     ipython   # hangs; killed
      
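      For comparison, the same column selection can be expressed through the pyarrow.dataset scanner with an explicit batch_size. The sketch below is only an experiment against the Hive-style 'two' layout written above; it is not verified whether pinning the batch size changes the behaviour.

      import pyarrow.dataset as ds

      # Open the partitioned dataset written above; write_to_dataset produces
      # a Hive-style key=<value> directory layout.
      dataset = ds.dataset('two', partitioning='hive')

      # Scan only the partition key column, setting batch_size to the default
      # 2 ** 20 rows that the description identifies as the threshold.
      scanner = dataset.scanner(columns=['key'], batch_size=2 ** 20)
      print(scanner.to_table()['key'].num_chunks)
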

            People

              Assignee: Weston Pace (westonpace)
              Reporter: A. Coady (coady)

            Time Tracking

              Estimated: Not Specified
              Remaining: 0h
              Logged: 3h