Apache Arrow / ARROW-15318

[C++][Python] Regression reading partition keys of large batches.


Details

    Description

      In a partitioned dataset with chunks larger than the default 1Mi (2 ** 20 rows) batch size, reading only the partition keys hangs and consumes unbounded memory. Reading the key together with a data column still returns normally. The bug first appeared in nightly build `7.0.0.dev468`.

      In [1]: import pyarrow as pa, pyarrow.parquet as pq, numpy as np
      
      In [2]: pa.__version__
      Out[2]: '7.0.0.dev468'
      
      In [3]: table = pa.table({'key': pa.repeat(0, 2 ** 20 + 1), 'value': np.arange(2 ** 20 + 1)})
      
      In [4]: pq.write_to_dataset(table[:2 ** 20], 'one', partition_cols=['key'])
      
      In [5]: pq.write_to_dataset(table[:2 ** 20 + 1], 'two', partition_cols=['key'])
      
      In [6]: pq.read_table('one', columns=['key'])['key'].num_chunks
      Out[6]: 1
      
      In [7]: pq.read_table('two', columns=['key', 'value'])['key'].num_chunks
      Out[7]: 2
      
      In [8]: pq.read_table('two', columns=['key'])['key'].num_chunks
      zsh: killed     ipython # hangs; killed
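
The chunk counts in the session above line up with simple batch arithmetic: 2 ** 20 rows fit in one batch, while 2 ** 20 + 1 rows spill into a second. The sketch below only illustrates that threshold. It assumes a single file per partition and a fixed batch size of 2 ** 20 rows (as stated in the description). `expected_chunks` and `DEFAULT_BATCH_ROWS` are hypothetical names for illustration, not part of the pyarrow API, and this is not how the scanner is actually implemented.

```python
# Assumed default batch size (1Mi rows), taken from the issue description.
DEFAULT_BATCH_ROWS = 2 ** 20


def expected_chunks(num_rows: int, batch_rows: int = DEFAULT_BATCH_ROWS) -> int:
    """Batches needed for num_rows when each batch holds at most batch_rows rows."""
    # Integer ceiling division avoids any floating-point rounding.
    return -(-num_rows // batch_rows)


print(expected_chunks(2 ** 20))      # dataset 'one' -> 1
print(expected_chunks(2 ** 20 + 1))  # dataset 'two' -> 2
```

Under these assumptions, dataset `two` is the smallest case that crosses the batch boundary, which matches the point at which the keys-only read stops returning.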
      


          People

            westonpace Weston Pace
            coady A. Coady
            Votes: 0
            Watchers: 2


              Time Tracking

              Estimated: Not Specified
              Remaining: 0h
              Logged: 3h
