ARROW-16294: [C++] Improve performance of parquet readahead

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 8.0.0
    • Component/s: C++

    Description

      In 7.0.0, the parquet readahead would read up to 256 row groups at once, which meant that a slow consumer would almost certainly cause us to run out of memory.

      ARROW-15410 improved readahead as a whole and, in the process, changed the parquet reader so that it always reads exactly 1 row group in advance.

      This is not always ideal in S3 scenarios: if the row groups are small, we may want to read many of them in advance. To fix this, we should keep issuing row group reads in parallel until at least batch_size * batch_readahead rows are being fetched.
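
The proposed rule can be sketched as a small stand-alone function. This is only an illustration of the stopping condition, not the code in the Arrow scanner: `RowGroupsToFetch`, `row_group_rows`, `next_row_group`, and `rows_in_flight` are hypothetical names, and the sketch assumes the row count of every row group is known up front (for example, from the file metadata).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not an Arrow API): decide how many additional row
// groups to start reading so that at least batch_size * batch_readahead rows
// are being fetched.
//
// row_group_rows  - number of rows in each row group of the file
// next_row_group  - index of the first row group that has not been requested
// rows_in_flight  - rows already requested but not yet delivered downstream
std::size_t RowGroupsToFetch(const std::vector<int64_t>& row_group_rows,
                             std::size_t next_row_group,
                             int64_t rows_in_flight,
                             int64_t batch_size,
                             int64_t batch_readahead) {
  const int64_t target_rows = batch_size * batch_readahead;
  std::size_t to_fetch = 0;
  // Keep adding row groups until the target is met or the file is exhausted.
  while (rows_in_flight < target_rows &&
         next_row_group + to_fetch < row_group_rows.size()) {
    rows_in_flight += row_group_rows[next_row_group + to_fetch];
    ++to_fetch;
  }
  return to_fetch;
}
```

With small row groups this requests several reads in parallel, while a single large row group already satisfies the target on its own, so memory use stays bounded by roughly the target plus one row group.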

            People

              Assignee: Weston Pace (westonpace)
              Reporter: Weston Pace (westonpace)
                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2h 10m