Apache Arrow / ARROW-14648

[C++][Dataset] Change scanner readahead limits to be based on bytes instead of number of batches


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: C++

    Description

      In the scanner, readahead is controlled by "batch_readahead" and "fragment_readahead" (both specified in the scan options).  This was mainly motivated by my work with CSV, where the defaults of 32 and 8 cause the scanner to buffer up to ~256MB of data (32 × 8 blocks at the default block size of 1MB).
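      For reference, a minimal sketch of the worst-case arithmetic, assuming the current defaults from ScanOptions and the 1MB CSV default block size (a standalone illustration, not Arrow code):

      {code:cpp}
      #include <iostream>

      int main() {
        const int batch_readahead = 32;        // current ScanOptions default
        const int fragment_readahead = 8;      // current ScanOptions default
        const long long block_size = 1 << 20;  // default CSV block size: 1 MB

        // Worst case: every fragment being read ahead holds a full window
        // of decoded blocks in flight at once.
        const long long worst_case =
            static_cast<long long>(batch_readahead) * fragment_readahead * block_size;
        std::cout << (worst_case >> 20) << " MB\n";  // prints "256 MB"
        return 0;
      }
      {code}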

      For Parquet / IPC this would mean we are buffering up to 256 row groups, which is entirely too high.

      Rather than making users figure out complex parameters, we should have a single readahead limit that is specified in bytes.

      This will be "best effort".  I'm not suggesting we support partial reads of row groups / record batches, so if the limit is set very small we might still end up with more in RAM simply because we can only load entire row groups.
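      To make the intended semantics concrete, a hypothetical sketch follows. The option name (readahead_limit_bytes) and the helper type are invented for illustration; nothing here is an existing Arrow API:

      {code:cpp}
      // Hypothetical best-effort limiter; none of these names exist in Arrow.
      struct ByteReadaheadLimiter {
        long long readahead_limit_bytes;  // the single user-facing knob
        long long buffered_bytes = 0;

        // Best effort: a whole batch / row group is always admitted when
        // nothing is buffered, even if it alone exceeds the limit, because
        // partial reads of row groups / record batches are not supported.
        bool CanReadAhead(long long next_batch_bytes) const {
          return buffered_bytes == 0 ||
                 buffered_bytes + next_batch_bytes <= readahead_limit_bytes;
        }

        void OnBatchBuffered(long long batch_bytes) { buffered_bytes += batch_bytes; }
        void OnBatchDelivered(long long batch_bytes) { buffered_bytes -= batch_bytes; }
      };
      {code}

      Under a scheme like this the scanner would stop issuing new reads once the buffered total crosses the limit, rather than counting batches or fragments.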


People

    Assignee: Unassigned
    Reporter: Weston Pace (westonpace)
    Votes: 0
    Watchers: 5
