Apache Arrow / ARROW-14024

[C++] ScanOptions::batch_size not respected in parquet/IPC readers

Details

    Description

      At first glance the Parquet reader looks like it should respect the batch size: the ScanOptions::batch_size property is forwarded into the ArrowReaderProperties for the parquet::arrow::FileReader. However, we then read with ReadOneRowGroup, which doesn't look at the batch_size option.

      The IPC reader simply doesn't look at the property at all.

      Even if we can't control the source read size (e.g. we have to read a full row group / record batch and have no control over its size) we can still split whatever we read into smaller batches that respect the batch size. This is important for achieving parallelism as we can then partition the CPU work across these batches.
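
      The splitting step described above can be sketched without touching the readers themselves. The following is a minimal illustration, not the fix that was implemented for this issue: `SliceRange` and `PlanSlices` are hypothetical names introduced here, and the sketch only computes the (offset, length) pairs that cover a batch of `num_rows` rows in pieces of at most `batch_size` rows. In Arrow C++, each pair could then be applied with `RecordBatch::Slice(offset, length)`, which is zero-copy.

      ```cpp
      #include <algorithm>
      #include <cassert>
      #include <cstdint>
      #include <vector>

      // Hypothetical helper type (not part of Arrow): one slice of a larger batch.
      struct SliceRange {
        int64_t offset;  // first row of the slice
        int64_t length;  // number of rows in the slice
      };

      // Hypothetical helper (not part of Arrow): plan slices that cover
      // `num_rows` rows, each holding at most `batch_size` rows. Only the
      // final slice may be shorter than `batch_size`.
      std::vector<SliceRange> PlanSlices(int64_t num_rows, int64_t batch_size) {
        assert(batch_size > 0 && "batch_size must be positive");
        std::vector<SliceRange> ranges;
        for (int64_t offset = 0; offset < num_rows; offset += batch_size) {
          // Clamp the last slice to the rows that remain.
          ranges.push_back({offset, std::min(batch_size, num_rows - offset)});
        }
        return ranges;
      }
      ```

      For example, a 10-row batch with a batch size of 4 yields three slices of 4, 4, and 2 rows. Because the slices are independent, each one can be handed to a separate CPU task, which is the parallelism benefit mentioned above.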

            People

              lidavidm David Li
              westonpace Weston Pace
              Votes: 0
              Watchers: 4

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2h 10m