Apache Arrow / ARROW-17299

[C++] [Python] Expose the Scanner kDefaultBatchReadahead and kDefaultFragmentReadahead parameters

Details

    Description

      The Scanner has two parameters, kDefaultFragmentReadahead and kDefaultBatchReadahead, that are currently hard-coded and cannot be changed by the user.

      This is a problem because tuning these numbers is the key to trading off RAM usage against network I/O utilization during reads. For example, on an AWS i3.2xlarge instance, peak throughput is only reached after quadrupling kDefaultFragmentReadahead from its default.

      The current settings are very conservative and assume a network slower than 1 Gbps. Exposing them would allow people to tune the Scanner's behavior to their own hardware.
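
To make the RAM side of this tradeoff concrete, here is a rough back-of-the-envelope sketch. It is not Arrow code: the function name, the batch size, and the readahead values are all hypothetical, and it assumes a simplified model in which every read-ahead fragment may hold a full set of read-ahead batches in memory at once.

```python
MIB = 1024 * 1024


def readahead_memory_bytes(fragment_readahead: int,
                           batch_readahead: int,
                           avg_batch_bytes: int) -> int:
    """Rough upper bound on bytes buffered by readahead.

    Simplified model (an assumption, not Arrow's actual accounting):
    up to `fragment_readahead` fragments are open at once, and each
    may buffer up to `batch_readahead` batches of `avg_batch_bytes`.
    """
    if min(fragment_readahead, batch_readahead, avg_batch_bytes) < 0:
        raise ValueError("all arguments must be non-negative")
    return fragment_readahead * batch_readahead * avg_batch_bytes


# Illustrative values only; these are not claimed to be Arrow's defaults.
baseline = readahead_memory_bytes(fragment_readahead=4,
                                  batch_readahead=16,
                                  avg_batch_bytes=8 * MIB)
quadrupled = readahead_memory_bytes(fragment_readahead=16,
                                    batch_readahead=16,
                                    avg_batch_bytes=8 * MIB)

print(f"baseline:   {baseline // MIB} MiB")
print(f"quadrupled: {quadrupled // MIB} MiB")
```

Under this model, quadrupling the fragment readahead also quadruples the worst-case buffered memory, which is why a single fixed value cannot suit both small-memory machines and high-bandwidth instances.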

            People

              marsupialtail Ziheng Wang

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2h 20m