Apache Arrow / ARROW-17599

[C++] ReadRangeCache should not retain data after read


Details

    Description

      I've added a unit test of the issue here: https://github.com/westonpace/arrow/tree/experiment/read-range-cache-retention

      We use the ReadRangeCache for pre-buffering IPC and Parquet files. Sometimes those files are quite large (gigabytes). The usage is roughly:

      for X in num_row_groups:
          CacheAllThePiecesWeNeedForRowGroupX
          WaitForPiecesToArriveForRowGroupX
          ReadThePiecesWeNeedForRowGroupX

      However, once we have read row group X and passed it on to Acero, etc., we do not release the data for row group X. The read range cache's entries vector still holds a pointer to the buffer. The data is not released until the file reader itself is destroyed, which happens only after we have finished processing the entire file.

      This leads to excessive memory usage when pre-buffering is enabled.
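
The retention pattern described above can be illustrated with a minimal, self-contained sketch. It does not use Arrow: `ToyBuffer`, `ToyRetainingCache`, and `BytesHeldAfterReadingAll` are hypothetical names, and the only assumption carried over from the description is that the cache keeps a pointer to every buffer in an entries vector for its whole lifetime.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for a data buffer; this is not Arrow's Buffer class.
using ToyBuffer = std::vector<std::uint8_t>;

// Toy cache that keeps every entry in a vector until the cache is destroyed,
// mirroring the behavior described in this issue.
class ToyRetainingCache {
 public:
  // "Caches" a range by allocating a buffer of the requested size.
  void Cache(std::size_t nbytes) {
    entries_.push_back(std::make_shared<ToyBuffer>(nbytes));
  }

  // Hands out the i-th entry. The cache keeps its own reference.
  std::shared_ptr<ToyBuffer> Read(std::size_t i) const { return entries_.at(i); }

  // Total bytes still pinned by the cache.
  std::size_t BytesHeld() const {
    std::size_t total = 0;
    for (const auto& entry : entries_) total += entry->size();
    return total;
  }

 private:
  std::vector<std::shared_ptr<ToyBuffer>> entries_;
};

// Simulates the per-row-group loop and returns the bytes still pinned by the
// cache after every row group has been read and handed downstream.
std::size_t BytesHeldAfterReadingAll(std::size_t num_row_groups,
                                     std::size_t bytes_per_group) {
  ToyRetainingCache cache;
  for (std::size_t i = 0; i < num_row_groups; ++i) {
    cache.Cache(bytes_per_group);
    std::weak_ptr<ToyBuffer> observer;
    {
      std::shared_ptr<ToyBuffer> data = cache.Read(i);
      observer = data;
      // ... a downstream consumer would process `data` here ...
    }  // The consumer's reference is dropped here.
    // The buffer is still alive because the cache holds a reference.
    assert(!observer.expired());
  }
  return cache.BytesHeld();
}
```

In this toy model the bytes held grow linearly with the number of row groups read, even though no consumer still holds a reference to any of them.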

      This could be a little difficult to implement because a single cache entry could be shared by multiple read ranges, so we will need some kind of reference counting to know when we have fully finished with an entry and can release it.
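
One possible shape for that reference counting, as a rough sketch only. It is not the ReadRangeCache implementation and does not use Arrow. `CountedBuffer`, `ToyCountingCache`, the entry ids, and `DemoSharedEntry` are hypothetical, and the sketch assumes the cache knows up front how many requested ranges each entry serves.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

// Hypothetical stand-in for a data buffer; this is not Arrow's Buffer class.
using CountedBuffer = std::vector<std::uint8_t>;

// Toy cache in which each entry counts the ranges that still need to read it.
class ToyCountingCache {
 public:
  // Registers one entry that serves `num_ranges` requested ranges
  // (precondition: num_ranges > 0). Returns a hypothetical entry id.
  std::size_t Cache(std::size_t nbytes, std::size_t num_ranges) {
    assert(num_ranges > 0);
    const std::size_t id = next_id_++;
    entries_[id] = Entry{std::make_shared<CountedBuffer>(nbytes), num_ranges};
    return id;
  }

  // Reads one range out of the entry. Once the last pending range has been
  // read, the cache drops its reference. Returns nullptr for unknown or
  // already released ids.
  std::shared_ptr<CountedBuffer> Read(std::size_t id) {
    auto it = entries_.find(id);
    if (it == entries_.end()) return nullptr;
    std::shared_ptr<CountedBuffer> buffer = it->second.buffer;
    if (--it->second.pending_ranges == 0) entries_.erase(it);
    return buffer;
  }

  // Total bytes still pinned by the cache.
  std::size_t BytesHeld() const {
    std::size_t total = 0;
    for (const auto& kv : entries_) total += kv.second.buffer->size();
    return total;
  }

 private:
  struct Entry {
    std::shared_ptr<CountedBuffer> buffer;
    std::size_t pending_ranges;
  };
  std::map<std::size_t, Entry> entries_;
  std::size_t next_id_ = 0;
};

// Two ranges share one entry. Returns the bytes held by the cache after
// caching, after the first read, and after the second read.
std::vector<std::size_t> DemoSharedEntry() {
  ToyCountingCache cache;
  std::vector<std::size_t> held;
  const std::size_t id = cache.Cache(/*nbytes=*/1024, /*num_ranges=*/2);
  held.push_back(cache.BytesHeld());
  cache.Read(id);  // One range is still pending, so the entry stays.
  held.push_back(cache.BytesHeld());
  cache.Read(id);  // Last pending range: the cache releases the entry.
  held.push_back(cache.BytesHeld());
  return held;
}
```

The sketch treats every read as consuming one pending range, so a range that is read twice would release the entry too early. A real change would have to decide how to handle repeated reads and reads that are never issued.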

            People

              aucahuasi Percy Camilo Triveño Aucahuasi
              westonpace Weston Pace
              Votes: 0
              Watchers: 7


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 4h