Details
- Type: Improvement
- Status: In Progress
- Priority: Major
- Resolution: Unresolved
Description
I've added a unit test reproducing the issue here: https://github.com/westonpace/arrow/tree/experiment/read-range-cache-retention
We use the ReadRangeCache for pre-buffering IPC and parquet files. Sometimes those files are quite large (gigabytes). The usage is roughly:
for X in num_row_groups:
    CacheAllThePiecesWeNeedForRowGroupX
    WaitForPiecesToArriveForRowGroupX
    ReadThePiecesWeNeedForRowGroupX
However, once we've read in row group X and passed it on to Acero, etc., we do not release the data for row group X. The read range cache's entries vector still holds a pointer to the buffer, so the data is not released until the file reader itself is destroyed, which only happens once we have finished processing the entire file.
This leads to excessive memory usage when pre-buffering is enabled.
This could be a little difficult to implement because a single cache entry can be shared by multiple read ranges, so we will need some kind of reference counting to know when an entry has been fully consumed and can be released.
Attachments
Issue Links
- is related to:
  - ARROW-18113 [C++] Implement a read range process without caching (Resolved)
  - ARROW-17590 Lower memory usage with filters (Closed)