Details
Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Description
I'm doing this as part of the HDFS buffer management work but splitting it out as a subtask since it's a logically independent change.
ScannerContext currently depends on the scanners calling ReleaseCompletedResources() repeatedly to free up buffers. This works OK today, but if we add a hard limit on the number of I/O buffers, we could hit resource exhaustion if we scan too far ahead without calling ReleaseCompletedResources(). E.g. if we have 3 * 8MB I/O buffers to use and try to scan 25MB before calling ReleaseCompletedResources(), we end up in a state where all three I/O buffers are sitting in the ScannerContext and the scan still needs a fourth.
Certain ScannerContext operations can also exhaust the I/O buffers no matter how frequently ReleaseCompletedResources() is called, because a single call can advance past more data than the available buffers hold. E.g. ReadBytes(25MB) or SkipBytes(25MB) would run into that problem with the current implementation.
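To make the failure mode concrete, here is a minimal toy model of the accumulate-then-release pattern. It is only a sketch: ToyBufferPool, ToyAccumulatingContext, ScanInOneCall and ScanInChunks are hypothetical names invented for illustration and are not Impala's actual classes; only ReadBytes() and ReleaseCompletedResources() echo names from the description above, and their toy signatures are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

constexpr int64_t kMB = 1024 * 1024;
constexpr int64_t kBufferLen = 8 * kMB;  // 8MB per I/O buffer, as in the example above

// Hypothetical pool with a hard cap on the number of I/O buffers.
class ToyBufferPool {
 public:
  explicit ToyBufferPool(int max_buffers) : free_(max_buffers) {}
  bool TryAcquire() {
    if (free_ == 0) return false;
    --free_;
    return true;
  }
  void Release(int n) { free_ += n; }

 private:
  int free_;
};

// Hypothetical context that parks used-up buffers until the caller releases them.
class ToyAccumulatingContext {
 public:
  explicit ToyAccumulatingContext(ToyBufferPool* pool) : pool_(pool) {}

  // Advances the read position by 'len' bytes. Returns false if another
  // buffer is needed but the pool has none left.
  bool ReadBytes(int64_t len) {
    while (len > 0) {
      if (bytes_left_ == 0) {
        // The current buffer is used up: park it instead of returning it.
        if (has_current_) {
          ++completed_;
          has_current_ = false;
        }
        if (!pool_->TryAcquire()) return false;
        has_current_ = true;
        bytes_left_ = kBufferLen;
      }
      const int64_t n = std::min(len, bytes_left_);
      bytes_left_ -= n;
      len -= n;
    }
    return true;
  }

  // Returns all parked buffers to the pool.
  void ReleaseCompletedResources() {
    pool_->Release(completed_);
    completed_ = 0;
  }

 private:
  ToyBufferPool* pool_;
  int completed_ = 0;
  bool has_current_ = false;
  int64_t bytes_left_ = 0;
};

// Scans 'total' bytes in a single call against a pool of 3 buffers.
bool ScanInOneCall(int64_t total) {
  ToyBufferPool pool(3);
  ToyAccumulatingContext ctx(&pool);
  return ctx.ReadBytes(total);
}

// Scans 'total' bytes in 'chunk'-sized calls, releasing after every call.
bool ScanInChunks(int64_t total, int64_t chunk) {
  ToyBufferPool pool(3);
  ToyAccumulatingContext ctx(&pool);
  while (total > 0) {
    const int64_t n = std::min(total, chunk);
    if (!ctx.ReadBytes(n)) return false;
    ctx.ReleaseCompletedResources();
    total -= n;
  }
  return true;
}

// Small self-check of the toy model.
void CheckAccumulatingModel() {
  assert(!ScanInOneCall(25 * kMB));         // needs a 4th buffer, pool is empty
  assert(ScanInChunks(25 * kMB, 8 * kMB));  // frequent releases keep it going
}
```

In this model a single ReadBytes(25MB) fails because the fourth buffer cannot be acquired while three are parked, whereas reading the same 25MB in 8MB calls with a release after each one succeeds. That matches the point above: frequent releases help the scanners, but cannot help one oversized call.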
I spent some time looking at the ScannerContext API and the calling patterns of the scanners and came to the conclusion that there's no requirement for us to accumulate buffers in completed_io_buffers_: after IMPALA-5307 we don't generally assume that the memory returned from previous calls remains valid when the read position in the stream is advanced.
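One way to picture the alternative is a context that hands each buffer back as soon as the read position moves past it. The sketch below is a toy under stated assumptions, not the actual change: CappedPool, EagerReleaseContext and FreeBuffersAfterEagerRead are hypothetical names, and the model assumes callers never keep pointers into data that the read position has already passed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

constexpr int64_t kMiB = 1024 * 1024;
constexpr int64_t kIoBufferLen = 8 * kMiB;  // 8MB per I/O buffer

// Hypothetical pool with a hard cap on the number of I/O buffers.
class CappedPool {
 public:
  explicit CappedPool(int max_buffers) : free_(max_buffers) {}
  bool TryAcquire() {
    if (free_ == 0) return false;
    --free_;
    return true;
  }
  void ReleaseOne() { ++free_; }
  int free_buffers() const { return free_; }

 private:
  int free_;
};

// Hypothetical context that never accumulates completed buffers.
class EagerReleaseContext {
 public:
  explicit EagerReleaseContext(CappedPool* pool) : pool_(pool) {}

  // Advances the read position by 'len' bytes, returning each buffer to the
  // pool as soon as it is fully consumed. Returns false if no buffer is free.
  bool ReadBytes(int64_t len) {
    while (len > 0) {
      if (bytes_left_ == 0) {
        if (has_current_) {
          // Assumption: nobody still points into this buffer.
          pool_->ReleaseOne();
          has_current_ = false;
        }
        if (!pool_->TryAcquire()) return false;
        has_current_ = true;
        bytes_left_ = kIoBufferLen;
      }
      const int64_t n = std::min(len, bytes_left_);
      bytes_left_ -= n;
      len -= n;
    }
    return true;
  }

 private:
  CappedPool* pool_;
  bool has_current_ = false;
  int64_t bytes_left_ = 0;
};

// Reads 'len' bytes against a pool of 3 buffers and reports how many buffers
// are still free afterwards, or -1 if the read ran out of buffers.
int FreeBuffersAfterEagerRead(int64_t len) {
  CappedPool pool(3);
  EagerReleaseContext ctx(&pool);
  if (!ctx.ReadBytes(len)) return -1;
  return pool.free_buffers();
}

// Small self-check of the toy model.
void CheckEagerModel() {
  // Only the partially consumed current buffer is still held.
  assert(FreeBuffersAfterEagerRead(25 * kMiB) == 2);
}
```

Under these assumptions the context holds at most one buffer at a time, so a 25MB read or skip fits comfortably within a cap of three 8MB buffers.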