Here are some additional details on the scenario that prompted filing this issue. Thanks to Gopal V for sharing the details.
Gopal has a YARN application that performs strictly sequential reads of HDFS files. The application may rapidly iterate through a large number of blocks. This is because each block begins with a small metadata header, and based on its contents, the application can often decide that there is nothing relevant in the rest of the block. If that happens, then the application seeks all the way past that block. Gopal estimates that this code could feasibly scan through ~100 HDFS blocks in ~10 seconds.
This usage pattern in combination with zero-copy read causes retention of a large number of memory-mapped regions in the ShortCircuitCache. Eventually, YARN's resource check kills the container process for exceeding the enforced physical memory bounds. The asynchronous nature of our munmap calls was surprising for Gopal, who had carefully calculated his memory usage to stay under YARN's resource checks.
As a workaround, I advised Gopal to lower dfs.client.mmap.cache.timeout.ms so that the munmap happens more quickly. A better solution would be to provide support in the HDFS client for a caching policy that fits this usage pattern. Two possibilities are:
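For reference, the workaround looks something like the following in hdfs-site.xml. The value shown here is just an illustration; the right number depends on how quickly the application needs the mappings released:

```xml
<!-- Shorten how long unreferenced mmap'ed regions linger in the
     ShortCircuitCache before the background thread munmaps them.
     1000 ms is an example value, not a recommendation. -->
<property>
  <name>dfs.client.mmap.cache.timeout.ms</name>
  <value>1000</value>
</property>
```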
- LRU bounded by a client-specified maximum memory size. (Note this is maximum memory size and not number of files or number of blocks, because of the possibility of differing block counts and block sizes.)
- Do not cache at all. Effectively, there is only one memory-mapped region alive at a time. The sequential read usage pattern described above always results in a cache miss anyway, so a cache adds no value.
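To make the first policy concrete, here is a minimal sketch of a byte-bounded LRU, using an access-ordered LinkedHashMap. The class and field names are hypothetical, long keys stand in for block IDs, and the real implementation would issue a munmap at the eviction point:

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical sketch: LRU cache of memory-mapped regions bounded by
 * total mapped bytes rather than by entry count, so differing block
 * sizes are accounted for correctly.
 */
class ByteBoundedLruCache {
    private final long maxBytes;
    private long currentBytes = 0;

    // accessOrder=true gives least-recently-used iteration order.
    private final LinkedHashMap<Long, Long> regions =
            new LinkedHashMap<>(16, 0.75f, true);

    ByteBoundedLruCache(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Record a mapped region of mappedSize bytes for blockId. */
    void put(long blockId, long mappedSize) {
        Long prev = regions.put(blockId, mappedSize);
        if (prev != null) {
            currentBytes -= prev;
        }
        currentBytes += mappedSize;
        evictToLimit();
    }

    boolean contains(long blockId) {
        return regions.get(blockId) != null; // get() refreshes LRU order
    }

    long currentBytes() {
        return currentBytes;
    }

    private void evictToLimit() {
        Iterator<Map.Entry<Long, Long>> it = regions.entrySet().iterator();
        while (currentBytes > maxBytes && it.hasNext()) {
            Map.Entry<Long, Long> eldest = it.next();
            // The real client would munmap the evicted region here,
            // synchronously, before returning to the caller.
            currentBytes -= eldest.getValue();
            it.remove();
        }
    }
}
```

The second policy falls out of the same sketch by setting the bound to zero, which forces eviction of any previously cached region on each put.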
I don't propose removing the current time-triggered threshold, because I think that's valid for other use cases. I only propose adding support for new policies.
In addition to the caching policy itself, I want to propose a way to run the munmap calls synchronously on the caller's thread instead of in a background thread. This would be a better fit for clients who want deterministic resource cleanup. Right now, we have no way to guarantee that the OS will schedule the CacheCleaner thread ahead of YARN's resource check thread. This isn't a proposal to remove support for the background thread, only to add support for synchronous munmap.
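To illustrate what deterministic cleanup buys the client, here is a hypothetical AutoCloseable wrapper around a mapped region; the class name is invented and the munmap is simulated by a Runnable, but the point is that release happens on the caller's thread at a known program point rather than whenever the cleaner thread gets scheduled:

```java
/**
 * Hypothetical illustration of synchronous cleanup. close() performs
 * the unmap (simulated here) on the calling thread, so the memory is
 * released before the next line of the caller executes -- and before
 * YARN's resource check can observe it.
 */
class SyncMmapRegion implements AutoCloseable {
    private final Runnable unmapAction; // stand-in for the real munmap
    private boolean closed = false;

    SyncMmapRegion(Runnable unmapAction) {
        this.unmapAction = unmapAction;
    }

    boolean isClosed() {
        return closed;
    }

    @Override
    public void close() {
        if (!closed) {
            closed = true;
            unmapAction.run(); // runs on the caller's thread, no scheduler races
        }
    }
}
```

A client would then use ordinary try-with-resources: when the block exits, the region is unmapped immediately, instead of lingering until a background thread happens to run.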
I think you could also make an argument that YARN shouldn't count these memory-mapped regions towards the container process's RSS. It's really the DataNode process that owns that memory, and clients who mmap the same region shouldn't get penalized. Let's address that part separately though.