For a bit more context, we had roughly 6-7k tasks (erroneously) issuing listLocatedStatus. Each response, although limited, was over 1 MB. The handler attempts a non-blocking write of the response. If the entire response cannot be written, the call is queued for the background responder thread. The kernel accepts well under 1 MB on a non-blocking write, so every one of these responses ended up queued for the responder thread.
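To make the write path above concrete, here is a minimal sketch of "try one non-blocking write, otherwise hand off to the responder". It is not the actual Hadoop IPC server code: `NonBlockingWriteSketch`, `LimitedChannel`, `sendResponse`, and the 64 KB per-write limit are all hypothetical stand-ins chosen for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical sketch of the handler's write path; not the real Hadoop code. */
class NonBlockingWriteSketch {

    /** Stand-in for a non-blocking socket that accepts a bounded number of bytes per write. */
    static final class LimitedChannel implements WritableByteChannel {
        private final int maxBytesPerWrite;

        LimitedChannel(int maxBytesPerWrite) {
            this.maxBytesPerWrite = maxBytesPerWrite;
        }

        @Override
        public int write(ByteBuffer src) {
            int n = Math.min(maxBytesPerWrite, src.remaining());
            src.position(src.position() + n); // consume n bytes, as a real channel would
            return n;
        }

        @Override
        public boolean isOpen() {
            return true;
        }

        @Override
        public void close() {
        }
    }

    /**
     * Attempts a single write. Returns true if the whole response went out inline;
     * otherwise queues the partially written buffer for the responder and returns false.
     */
    static boolean sendResponse(WritableByteChannel channel, ByteBuffer response,
                                Queue<ByteBuffer> responderQueue) {
        try {
            channel.write(response);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (!response.hasRemaining()) {
            return true;
        }
        responderQueue.add(response); // the buffer stays referenced until fully sent
        return false;
    }

    public static void main(String[] args) {
        Queue<ByteBuffer> queue = new ArrayDeque<>();
        WritableByteChannel socket = new LimitedChannel(64 * 1024); // assumed limit
        boolean inline = sendResponse(socket, ByteBuffer.allocate(1 << 20), queue);
        System.out.println("written inline: " + inline + ", queued: " + queue.size());
    }
}
```

With any response larger than what a single write accepts, every call lands in the queue, which matches the behavior described above.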
The call's response byte buffer tracks the position of the last write, so the entire response buffer is retained until the full response has been sent. Re-allocating a buffer that holds only the unsent portion would likely introduce additional memory pressure, so the simplest and most logical change is to limit the response size of the located status.
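The retention behavior can be illustrated with a plain `java.nio.ByteBuffer`. This is only a sketch under assumptions: the class and method names are hypothetical, and the 64 KB "already sent" figure is made up. It shows that advancing the position frees nothing, and that copying the unsent tail briefly needs both buffers on the heap.

```java
import java.nio.ByteBuffer;

/** Hypothetical illustration of why a partially sent response pins its whole buffer. */
class ResponseRetentionSketch {

    /** Heap held by the response: the full capacity, regardless of how much was sent. */
    static int retainedBytes(ByteBuffer response) {
        return response.capacity();
    }

    /** Bytes still waiting to be written. */
    static int unsentBytes(ByteBuffer response) {
        return response.remaining();
    }

    /** The alternative discussed above: copy the unsent tail into a fresh, smaller buffer. */
    static ByteBuffer copyUnsent(ByteBuffer response) {
        ByteBuffer tail = ByteBuffer.allocate(response.remaining());
        tail.put(response.duplicate()); // duplicate() leaves the original position untouched
        tail.flip();
        return tail;
    }

    /** While the copy is in progress, the old and new buffers are both live. */
    static long peakBytesWhileCopying(ByteBuffer response) {
        return (long) response.capacity() + response.remaining();
    }

    public static void main(String[] args) {
        ByteBuffer response = ByteBuffer.allocate(1 << 20);
        response.position(64 * 1024); // pretend the first write sent 64 KB (assumed figure)
        System.out.println("retained: " + retainedBytes(response));
        System.out.println("unsent: " + unsentBytes(response));
        System.out.println("peak while copying: " + peakBytesWhileCopying(response));
    }
}
```

The copy only pays off once the original buffer is collected, so under GC pressure it makes things worse before it makes them better.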
The end result in our case was the heap bloating by over 8 GB. Full GCs kicked in, and the NN was unresponsive for up to 5 minutes at a time. Each time it woke up it marked DNs as dead, causing a flurry of replications that further aggravated the memory issue. Because of other bugs that this exposed, the NN required a restart.
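As a rough sanity check on that heap figure, the sketch below multiplies the number of queued responses by an assumed size. It assumes every response is exactly 1 MiB (the low end of "over 1M") and ignores per-object overhead, so it gives a lower bound only; `HeapEstimateSketch` and `pinnedBytes` are hypothetical names.

```java
import java.util.Locale;

/** Back-of-the-envelope lower bound on heap pinned by queued responses. */
class HeapEstimateSketch {

    /** Queued responses times bytes per response; throws on overflow. */
    static long pinnedBytes(long queuedResponses, long bytesPerResponse) {
        return Math.multiplyExact(queuedResponses, bytesPerResponse);
    }

    static double toGiB(long bytes) {
        return bytes / (double) (1L << 30);
    }

    public static void main(String[] args) {
        long oneMiB = 1L << 20; // assumed response size
        System.out.println(String.format(Locale.ROOT, "%.2f GiB",
                toGiB(pinnedBytes(6_000, oneMiB)))); // prints 5.86 GiB
        System.out.println(String.format(Locale.ROOT, "%.2f GiB",
                toGiB(pinnedBytes(7_000, oneMiB)))); // prints 6.84 GiB
    }
}
```

Since each response was larger than 1 MB, a lower bound of roughly 6-7 GiB is in line with the observed growth of over 8 GB.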
Although satisfying a large request now takes more RPCs, I believe the tradeoff is reasonable. Such large requests are also unlikely to be a common occurrence.
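To show the shape of that tradeoff, here is a minimal sketch that splits a listing into batches under a byte budget and reports how many RPCs result. It is not the actual patch: `BatchLimitSketch`, `split`, the per-entry sizes, and the budget are all hypothetical, and the real limit may well be expressed in different units.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of capping response size by splitting a listing into batches. */
class BatchLimitSketch {

    /**
     * Groups consecutive entries so that each batch stays within maxBatchBytes.
     * A single entry larger than the budget still gets a batch of its own.
     */
    static List<List<Integer>> split(int[] entrySizes, int maxBatchBytes) {
        List<List<Integer>> batches = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long currentBytes = 0;
        for (int size : entrySizes) {
            if (!current.isEmpty() && currentBytes + size > maxBatchBytes) {
                batches.add(current); // close the batch before it exceeds the budget
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }

    public static void main(String[] args) {
        int[] entrySizes = {400, 400, 400, 400, 400}; // made-up per-entry sizes in bytes
        System.out.println("RPCs needed: " + split(entrySizes, 1000).size());
    }
}
```

Each batch maps to one RPC, so a smaller budget means more round trips but a bounded amount of memory pinned per call.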