Thanks Daryn Sharp! Here is the issue that motivated this jira. After offline discussion with Chris Li and team members, however, we feel that tuning the FairCallQueue configs should achieve the same result.
With FairCallQueue and backoff, we don't get many complaints about one abusive user's impact on other users. The main issue we currently have is a heavy user's impact on datanode service RPC requests, which has been increasing as we continue to expand our cluster size. FairCallQueue applies only to client RPC, not to datanode RPC. There was some discussion in HADOOP-10599 about this. Specifically:
- A heavy user generates lots of RPC requests, but they fill up only 1/4 of the lowest-priority sub queue. However, that is enough to cause lock contention with DN RPC requests.
- So to have backoff kick in sooner for the heavy user, we can reduce the RPC sub queue length. But that will impact all RPC sub queues.
- After the call queue length reduction, if lots of light users belonging to p0 come in at the same time, some of them will get backed off, given that the p0 sub queue is much smaller than before. If calls could instead overflow to the next queue, light users at least wouldn't get backed off.
However, tuning several configs, including the client and service RPC handler counts and the FairCallQueue weights, should be able to achieve the same result.
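To make the backoff scenario in the bullets above concrete, here is a minimal sketch of strict placement versus overflow placement when a sub queue is full. It is not the Hadoop FairCallQueue code; the class name BackoffOverflowSketch and its methods are hypothetical, and calls are modeled as plain strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical illustration only; not the Hadoop FairCallQueue implementation.
class BackoffOverflowSketch {
    // One bounded sub queue per priority level; index 0 is the highest priority (p0).
    private final List<BlockingQueue<String>> subQueues = new ArrayList<>();

    BackoffOverflowSketch(int levels, int capacityPerLevel) {
        for (int i = 0; i < levels; i++) {
            subQueues.add(new ArrayBlockingQueue<>(capacityPerLevel));
        }
    }

    // Strict placement: the call may only enter its own sub queue.
    // A false return models the caller being backed off.
    boolean offerStrict(int priority, String call) {
        return subQueues.get(priority).offer(call);
    }

    // Overflow placement: try the call's own sub queue first, then each
    // lower-priority sub queue. Returns the level that accepted the call,
    // or -1 if every remaining sub queue is full (the caller is backed off).
    int offerWithOverflow(int priority, String call) {
        for (int level = priority; level < subQueues.size(); level++) {
            if (subQueues.get(level).offer(call)) {
                return level;
            }
        }
        return -1;
    }

    // Number of calls currently waiting at the given level.
    int size(int level) {
        return subQueues.get(level).size();
    }
}
```

With two levels of capacity 2, a third p0 call is rejected by offerStrict once p0 is full, while offerWithOverflow places it in the next sub queue instead of backing the light user off.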
On a related note, if FairCallQueue is used but backoff is disabled, as mentioned in the description, the put method will move on to the next queue until it lands on the last queue. It isn't clear why it can't just block on the corresponding sub queue instead. In other words, why is overflow useful in the blocking case? Is it to reduce the chance of the reader threads being blocked? Still, it seems config tuning can also achieve that, similar to the argument for the backoff case.
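For reference, the two blocking behaviors being compared can be sketched roughly as follows. This is an illustration under assumptions, not the actual FairCallQueue source; BlockingOverflowSketch, putWithOverflow and putOnOwnQueue are hypothetical names, and calls are modeled as plain strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical illustration only; not the Hadoop FairCallQueue implementation.
class BlockingOverflowSketch {
    // One bounded sub queue per priority level; index 0 is the highest priority.
    private final List<BlockingQueue<String>> subQueues = new ArrayList<>();

    BlockingOverflowSketch(int levels, int capacityPerLevel) {
        for (int i = 0; i < levels; i++) {
            subQueues.add(new ArrayBlockingQueue<>(capacityPerLevel));
        }
    }

    // Overflow behavior as described above: try a non-blocking offer on each
    // sub queue from the call's own level down to the second-to-last level,
    // and block only on the last sub queue. Returns the level that took the call.
    int putWithOverflow(int priority, String call) throws InterruptedException {
        int last = subQueues.size() - 1;
        for (int level = priority; level < last; level++) {
            if (subQueues.get(level).offer(call)) {
                return level;
            }
        }
        subQueues.get(last).put(call); // blocks the calling thread while the last sub queue is full
        return last;
    }

    // The alternative raised in the question: block directly on the call's
    // own sub queue whenever it is full, without trying other levels.
    int putOnOwnQueue(int priority, String call) throws InterruptedException {
        subQueues.get(priority).put(call);
        return priority;
    }
}
```

In this sketch the calling thread (a reader thread, in the scenario above) blocks under putWithOverflow only when every sub queue from the call's level downward is full, whereas putOnOwnQueue blocks as soon as the call's own sub queue is full.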