In HDFS-11384, a mechanism was added to spread out the getBlocks RPC calls issued by the Balancer/Mover, to alleviate load on the NameNode, since getBlocks can be very expensive and the Balancer should not impact normal cluster operation.
Unfortunately, this mechanism does not work as expected, especially when the dispatcher thread count is low. The primary issue is that the delay is applied only to the first N threads submitted to the dispatcher's executor, where N is the size of the dispatcher's threadpool, and even among those it is not applied to the first R threads, where R is the number of allowed getBlocks QPS (currently hardcoded to 20). For example, if the threadpool size is 100 (the default), threads 0-19 have no delay, threads 20-99 have increasing levels of delay, and threads 100+ have no delay. As I understand it, the intent of the logic was that the delay applied to the first 100 threads would force all of the dispatcher executor's threads to be consumed, thus blocking subsequent (non-delayed) threads until the delay period has expired. However, threads 0-19 can finish very quickly (their work can often be fulfilled in the time it takes to execute a single getBlocks RPC, on the order of tens of milliseconds), thus opening up 20 new slots in the executor, which are then consumed by non-delayed threads 100-119, and so on. So, although 80 threads have had a delay applied, the non-delayed threads rush through the 20 non-delayed slots.
This problem gets even worse when the dispatcher threadpool size is less than the max getBlocks QPS. For example, if the threadpool size is 10, no threads ever have a delay applied, and the feature is effectively disabled.
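To make the two cases above concrete, here is a small Java model of the delay assignment as described in this report. It is a sketch, not the Dispatcher code: the class GetBlocksDelayModel, the method delayMillis, and the one-second step between delay levels are all hypothetical and chosen only to reproduce the behavior described (no delay for the first R threads, increasing delay up to the threadpool size N, no delay afterwards).

```java
// Hypothetical model of the behavior described above; not the actual Dispatcher code.
final class GetBlocksDelayModel {

  /**
   * Delay assigned to the task submitted at position {@code index}.
   * Assumption: the delay grows by {@code stepMillis} for every {@code maxQps}
   * tasks, and is only applied to the first {@code poolSize} tasks.
   */
  static long delayMillis(int index, int poolSize, int maxQps, long stepMillis) {
    if (index >= poolSize) {
      return 0L; // tasks beyond the first N are never delayed
    }
    // Integer division: tasks 0..maxQps-1 fall in level 0 and get no delay.
    return (long) (index / maxQps) * stepMillis;
  }

  /** Counts how many of the first {@code tasks} tasks get no delay at all. */
  static int undelayedCount(int tasks, int poolSize, int maxQps, long stepMillis) {
    int count = 0;
    for (int i = 0; i < tasks; i++) {
      if (delayMillis(i, poolSize, maxQps, stepMillis) == 0L) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    int maxQps = 20;        // R, the hardcoded getBlocks QPS limit
    long stepMillis = 1000; // assumed step between delay levels
    int[] samples = {0, 19, 20, 99, 100, 119};
    for (int poolSize : new int[] {100, 10}) {
      for (int index : samples) {
        System.out.printf("poolSize=%d index=%d delayMs=%d%n",
            poolSize, index, delayMillis(index, poolSize, maxQps, stepMillis));
      }
    }
  }
}
```

Under this model, with a threadpool size of 100 only tasks 20-99 are delayed, and with a threadpool size of 10 no task is delayed regardless of how many are submitted, which matches the two examples given above.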
This problem wasn't surfaced in the original JIRA because the test incorrectly measured the period across which getBlocks RPCs were distributed. The variables startGetBlocksTime and endGetBlocksTime were used to track the time over which the getBlocks calls were made. However, startGetBlocksTime was initialized when the FSNamesystem spy was created, which is before the mock DataNodes are started. Even worse, the Balancer in this test takes 2 iterations to complete balancing the cluster, so the time period endGetBlocksTime - startGetBlocksTime actually represents:
Thus, the RPC QPS reported by the test is much lower than the RPC QPS seen during the period of initial block fetching.
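One way to avoid this kind of dilution is to bound the measurement window by the first and last observed getBlocks calls rather than by test setup time. The Java sketch below illustrates the idea only; GetBlocksWindow and its methods are hypothetical names, not part of the existing test, and timestamps are passed in explicitly so the arithmetic is easy to check.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper; not part of the existing Balancer test.
final class GetBlocksWindow {
  private final AtomicLong first = new AtomicLong(Long.MAX_VALUE);
  private final AtomicLong last = new AtomicLong(Long.MIN_VALUE);
  private final AtomicLong calls = new AtomicLong();

  /** Record one getBlocks call observed at {@code nowMillis}. */
  void record(long nowMillis) {
    first.accumulateAndGet(nowMillis, Math::min);
    last.accumulateAndGet(nowMillis, Math::max);
    calls.incrementAndGet();
  }

  /** Calls per second over an explicit window (at least 1 ms long). */
  static double qps(long calls, long startMillis, long endMillis) {
    long elapsed = Math.max(1L, endMillis - startMillis);
    return calls * 1000.0 / elapsed;
  }

  /** QPS over the window bounded by the first and last recorded calls. */
  double observedQps() {
    long n = calls.get();
    return n == 0 ? 0.0 : qps(n, first.get(), last.get());
  }

  /** QPS if the window is instead started at some earlier setup time. */
  double qpsSince(long setupMillis) {
    long n = calls.get();
    return n == 0 ? 0.0 : qps(n, setupMillis, last.get());
  }
}
```

For example, 40 calls spaced 25 ms apart starting 10 seconds after setup give roughly 41 QPS over the call-bounded window, but under 4 QPS if the window starts at setup time, which is the kind of underestimate described above.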