The problem we are facing: it takes a long time for all tasks of a job to get scheduled on the cluster, even when the cluster is almost empty.
There are two reasons that together lead to this situation:
1. The load factor makes sure each TT (TaskTracker) runs the same number of tasks. (This is the part that this patch tries to change.)
2. The scheduler tries to schedule map tasks locally (first node-local, then rack-local). Each locality level has a wait time (mapred.fairscheduler.localitywait.node and mapred.fairscheduler.localitywait.rack, both around 10 sec in our conf), and each job tracks an accumulated wait time (JobInfo.localityWait). The accumulated wait time is reset to 0 whenever a non-local map task is scheduled, so it takes about N * wait_time to schedule N non-local map tasks.
Because of 1, many TTs cannot take more tasks even though they have free slots. As a result, many of the map tasks cannot be scheduled locally.
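To make the effect of 1 concrete, here is a minimal sketch of a load-factor cap. It is a simplified model, not the scheduler's actual code: the function name per_tracker_cap, its parameters, and the cluster numbers are all hypothetical, and it assumes the cap is the tracker's slot count scaled by cluster-wide load and rounded up.

```python
import math


def per_tracker_cap(runnable_tasks, total_slots, slots_per_tracker):
    """Hypothetical model of a load-factor cap on one TaskTracker.

    Cluster load is runnable tasks divided by total slots, clamped to 1.0.
    Each tracker may run at most ceil(slots_per_tracker * load) tasks.
    """
    load = min(1.0, runnable_tasks / total_slots)
    return math.ceil(slots_per_tracker * load)


# Illustrative numbers only: 100 trackers with 4 map slots each and a job
# with 80 runnable maps on an otherwise empty cluster.
cap = per_tracker_cap(runnable_tasks=80, total_slots=400, slots_per_tracker=4)
print(cap)  # each tracker is capped at 1 task, leaving 3 slots idle
```

Under this model, a TT that holds the input blocks for several of the job's maps can still run only one of them, so the remaining maps have to go to other nodes as non-local tasks.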
Because of 2, each non-local task can only be scheduled after a full wait, so it's really slow to schedule non-local tasks.
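The N * wait_time cost can be sketched with a small simulation. This is an illustrative model only: it uses a single wait threshold instead of separate node and rack levels, a hypothetical 1-second heartbeat step, and made-up names (time_to_schedule_non_local, locality_wait_sec) that do not come from the scheduler.

```python
def time_to_schedule_non_local(num_tasks, locality_wait_sec, step_sec=1.0):
    """Simulate launching num_tasks non-local maps for one job.

    The job accumulates wait time on every step. A non-local task may launch
    only once the accumulated wait reaches locality_wait_sec, and the
    accumulated wait is reset to 0 after each such launch.
    """
    elapsed = 0.0
    accumulated = 0.0
    launched = 0
    while launched < num_tasks:
        elapsed += step_sec
        accumulated += step_sec
        if accumulated >= locality_wait_sec:
            launched += 1
            accumulated = 0.0  # the reset is what makes the cost linear in N
    return elapsed


# Illustrative numbers only: 12 non-local maps with a 10 sec wait.
total = time_to_schedule_non_local(num_tasks=12, locality_wait_sec=10.0)
print(total)  # 120.0 seconds
```

Because the wait is paid again for every non-local launch, even a dozen non-local maps add up to minutes of scheduling delay in this model.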
As a result, we sometimes see it take more than 2 minutes to schedule all the mappers of a job.