Stepping back a bit to think about the model. Correct me if you disagree:
Our end goal is that the reducers finish fetching map output as soon as possible after the last mapper finishes, while the reducers are started as late as possible, so they don't occupy slots and hurt utilization.
So, let's assume that the mappers generate data at some rate M. The reducers can fetch data at some maximum rate R. It's the ratio M/R that determines when to start the reducers fetching. For example, if the mappers generate data faster than the reducers can fetch it, it behooves us to start the reduce fetch immediately when the job starts. If the reducers can fetch twice as fast as the mappers can output, we want to start the reducers halfway through the map phase.
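To make the halfway-through example concrete, here's a toy sketch of this model (the function name and the closed-form expression are mine, not from any implementation): total output D takes D/M to produce and D/R to fetch, so starting the fetch at time D/M - D/R lets it finish right when the last mapper does, which as a fraction of the map phase is 1 - M/R.

```python
def slowstart_fraction(map_rate: float, fetch_rate: float) -> float:
    """Fraction of the map phase to wait before starting reduce fetch.

    Toy model: output of size D takes D/map_rate to produce and
    D/fetch_rate to fetch. Starting the fetch at D/map_rate - D/fetch_rate
    makes it finish exactly when the last mapper does; as a fraction of
    the map phase that's 1 - map_rate/fetch_rate, clamped at 0 when the
    mappers outpace the reducers.
    """
    return max(0.0, 1.0 - map_rate / fetch_rate)

print(slowstart_fraction(1.0, 2.0))  # reducers fetch 2x as fast -> 0.5
print(slowstart_fraction(2.0, 1.0))  # mappers outpace reducers -> 0.0
```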
Since both kinds of tasks have some fixed startup cost, it's as if the average rate were slowed down by a factor determined by the number of tasks. In the case of 200 mappers and 1 reducer, it's as if the map output rate had been lowered (since the fixed costs of the map tasks slow down map completion), and thus we can afford to wait until later to start the reducer. If you have 1 mapper and 1 reducer, even for the exact same job, the ratio swings as if the map side were outputting faster, and thus we want to start the reduce early.
This is of course a much simplified model, but I think it's worth discussing in somewhat abstract terms before we discuss the implementation details. One factor I'm ignoring above is the limiting that the reducer does with respect to particular hosts - that is to say, the reducer's fetch speed varies with the number of unique hosts, not just the number of mappers.