We've seen cases where simple count queries over tables with a few thousand partitions are slow to start up, when the cluster is over a hundred nodes or so.
There are several problems, but here's the most significant: Impala in version 2.3.0 and earlier starts its fragments ‘top-down’ - i.e. a fragment is started before all fragments that send to it are. This means starting the coordinator fragment, then all the fragments at depth 1 in the plan tree that send to the coordinator fragment, then all the fragments at depth 2 and so on. Otherwise a fragment that starts before its recipients do will fail as soon as it transmits data, because the receiving Impala server does not have knowledge of the receiving fragment and returns an error.
This means that the coordinator performs N rounds of fragment start-up, where N is the number of plan fragments in the plan tree (because even within a level, fragment dispatch is serialised left-to-right). Queries often do not start doing useful work (i.e. scans or other leaf operators) until the 2nd or later round of RPCs. Parallelism is lost during RPC dispatch as well, as skew in large clusters usually means that the latency of each round is dominated by one or two stragglers, during which time the coordinator is effectively paused (and could be usefully sending more fragment exec RPCs).