Remote Kudu scans can take many iterations against the same scan range before achieving good performance if the OS buffer cache is initially cold on the tablet servers. The slow warmup of the buffer cache is exacerbated by the fact that remote scans in the default Impala config choose a tablet server at random from the replica candidates. The Kudu client supports a LEADER_ONLY option that provides hard affinity to the leader replica, and Impala allows this to be configured using the --pick_only_leaders_for_tests option, but this is currently considered a testing only option and by default Impala will connect to a random replica.
The following is a series of iterations of TPC-DS query 33 (times in seconds), against a freshly started Kudu cluster, in 3 configurations (1) local reads, with Impala running on Kudu cluster, (2) remote reads from separate Impala cluster with default config, (3) remote reads with pick_only_leaders_for_tests=true (LEADER_ONLY affinity)
|Config||Iteration 1||Iter 2||Iter 3||Iter 4||Iter 5||Iter 6||Iter 7||Iter 8||Iter 9|
|Remote (default config)||110.8||56.9||49.9||43.3||37.3||44.0||20.0||28.9||14.9|
With pick_only_leaders_for_tests, the remote performance improves quickly, approaching local performance on the second iteration and warming up fully by iteration 4. In the default config it takes 9 iterations of the query before we see the same performance.
Running similar experiments after explicitly dropping the buffer cache on the tablet servers confirmed that this slow warmup is caused by poor buffer cache hit rates until the cache is fully warm.
I suspect that slow warmup isn't the only consequence of this. Caching a given tablet in the buffer cache on multiple tablet servers increases the overall buffer cache footprint and will increase tserver memory pressure under load.
We should consider setting the LEADER_ONLY option by default for remote Kudu reads. The only concern would be that this might result in worse load balancing and hotspots, in which case Kudu might need to implement some additional connection option that provides a better mix of affinity and load balancing.