So if I were preparing an "executive summary", there would be several take-aways:
1> The number of update state operations, i.e. the number of times state is actually written to ZK, is drastically lower under heavy load: by a factor of almost 400!
2> One implication here is that the number of state change notifications that ZK has to send out, and thus the number of times the state gets read by Solr nodes, also decreases by that same factor. So the unchanged throughput for state-read operations should be evaluated in light of the fact that there will be many fewer of them.
3> One thing not captured by the numbers is that the size of the Overseer queue is much less likely to spin out of control, due to both <2> and the fact that we're reading/ordering/processing batches of up to 10,000 messages at once.
4> Even though some of the throughput numbers haven't changed (am_i_leader for instance), those operations will spend much less time waiting to be carried out due to 1-3. Plus, three points may make a circle, but they aren't enough data to make a good generalization from.
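To make the batching argument in 1-3 concrete, here's a toy sketch of why draining the queue in batches cuts the number of state writes. This is not Solr code: the function names, the 25,000-message backlog and the "one write per batch" model are all assumptions for illustration, and only the 10,000 batch limit comes from the discussion above.

```python
from collections import deque


def drain_per_message(queue):
    """Unbatched model: one (hypothetical) ZK state write per queued message."""
    writes = 0
    while queue:
        queue.popleft()  # apply one message to the in-memory state
        writes += 1      # ...and write the state out right away
    return writes


def drain_in_batches(queue, max_batch=10_000):
    """Batched model: apply up to max_batch messages, then write state once."""
    writes = 0
    while queue:
        for _ in range(min(max_batch, len(queue))):
            queue.popleft()  # apply message to the in-memory state only
        writes += 1          # a single state write covers the whole batch
    return writes


backlog = 25_000  # made-up backlog size, chosen only for the example
print(drain_per_message(deque(range(backlog))))  # 25000
print(drain_in_batches(deque(range(backlog))))   # 3
```

The actual reduction depends on how full the batches are under real load, so the ratio here shouldn't be read as a prediction of the measured factor.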
Is this fair? Accurate? Complete? I'm looking for something to present to those users who have seen the Overseer queue grow to the 100s of K, effectively making their cluster unusable.
Thanks for this work! As collections get larger and larger this has become a very significant pain-point.