Dividing by shard count is fairly risky. Take a request for the top-3 terms with mincount=10 across two shards, so that each shard is asked with mincount=10/2=5:
Shard 1: A(9) B(6) C(10) D(8)
Shard 2: A(4) B(5) C(4) D(3)
Requesting the top-3 elements with mincount=5 from each shard gives the merged result
B(11) C(10)
where the correct result would be
A(13) B(11) C(14)
since A(9) and D(8) never reach the per-shard threshold on Shard 2, and C's count from Shard 2 is lost.
The problem with using mincount=1 for each shard call is of course that the per-shard result sets need to be humongous in order to ensure that the correct values are returned when the field contains many values with low counts and few values with high counts. With shards like
Shard 1: A(1) B(1) C(1) D(1) E(1) F(9) G(1) H(1)
Shard 2: A(1) B(1) C(1) D(1) E(1) F(1) G(1) H(10)
and a request for mincount=10, all terms must be returned from both shards to get the correct result: only F(10) and H(11) reach the threshold, and F only does so because of its count of 1 on Shard 2.
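To make the failure mode concrete, here is a small simulation (a sketch, not actual Solr code; the function and parameter names are made up to mirror the examples above) of merging per-shard facet results when each shard filters at a lowered mincount:

```python
from collections import Counter

def facet_merge(shards, limit, mincount, shard_mincount):
    """Naive distributed facet merge: each shard only returns terms
    whose local count reaches shard_mincount; the merger sums the
    surviving counts, applies the real mincount and sorts lexically."""
    merged = Counter()
    for shard in shards:
        for term, count in shard.items():
            if count >= shard_mincount:  # terms below this are silently lost
                merged[term] += count
    result = [(t, c) for t, c in sorted(merged.items()) if c >= mincount]
    return result[:limit]

def exact_facets(shards, limit, mincount):
    """Ground truth: merge everything first, then filter and cut."""
    return facet_merge(shards, limit, mincount, shard_mincount=0)

# First example: mincount=10 divided by 2 shards gives 5 per shard.
shards = [
    {"A": 9, "B": 6, "C": 10, "D": 8},
    {"A": 4, "B": 5, "C": 4, "D": 3},
]
print(facet_merge(shards, limit=3, mincount=10, shard_mincount=5))
# [('B', 11), ('C', 10)] -- A and D are dropped, C is under-counted
print(exact_facets(shards, limit=3, mincount=10))
# [('A', 13), ('B', 11), ('C', 14)]

# Second example: the spike on each shard hides the other shard's ones.
shards2 = [
    {"A": 1, "B": 1, "C": 1, "D": 1, "E": 1, "F": 9, "G": 1, "H": 1},
    {"A": 1, "B": 1, "C": 1, "D": 1, "E": 1, "F": 1, "G": 1, "H": 10},
]
print(facet_merge(shards2, limit=3, mincount=10, shard_mincount=5))
# [('H', 10)] -- F(10) is lost and H is under-counted
print(exact_facets(shards2, limit=3, mincount=10))
# [('F', 10), ('H', 11)]
```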
As you point out, Yonik, a variant of the problem exists when sorting by count. For count, however, it is mitigated by the fact that each shard's results are sorted by the selecting key (the count itself). This means that the chance of missing or miscounting tags is low and can be lowered further with relatively little over-requesting.
With lexical sorting, the selecting key (again the count) is independent of the sorting key. Over-requesting helps, but only linearly in the fraction of each shard's full result set that is requested. Furthermore, the need for over-requesting grows with the number of shards: with S shards, a term can have a count as low as ceil(mincount/S) on every shard and still sum to mincount, so the overlapping hills can be smaller while still adding up to the threshold.
I do not have any real solution for the problem. One minor improvement would be a collector that kept collecting terms with mincount=y until either limit=n was reached or the number of collected terms with mincount=x equalled m, where x is the original mincount and y depends on the number of shards. This would at least stop the collection process once the result set was guaranteed to contain enough values above the given threshold. It would work well with spikes but poorly with hills just below mincount x, and it would still only guarantee correctness of the terms, not correctness of the summed counts.
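That early-termination collector could be sketched roughly like this (plain Python, not a real Solr collector; the names x, y, n and m are just the symbols from the paragraph above):

```python
def collect_terms(term_counts, n, m, x, y):
    """Collect terms in index (lexical) order from one shard.

    term_counts: iterable of (term, count) pairs in lexical order.
    Keep every term whose count >= y (the lowered per-shard threshold),
    but stop as soon as either n terms have been collected (the limit)
    or m of them already satisfy the original mincount x, since the
    merged result is then guaranteed to hold enough qualifying terms.
    """
    collected = []
    above_x = 0
    for term, count in term_counts:
        if count < y:
            continue
        collected.append((term, count))
        if count >= x:
            above_x += 1
        if len(collected) >= n or above_x >= m:
            break
    return collected

# With x=10, y=5 and m=1, collection stops as soon as one term is
# guaranteed to pass the original mincount on this shard alone:
print(collect_terms([("A", 9), ("B", 6), ("C", 10), ("D", 8)],
                    n=10, m=1, x=10, y=5))
# [('A', 9), ('B', 6), ('C', 10)] -- D is never visited
```

As noted above, this stops early for spikes but keeps collecting through hills of counts between y and x, and the counts of the collected terms can still be wrong after merging.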