Thanks Mike for testing. The flush times look better indeed!
Might you give us some advice on when to use this sort implementation?
I think radix sort is usually appealing because its complexity degrades more gracefully as the number of entries to sort increases. For keys that have a maximum length of k, its complexity is O(n*k), while a comparison-based sort has a complexity of O(n*k*log(n)). So this change should make things even better with larger RAM buffers (in addition to the fact that larger RAM buffers mean less merging).
Like Yonik mentioned, an adversarial case is when there are long common prefixes (because of the k parameter). The implementation has some protection against it though: it forces a fallback to introsort after a given number of levels of recursion (currently 19, needs to be tuned), and it recurses directly when all values fall into the same bucket for the k-th byte. Note that comparison-based sorts have the same adversarial case since the comparisons need to scan these common prefixes as well, but it is less annoying in practice for BytesRefHash.sort() since the bottleneck is not the comparison of values but 'get'ting the values to compare from the BytesRefHash (because of the random access pattern). Radix sort needs to call 'get' about 2*n*k times while introsort needs to call it about n*log(n) times, which is why this fallback to introsort is still useful.
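To make the two protections above concrete, here is a minimal sketch of an MSB radix sort with a depth cut-off. It is not the patch itself: it sorts plain byte[] keys instead of BytesRefHash ids, the class and method names are hypothetical, and Arrays.sort with an unsigned byte comparison stands in for introsort. In this sketch, skipping a byte that all values share still counts as one level, which is an assumption.

```java
import java.util.Arrays;

// Hypothetical sketch: MSB radix sort over byte[] keys with a comparison-sort fallback.
final class MsbRadixSortSketch {
    // Cut-off taken from the comment above (19, not tuned).
    static final int MAX_LEVELS = 19;
    // Bucket 0 means "key has ended"; buckets 1..256 are the unsigned byte value + 1.
    private static final int BUCKETS = 257;

    static void sort(byte[][] keys) {
        radixSort(keys, 0, keys.length, 0, new byte[keys.length][]);
    }

    private static int bucket(byte[] key, int k) {
        return k < key.length ? (key[k] & 0xFF) + 1 : 0;
    }

    private static void radixSort(byte[][] keys, int from, int to, int k, byte[][] scratch) {
        if (to - from < 2) {
            return;
        }
        if (k >= MAX_LEVELS) {
            // Too deep: fall back to a comparison-based sort (stand-in for introsort).
            Arrays.sort(keys, from, to, (a, b) -> Arrays.compareUnsigned(a, b));
            return;
        }
        // First read of every value at this level: build the histogram.
        int[] counts = new int[BUCKETS];
        for (int i = from; i < to; i++) {
            counts[bucket(keys[i], k)]++;
        }
        int first = bucket(keys[from], k);
        if (counts[first] == to - from) {
            // All values share the k-th byte: skip the reorder step and move to the next byte.
            // If that bucket is 0, all keys have ended and are equal.
            if (first != 0) {
                radixSort(keys, from, to, k + 1, scratch);
            }
            return;
        }
        int[] starts = new int[BUCKETS];
        for (int b = 1; b < BUCKETS; b++) {
            starts[b] = starts[b - 1] + counts[b - 1];
        }
        // Second read of every value at this level: distribute into buckets.
        int[] next = starts.clone();
        for (int i = from; i < to; i++) {
            scratch[next[bucket(keys[i], k)]++] = keys[i];
        }
        System.arraycopy(scratch, 0, keys, from, to - from);
        // Recurse on the next byte inside each bucket; bucket 0 holds equal keys already.
        for (int b = 1; b < BUCKETS; b++) {
            if (counts[b] > 1) {
                int start = from + starts[b];
                radixSort(keys, start, start + counts[b], k + 1, scratch);
            }
        }
    }
}
```

Calling MsbRadixSortSketch.sort(keys) orders the keys by unsigned byte value, with a key sorting before any longer key that shares its prefix. The two passes per level are where the "about 2*n*k" reads come from.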
By the way, I forgot to mention an important implementation detail of this impl: since the bottleneck is to get values from the BytesRefHash, I added a cache for the first 3 buckets. This helps perform the first 3 levels of recursion by getting every value once instead of 2*3=6 times when operating normally. I measured that this cache accounts for about 1/3 of the speedup, but it has the drawback of requiring an int[] of size n. I think it is fine though since this cache should be much smaller than the BytesRefHash itself.
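As an illustration of the caching idea, here is one way the bucket values of the first 3 bytes could be packed into a single int per entry. This is only a sketch under assumptions: the 9-bits-per-bucket layout, the names, and the values[ids[i]] lookup (standing in for 'get' on the BytesRefHash) are all hypothetical and may differ from what the patch does.

```java
// Hypothetical sketch: cache the buckets of the first 3 bytes of each value in one int.
final class PrefixCacheSketch {
    static final int CACHED_LEVELS = 3;

    // Bucket for byte k: 0 if the key has ended, otherwise unsigned byte + 1 (range 0..256).
    static int bucket(byte[] key, int k) {
        return k < key.length ? (key[k] & 0xFF) + 1 : 0;
    }

    // Packs the buckets of bytes 0..2 into the low 27 bits of an int, 9 bits per bucket.
    static int pack(byte[] key) {
        int packed = 0;
        for (int k = 0; k < CACHED_LEVELS; k++) {
            packed = (packed << 9) | bucket(key, k);
        }
        return packed;
    }

    // Reads the bucket for level k (0..2) back from a packed entry, without touching the value.
    static int cachedBucket(int packed, int k) {
        return (packed >>> (9 * (CACHED_LEVELS - 1 - k))) & 0x1FF;
    }

    // One expensive lookup per entry; values[ids[i]] stands in for the random-access 'get'.
    static int[] buildCache(byte[][] values, int[] ids) {
        int[] cache = new int[ids.length];
        for (int i = 0; i < ids.length; i++) {
            cache[i] = pack(values[ids[i]]);
        }
        return cache;
    }
}
```

With such a cache, the histogram and reorder passes of the first 3 levels can call cachedBucket(cache[i], k) instead of fetching the value, as long as the cache entries are moved together with the ids they describe.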
Would it make sense for an implementation to extend o.a.l.u.Sorter?
I can try, but it could not be as generic as the current Sorter impls, which mostly need only a comparison function. To keep it efficient we would probably have to enforce a BytesRef representation of the keys.