Details
Description
Switching DocIdSetBuilder to radix sort helped make things significantly faster. We should be able to do the same with BytesRefHash.sort()?

 ByteBlockListSorter.java
 9 kB
 Dawid Weiss

 LUCENE7299.patch
 12 kB
 Adrien Grand

 LUCENE7299.patch
 13 kB
 Adrien Grand
Activity
 All
 Comments
 Work Log
 History
 Activity
 Transitions
Cool! Have you tried it w/ real docs, e.g. wikipedia? I can test it ...
Cool indeed! Might you give us some advise on when to use this sort implementation? The Wikipedia page on radix sort is an interesting read. So this is just for integers. Some more javadocs and implementation comments within LSBRadixSorter would be helpful. Would it make sense for an implementation to extend o.a.l.u.Sorter?
I think there is likely a real speedup here ... 3 runs each on trunk (before) vs patch (after), indexing full wikipedia:
mike@beast2:/l$ grep finished trunk/lucene/before?.log trunk/lucene/before1.log:Indexer: finished (126265 msec), excluding commit trunk/lucene/before2.log:Indexer: finished (124435 msec), excluding commit trunk/lucene/before3.log:Indexer: finished (122559 msec), excluding commit mike@beast2:/l$ grep finished radix/lucene/after?.log radix/lucene/after1.log:Indexer: finished (116234 msec), excluding commit radix/lucene/after2.log:Indexer: finished (120537 msec), excluding commit radix/lucene/after3.log:Indexer: finished (121848 msec), excluding commit
And when I look specifically at time to flush postings (first 30 segments flushed):
Before:
IW 0 [20160524T14:54:59.531Z; Index #0]: 3947 msec to write postings and finish vectors IW 0 [20160524T14:55:00.087Z; Index #10]: 3719 msec to write postings and finish vectors IW 0 [20160524T14:55:00.825Z; Index #16]: 3878 msec to write postings and finish vectors IW 0 [20160524T14:55:01.777Z; Index #12]: 3970 msec to write postings and finish vectors IW 0 [20160524T14:55:02.792Z; Index #13]: 3943 msec to write postings and finish vectors IW 0 [20160524T14:55:03.121Z; Index #5]: 3514 msec to write postings and finish vectors IW 0 [20160524T14:55:04.591Z; Index #19]: 3611 msec to write postings and finish vectors IW 0 [20160524T14:55:06.264Z; Index #15]: 4315 msec to write postings and finish vectors IW 0 [20160524T14:55:07.184Z; Index #3]: 4404 msec to write postings and finish vectors IW 0 [20160524T14:55:08.093Z; Index #21]: 4408 msec to write postings and finish vectors IW 0 [20160524T14:55:09.199Z; Index #23]: 4812 msec to write postings and finish vectors IW 0 [20160524T14:55:10.735Z; Index #17]: 5275 msec to write postings and finish vectors IW 0 [20160524T14:55:12.325Z; Index #4]: 5796 msec to write postings and finish vectors IW 0 [20160524T14:55:13.888Z; Index #9]: 6017 msec to write postings and finish vectors IW 0 [20160524T14:55:14.873Z; Index #3]: 5742 msec to write postings and finish vectors IW 0 [20160524T14:55:15.738Z; Index #10]: 5394 msec to write postings and finish vectors IW 0 [20160524T14:55:18.286Z; Index #7]: 6125 msec to write postings and finish vectors IW 0 [20160524T14:55:20.156Z; Index #2]: 6424 msec to write postings and finish vectors IW 0 [20160524T14:55:21.925Z; Index #20]: 6640 msec to write postings and finish vectors IW 0 [20160524T14:55:23.447Z; Index #12]: 6827 msec to write postings and finish vectors IW 0 [20160524T14:55:25.251Z; Index #21]: 6977 msec to write postings and finish vectors IW 0 [20160524T14:55:27.426Z; Index #15]: 7459 msec to write postings and finish vectors IW 0 [20160524T14:55:29.341Z; Index #16]: 7550 msec to write postings and finish vectors IW 0 [20160524T14:55:30.548Z; Index #11]: 6981 msec to write postings and finish vectors IW 0 [20160524T14:55:30.973Z; Index #14]: 5968 msec to write postings and finish vectors IW 0 [20160524T14:55:32.538Z; Index #5]: 6209 msec to write postings and finish vectors IW 0 [20160524T14:55:33.962Z; Index #21]: 6256 msec to write postings and finish vectors IW 0 [20160524T14:55:35.243Z; Index #1]: 6219 msec to write postings and finish vectors IW 0 [20160524T14:55:36.786Z; Index #7]: 6089 msec to write postings and finish vectors IW 0 [20160524T14:55:38.149Z; Index #16]: 5999 msec to write postings and finish vectors
after:
IW 0 [20160524T14:40:16.296Z; Index #8]: 2706 msec to write postings and finish vectors IW 0 [20160524T14:40:17.174Z; Index #19]: 2977 msec to write postings and finish vectors IW 0 [20160524T14:40:17.651Z; Index #3]: 2717 msec to write postings and finish vectors IW 0 [20160524T14:40:18.731Z; Index #10]: 3010 msec to write postings and finish vectors IW 0 [20160524T14:40:19.624Z; Index #6]: 2975 msec to write postings and finish vectors IW 0 [20160524T14:40:20.264Z; Index #9]: 2770 msec to write postings and finish vectors IW 0 [20160524T14:40:21.297Z; Index #3]: 2559 msec to write postings and finish vectors IW 0 [20160524T14:40:22.598Z; Index #5]: 2958 msec to write postings and finish vectors IW 0 [20160524T14:40:23.757Z; Index #19]: 3046 msec to write postings and finish vectors IW 0 [20160524T14:40:24.554Z; Index #22]: 3098 msec to write postings and finish vectors IW 0 [20160524T14:40:25.727Z; Index #9]: 3412 msec to write postings and finish vectors IW 0 [20160524T14:40:26.980Z; Index #14]: 3728 msec to write postings and finish vectors IW 0 [20160524T14:40:28.440Z; Index #1]: 4251 msec to write postings and finish vectors IW 0 [20160524T14:40:29.723Z; Index #12]: 4361 msec to write postings and finish vectors IW 0 [20160524T14:40:31.040Z; Index #8]: 4516 msec to write postings and finish vectors IW 0 [20160524T14:40:31.935Z; Index #2]: 4107 msec to write postings and finish vectors IW 0 [20160524T14:40:33.631Z; Index #20]: 4414 msec to write postings and finish vectors IW 0 [20160524T14:40:34.688Z; Index #4]: 4165 msec to write postings and finish vectors IW 0 [20160524T14:40:36.060Z; Index #18]: 4158 msec to write postings and finish vectors IW 0 [20160524T14:40:37.625Z; Index #23]: 4287 msec to write postings and finish vectors IW 0 [20160524T14:40:39.622Z; Index #7]: 4861 msec to write postings and finish vectors IW 0 [20160524T14:40:41.147Z; Index #0]: 4976 msec to write postings and finish vectors IW 0 [20160524T14:40:42.833Z; Index #8]: 5074 msec to write postings and finish vectors IW 0 [20160524T14:40:44.399Z; Index #21]: 5254 msec to write postings and finish vectors IW 0 [20160524T14:40:44.734Z; Index #14]: 3862 msec to write postings and finish vectors IW 0 [20160524T14:40:46.047Z; Index #0]: 4107 msec to write postings and finish vectors IW 0 [20160524T14:40:47.227Z; Index #2]: 4151 msec to write postings and finish vectors IW 0 [20160524T14:40:48.884Z; Index #20]: 4571 msec to write postings and finish vectors IW 0 [20160524T14:40:49.918Z; Index #9]: 4350 msec to write postings and finish vectors IW 0 [20160524T14:40:51.088Z; Index #13]: 4221 msec to write postings and finish vectors
So this is just for integers.
No, MSB variations are fine for strings... you go bytebybyte, just as you would for integers, with each byte determining the appropriate bucket.
I imagine this would have a weakness for large common prefixes. Random strings may be the best case since it equally distributed between buckets (hence does a maximum amount of work on each pass).
edit: I just peeked at Adrien's code, and the commonprefix thing should be handled relatively well:
+ // shortcircuit: if all keys have the same byte at offset k, then recurse directly
Thanks Mike for testing. The flush times look better indeed!
Might you give us some advise on when to use this sort implementation?
I think radix sort is usually appealing as the complexity degrades more gracefully when the number of entries to sort increases. For keys that have a maximum length of k, its complexity is O(n*k) while a comparisonbased sort has a complexity of O(n*k*log). So this change should make things even better with larger ram buffers (in addition to the fact that larger ram buffers mean less merging).
Like Yonik mentioned, an adversary case is when there are long common prefixes (because of the k parameter). The implementation has some protection against it though by forcing the fall back to intro sort after a given number of levels of recursion (currently 19, needs to be tuned) and recursing directly when all values fall into the same bucket for the kth byte. Note that comparisonbased sorts have the same adversary case since the comparisons need to scan these common prefixes as well, but it is less annoying in practice for BytesRefHash.sort() since the bottleneck is not the comparisons of values but 'get'ting the values to compare from the BytesRefHash (because of the random access pattern). Radix sort needs to call 'get' about 2*n*k times while introsort needs to call it about n*log times, which is why this fall back to introsort is still useful.
By the way, I forgot to mention an important implementation detail of this impl: since the bottleneck is to get values from the BytesRefHash, I added a cache for the first 3 buckets. This helps perform the first 3 levels of recursion by getting every value once instead of 2*3=6 when operating normally. I measured that this cache accounts for about 1/3 of the speedup, but it has the drawback of requiring an int[] of size n. I think it is fine though since this cache should be much smaller than the BytesRefHash itself.
Would it make sense for an implementation to extend o.a.l.u.Sorter
I can try but it could not be as generic as the current Sorter impls that mostly need a comparison function. To keep it efficient we would probably have to enforce a BytesRef representation of the keys.
We have an internal radix sort for sorting (largeish) key buffers (I attach it and hereby put it in the public domain). It was always a gain over anything else we tried.
As for common prefixes – the nice part about it is that when you're descending into same prefix blocks you can disregard those prefixes in comparisons (knowing they have to be equal). We thus "pushed down" the comparison routine to byte block buffers – see reference to compareAssumingEqualPrefix inside the code. There is also a hook inside byte block list to allow you to retrieve a single byte at a given offset so there's no need to copy keys over and over again (.byteAt).
A simple insertion sort at the "leaf" level was always better than trying to stop radix sort earlier. This result was observed was for realistic keys, not random distributions.
All these optimizations yielded significant speed boost for us, perhaps they'll be an inspiration of some sort, Adrien.
Thanks Dawid for sharing your implementation and experience! It looks very similar to the patch except the redistribution logic and the fact that your imp has the ability to parallelize in a ForkJoinPool. I tried to replace the redistribution logic out of curiosity but performance was the same.
when you're descending into same prefix blocks you can disregard those prefixes in comparisons
The patch already does this, I agree this is an important optimization.
There is also a hook inside byte block list to allow you to retrieve a single byte at a given offset so there's no need to copy keys over and over again (.byteAt).
I think it should be fine with BytesRefHash since it just returns a BytesRef that points to an internal structure rather than copying bytes. Adding a byteAt method might help further optimize it but I'd rather not have to add APIs to BytesRefHash for now.
Updated patch. It makes the impl extend Sorter like David suggested and removes the cache (so it is a bit slower than the previous patch) to keep things simpler/safer for a first iteration.
No problem at all, Adrien. Sorry for not reading your patch in detail – I just skimmed through quickly since it's something we had worked on before.
[...] except the redistribution logic and the fact that your imp has the ability to parallelize in a ForkJoinPool. I tried to replace the redistribution logic out of curiosity but performance was the same.
Both are needed because our sorted sets are much, much larger than typical Lucene buffers. When you have millions of (smallish) entries to sort the redistribution index took a lot of extra space – it was never a performance win, it was a memory conservative strategy.
I think it should be fine with BytesRefHash since it just returns a BytesRef that points to an internal structure rather than copying bytes.
Again, this was a significant performance boost in our case because of the size of structures we sort – we also didn't copy the content of strings, but even filling in the pointer and length in a reused "pointerlike" class (much like BytesRef) was quite costly. There is also a related issue of avoiding extra allocations in BytesRefHash that I filed a while ago – if you're working on that piece of code you may be interested in looking at it (LUCENE5854).
Thanks for the pointer. Given that BytesRefHash seems to be an indexing bottleneck, I'll add it to my list of things to look at.
Commit 1f105ed9bf3f0c16b432a475fb5553e16a4e46c9 in lucenesolr's branch refs/heads/branch_6x from Adrien Grand
[ https://gitwipus.apache.org/repos/asf?p=lucenesolr.git;h=1f105ed ]
LUCENE7299: Speed up BytesRefHash.sort.
Commit be597a81ef516a8630f4488c011ddf66dda3c771 in lucenesolr's branch refs/heads/master from Adrien Grand
[ https://gitwipus.apache.org/repos/asf?p=lucenesolr.git;h=be597a8 ]
LUCENE7299: Speed up BytesRefHash.sort.
In the patch you committed, I noticed the inner subclass implementation of IntroSorter has a swap() that simply calls StringMSBRadixSorter.this.swap(). That got me thinking, couldn't compare() do likewise? Just a minor point of course.
An observation here: StringMSBRadixSorter is a radix sort that falls back on IntroSorter which is in turn a quick sort that falls back on either insertionSort or heapSort. Wow... maybe we need to work in some more Or maybe reconsider if fewer is better or when we switch. For example StringMSBRadixSorter switches to IntroSort to cap the recursion levels (currently at 8), and for other reason too granted, but IntroSort uses recursion too. The Sorter.THRESHOLD constant confused me temporarily; perhaps "INSERTION_SORT_THRESHOLD" would be less ambiguous in a class hierarchy of sorters with various thresholds.
That got me thinking, couldn't compare() do likewise?
compare() is not needed for radix sort so I think it is less confusing to have it only for the introsort fallback?
Or maybe reconsider if fewer is better or when we switch
In general abstractions tend to make things slower but in that case I think we are fine since radix sort would still do most of the work. The cap on the recursion level does not aim at keeping call stacks small but at switching to introsort when there are long common prefixes, since introsort performed better in that case in my experiments. I will add a comment about it.
The Sorter.THRESHOLD constant confused me temporarily; perhaps "INSERTION_SORT_THRESHOLD" would be less ambiguous in a class hierarchy of sorters with various thresholds.
+1 to rename
Here is a patch. I ran a simple benchmark with 10M docs, a single thread, a single indexed field and a largish ram buffer of 500 MB. On random binary keys of length 16, indexing speed improved by 37%. On random ascii keys whose length is between 0 and 16, indexing speed improved by 34%.
This is not a realistic speedup since there is no merging involved (the ram buffer can hold everything), no stored fields, no doc values, etc.  but I think this is still interesting to be able to generate the inverted index more quickly for flushed segments. There will probably be a noticeable speedup with some setups that make heavy use of the inverted index of highcardinality fields with large ram buffers.