Maybe these utility methods could be moved to oal.util.BitUtil where we already have some utility methods to count or find bits?
I kept them separate to have a clear reference to the Vigna broadword article.
The class javadoc already refers to all methods and constants from there, so that could be easily merged into BitUtil's class javadoc when merging the two classes.
Also both classes have only static attributes.
So merging is no problem at all, but I'm still in two minds about this.
Any opinions either way, i.e. merging class BroadWord here into the existing BitUtil or not?
I ran quick benchmarks for select and select9 seemed to quickly outperform selectNaive, even for non-dense longs. Do you have the same experience with selectNaive vs. select9? If yes, maybe selectNaive could be moved to the test case.
In the Quasi Succinct Indices article (EliasFanoDocIdSet,
LUCENE-5084), in Section 9 under Reading Unary Codes, Vigna mentions that within the right longword:
Our experiments show that broadword bit search is extremely effective, unless the number of reads is very small, as in that case computing iteratively the least significant bit becomes competitive. Indeed, when skipping a very small number of position (e.g., less then eight) we simply resort to iterating through the list.
In the test code there is a comment that select9 is about 7 times faster than selectNaive, 45ns against 295 ns for a very dense case on my old 32 bit home machine. Since the machine is 32 bits, that is no more than a rough indication that the selectNaive here (that iterates) might indeed preferable over the select9 here when selecting before the 8th set bit.
A safe conclusion is that moving selectNaive to the test cases now would be premature.
Since the main goal of rank9 is code documentation, maybe it should be made package-private?
I have not actually benchmarked rank9 it against Long.bitCount, but I think we should do that just to be sure that rank9 is slower, and than it can be made package-private.
As for longHex – I always liked the "assembly" version better.
I have used that mostly during debugging, and it ended up also in toString in EliasFanoEncoder iirc.
How about putting the assembly version in BitUtil?
Could we extend LuceneTestCase for tests, please?
I don't really like it, but my next patch here will have that.
Should LuceneTestCase also be mentioned in the wiki at "How to contribute"?