Are you suggesting to not store collation keys in the index?
Right, I'm proposing storing the original Strings, but sorted
according to Collator.compare (for that one field), in the Terms dict.
The query-time process in this patch is not the reverse - it is exactly the same.
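The proposed ordering can be sketched like this (class and method names are mine, purely for illustration): the original Strings are kept, but sorted with Collator.compare for the field's locale rather than String.compareTo.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class CollatedTerms {
    // Sort terms the way the proposal would order them in the Terms dict:
    // the original Strings, ordered by Collator.compare for the field's locale.
    public static List<String> sortByCollator(List<String> terms, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(terms);
        sorted.sort(collator::compare);
        return sorted;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("resumes", "résumé", "resume");
        // Collator order places the accented form between the plain forms,
        // where raw String.compareTo would push "résumé" to the very end.
        System.out.println(sortByCollator(terms, Locale.US));
        // → [resume, résumé, resumes]
    }
}
```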
OK got it. Where/how would you implement the query-time conversion of
the query terms?
And wouldn't there be times when you also want to reverse the
encoding? E.g., if you enumerate all terms for presentation (maybe as
part of faceted search)?
In the current code base, range searching on a collated field must collate every single term against the search term. This patch allows skipTo to function when using collation, potentially providing a significant speedup.
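A tiny sketch of why precomputed collation keys help here: once each term's key is byte-comparable, range checks become cheap byte comparisons instead of a Collator.compare call per visited term (plain java.text; nothing Lucene-specific is assumed):

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

public class KeyCompare {
    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.US);
        // Keys are computed once per term; after that, comparisons are
        // byte-wise, so an enumeration over keys can seek (skipTo) to the
        // range start instead of re-collating every term it passes.
        CollationKey lo = collator.getCollationKey("resume");
        CollationKey hi = collator.getCollationKey("resumes");
        System.out.println(lo.compareTo(hi) < 0); // true
    }
}
```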
Both the original proposed approach (external-to-indexing) and this
internal-to-indexing approach would solve this, right? I.e., in both
cases the terms have been sorted according to the Collator, but in the
internal-to-indexing case it's the original String term that's stored
in the terms dict.
Here are some pros of internal-to-indexing:
- You don't have to convert every single term visited during
analysis first to a CollationKey, then a ByteBuffer, then an
encoded binary string. Indexing throughput should be faster?
(Though, when writing the segment you do need to sort using
Collator.compare, which I guess could be slow.)
- Real terms are stored in the index – tools like Luke can look at
the index and see normal-looking terms. Though... I don't have a
sense of what the encoded term would look like – maybe it's not
that different from the original in practice?
- Querying would just work without term conversion
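The per-term conversion pipeline from the first pro above can be sketched roughly like this. The hex encoding is only a stand-in for Lucene's actual binary-to-String encoding, and the class name is made up:

```java
import java.text.Collator;
import java.util.Locale;

public class KeyedTerm {
    // External-to-indexing: each analyzed term is replaced by an encoded
    // collation key, so the index stores opaque key strings, not real terms.
    public static String encode(String term, Collator collator) {
        byte[] key = collator.getCollationKey(term).toByteArray();
        StringBuilder sb = new StringBuilder(key.length * 2);
        for (byte b : key) {
            // Hex stand-in for the real binary encoding; fixed-width per
            // byte, so lexicographic String order matches key byte order.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.US);
        // The stored term bears little resemblance to the original,
        // which is why Luke would show unreadable terms.
        System.out.println(encode("résumé", collator));
    }
}
```

Since the encoding preserves the keys' byte order, String.compareTo on the encoded terms agrees with Collator.compare on the originals, which is what lets the existing lexicographic term dictionary work unchanged.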
And some cons:
- It's obviously a more invasive change to Lucene (and probably
should go after the flex-indexing changes). The
external-to-indexing approach is nicely externalized.
- Performance – the binary search of the terms index would be
slower using Collator.compare instead of String.compareTo (though
I would expect this to be minimal in practice).
I'm sure there are many pros/cons I'm missing...