At least in ICU, it's not completely safe. If the JVM instances differ in version (after an upgrade, etc.), it would be a shame to find your sorts all busted.
When comparing keys, it is important to know that both keys were generated by the same algorithms and weightings. Otherwise, identical strings with keys generated on two different dates, for example, might compare as unequal. Sort keys can be affected by new versions of ICU or its data tables, new sort key formats, or changes to the Collator.
http://www.icu-project.org/userguide/Collate_ServiceArchitecture.html

Three problems I can think of off the top of my head with attempting an automatically managed solution to the problem of CollationKey comparability:
So, it seems to me, either the user of this functionality has to manage the versioning external to the Lucene index, or they can't use the functionality. Would strong warnings in the javadocs be enough to allow people to take appropriate precautions?

One alternative is that the ICU implementation has versioning specifically for this purpose.
The version information of Collator is a 32-bit integer. If a new version of ICU has changes affecting the content of collation elements, the version information will be changed. In that case, to use the new version of ICU collator will require regenerating any saved or stored sort keys. However, since ICU 1.8.1. it is possible to build your program so that it uses more than one version of ICU. Therefore, you could use the current version for the features you need and use the older version for collation.
I agree with you on both points ... this is really just an extension of warning people to use compatible analyzers when indexing/querying. (I only brought it up in my first comment because I know very little about the internals of any Collator implementations out there, and I wasn't sure if all implementations produce keys that are only comparable between "same" instances.) As long as there are some implementations of Collator that produce keys which can be compared between "equivalent" instances, this feature certainly seems useful.

Robert Muir wrote:
I'll look into using RegexQuery as a model here (it enables use of either java.util.regex or Jakarta Regexp, defaulting to java.util.regex), and try to add CollatorCapable/CollatorCapabilities, so that ICU's Collator implementation will be usable. Hoss wrote:
I will add warnings about this issue to the javadocs. Modifications in this patch:
The external ICU4J dependency, which should be checked into contrib/collation/lib/, can be downloaded here: http://download.icu-project.org/files/icu4j/4.0/icu4j-4_0.jar

Could we, alternatively, push this change into DocumentsWriter, such that on writing a segment it uses a per-field Collator (FieldInfo would be extended to record this) to sort the terms dict?
I haven't fully thought through the tradeoffs... but it seems like this'd be simpler to use? I.e., rather than putting a CollationKeyFilter in your analyzer chain, and then doing the reverse of this for all searches at search time, you simply set the Collator on the fields (at indexing & searching time, since I agree we should for now not try to serialize into the index which field has which Collator)? I guess there is a performance cost to using the Collator to do live binary search (during searching) and sorting (during indexing) vs. doing Unicode String comparisons, but in practice at search time this is probably a tiny part of the net cost of searching?

Hi Mike,
Are you suggesting to not store collation keys in the index?
The query-time process in this patch is not the reverse - it is exactly the same. The String-encoded collation keys stored in the index are compared directly with those from query terms. Neither the String-encoding nor the CollationKey needs to be reversed.
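A minimal sketch of the idea described above, assuming java.text.Collator and a hypothetical hex encoding of the key bytes (this is NOT the patch's IndexableBinaryStringTools encoding, only a stand-in that preserves byte order). It shows that the same encoding step is applied at index time and at query time, and that plain String comparison of the encoded keys then agrees with the Collator's ordering.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

final class EncodedKeySketch {
    // Hypothetical encoding: hex digits of the collation key bytes.
    // CollationKey.toByteArray() is documented as most-significant-byte first,
    // so comparing the hex strings compares the keys.
    static String encodeKey(Collator collator, String term) {
        byte[] bytes = collator.getCollationKey(term).toByteArray();
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Orders terms by plain String comparison of their encoded keys.
    static List<String> sortByEncodedKey(Collator collator, List<String> terms) {
        List<String> sorted = new ArrayList<>(terms);
        sorted.sort(Comparator.comparing((String t) -> encodeKey(collator, t)));
        return sorted;
    }

    // Orders terms by calling the Collator directly, for comparison.
    static List<String> sortByCollator(Collator collator, List<String> terms) {
        List<String> sorted = new ArrayList<>(terms);
        sorted.sort(collator);
        return sorted;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        List<String> terms = List.of("cherry", "Banana", "apple");
        System.out.println(sortByEncodedKey(collator, terms));
        System.out.println(sortByCollator(collator, terms));
    }
}
```

Nothing is decoded at query time: the query term goes through encodeKey() just as the indexed terms did, and the two encoded strings are compared directly.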
In the current code base, for range searching on a collated field, every single term has to be collated with the search term. This patch allows skipTo to function when using collation, potentially providing a significant speedup.
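As a rough illustration of why this helps, the sketch below uses a TreeMap as a stand-in for the term dictionary (an assumption for illustration only; Lucene's actual term dictionary and skipTo are not modeled here). Keys are a hypothetical hex encoding of the collation key bytes. Because the encoded keys are already in collation order, a range can be located by seeking to its endpoints instead of running the Collator against every term.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.NavigableMap;
import java.util.TreeMap;

final class CollatedRangeSketch {
    // Hypothetical order-preserving encoding of the collation key bytes.
    static String encodeKey(Collator collator, String term) {
        StringBuilder sb = new StringBuilder();
        for (byte b : collator.getCollationKey(term).toByteArray()) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Stand-in "term dictionary": encoded key -> original term, sorted by key.
    static NavigableMap<String, String> buildDict(Collator collator, List<String> terms) {
        NavigableMap<String, String> dict = new TreeMap<>();
        for (String term : terms) {
            dict.put(encodeKey(collator, term), term);
        }
        return dict;
    }

    // Inclusive range lookup: seeks to the encoded endpoints in the sorted map.
    static List<String> range(Collator collator, NavigableMap<String, String> dict,
                              String lower, String upper) {
        return new ArrayList<>(dict.subMap(
                encodeKey(collator, lower), true,
                encodeKey(collator, upper), true).values());
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        NavigableMap<String, String> dict = buildDict(collator,
                List.of("apple", "Banana", "cherry", "Date", "elderberry"));
        System.out.println(range(collator, dict, "b", "d"));
    }
}
```

The original terms are kept as map values here only so the result is readable; in the patch's approach the index holds the encoded keys, and original text would have to be stored separately if needed.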
Right, I'm proposing storing the original Strings, but sorted
OK got it. Where/how would you implement the query time conversion of And wouldn't there be times when you also want to reverse the
Both the original proposed approach (external-to-indexing) and this Here are some pros of internal-to-indexing:
And some cons:
I'm sure there are many pros/cons I'm missing...
AFAIK, CollationKey generation is a one-way operation. If the original terms are required for presentation, they can be stored, right?
IndexableBinaryStringTools (
Perhaps I'm missing something, but o.a.l.index.TermEnum.skipTo(Term) compares the target term using String.compareTo(), so regardless of the index term dictionary ordering, skipTo() won't necessarily stop at the correct location, right? From TermEnum.java:

    public boolean skipTo(Term target) throws IOException {
      do {
        if (!next())
          return false;
      } while (target.compareTo(term()) > 0);
      return true;
    }

and here's o.a.l.index.Term.compareTo(Term):

    public final int compareTo(Term other) {
      if (field == other.field) // fields are interned
        return text.compareTo(other.text);
      else
        return field.compareTo(other.field);
    }
Oh OK. So having done this term conversion, you can't really look at / use the resulting terms in the index for human consumption (you'd have to store stuff yourself).
But we could just fix that to pay attention to the Collator for that field, if it has one, right? (Or with flexible indexing I think the impl really should own this method, ie, it should be abstract in TermEnum). I think the external approach is fine for starters... I just think long-term it may make sense to have core Lucene respect the Collator, but it really is an invasive change. We should wait until we make progress on flexible indexing at which point such a change should be far less costly.
Um, yes.
Now that I understand it, I too think the internal-to-indexing approach is cleaner/easier to use/better long-term. This patch is an attempt to improve on the performance of the range collation facilities introduced in Another use-case for allowing per-field custom sorting of Terms would be simpler numeric RangeQuery. Ie, right now you have to zero-pad numbers to trick Lucene into sorting them numerically (which causes challenges for BigDecimal, being discussed now on java-user). But if you could have Lucene sort by the number then numeric range queries would be straightforward.
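For reference, the zero-padding trick mentioned above can be sketched as follows. This is a minimal illustration for non-negative ints only; the fixed width of 10 and the pad() helper are assumptions made for the example, not anything from Lucene.

```java
final class ZeroPadSketch {
    // Assumed fixed width: 10 digits is enough for any non-negative int.
    static final int WIDTH = 10;

    // Left-pads with zeros so that String order matches numeric order.
    static String pad(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        return String.format("%0" + WIDTH + "d", n);
    }

    public static void main(String[] args) {
        // Unpadded: "9" sorts after "10" as a String.
        System.out.println("9".compareTo("10") > 0);
        // Padded: the String order agrees with the numeric order.
        System.out.println(pad(9).compareTo(pad(10)) < 0);
    }
}
```

Negative values and arbitrary-precision types such as BigDecimal need a more involved encoding, which is the difficulty the comment above refers to.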
Removed accidentally included IndexableBinaryString and its test from the patch (see
I think we should commit this to contrib/collation as an "external" way to get faster range filters on fields that require custom Collator; at some future point we can consider allowing a given field to sort its terms in some custom way.
Marvin: does KS/Lucy give control over sort order of the terms in a field? Steven, I'm hitting compilation errors, eg:
    [javac] /tango/mike/src/lucene.collation/contrib/collation/src/test/org/apache/lucene/collation/CollationTestBase.java:42: package org.apache.lucene.queryParser.analyzing does not exist
    [javac] import org.apache.lucene.queryParser.analyzing.AnalyzingQueryParser;
    [javac] ^
    [javac] /tango/mike/src/lucene.collation/contrib/collation/src/test/org/apache/lucene/collation/CollationTestBase.java:89: cannot find symbol

What is AnalyzingQueryParser?

It's in contrib/miscellaneous/
I used AnalyzingQueryParser in the tests to allow CollationKeyFilter to be applied to the terms in the range query - the standard QueryParser doesn't analyze range terms. From:
This is a (test-only) cross-contrib dependency. I'm not sure why I didn't have trouble with compilation - I haven't looked at this in months. I'll take a look later on tonight.

OK, thanks for the pointer – I learn something new every day!
New patch that compiles.
I'm not sure how this ever worked previously - I must somehow have had lucene-misc-X.jar on the classpath or something. Anyway, the build.xml in this patch, cribbing from contrib/benchmark/build.xml, first builds contrib/miscellaneous, then adds build/contrib/miscellaneous/classes/java/ to the classpath, so that AnalyzingQueryParser can be linked against. Everything now compiles, and all contrib tests pass.

Super, thanks Steven. I plan to commit soon.
"same" tends to have a very specific meaning in Java documentation ... it's usually used to indicate reference equality (ie "==" not .equals) ...
so the question becomes: did they really mean "same Collator" or did they mean "a Collator with the same rules"?
is it safe to persist a CollationKey from a Collator A and then compare it with a CollationKey from another Collator B where A.equals(B) but A != B (because A and B are from different JVM instances?)
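A small sketch of the in-JVM half of that question, assuming java.text.Collator (not ICU): two separately obtained Collator instances for the same locale and strength are compared with equals(), and keys from one are compared against keys from the other. This only demonstrates behavior inside a single JVM; it does not show anything about keys persisted by a different JVM or library version, which is the risky case raised in this thread.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

final class CollatorEqualitySketch {
    // Builds a fresh Collator with fixed settings (locale and strength are
    // arbitrary choices for this example).
    static Collator newCollator() {
        Collator c = Collator.getInstance(Locale.FRENCH);
        c.setStrength(Collator.TERTIARY);
        return c;
    }

    // Generates the first key with Collator A and the second with Collator B,
    // then compares the two keys directly.
    static int compareAcrossInstances(String s1, String s2) {
        Collator a = newCollator();
        Collator b = newCollator();
        CollationKey ka = a.getCollationKey(s1);
        CollationKey kb = b.getCollationKey(s2);
        return ka.compareTo(kb);
    }

    public static void main(String[] args) {
        System.out.println(newCollator().equals(newCollator()));
        System.out.println(compareAcrossInstances("côte", "côte"));
        System.out.println(compareAcrossInstances("a", "b"));
    }
}
```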