Lucene - Core
LUCENE-1435

CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4
    • Fix Version/s: 2.9
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Converts each token into its CollationKey using the provided collator, and then encodes the CollationKey with IndexableBinaryStringTools, to allow it to be stored as an index term.

      This will allow for efficient range searches and Sorts over fields that need collation for proper ordering.
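      The idea can be sketched with the JDK's own java.text.Collator (a hypothetical illustration, not the patch's code): sorting by CollationKeys gives locale-correct term ordering where raw String comparison does not.

      ```java
      import java.text.CollationKey;
      import java.text.Collator;
      import java.util.ArrayList;
      import java.util.Arrays;
      import java.util.List;
      import java.util.Locale;

      public class CollationKeyDemo {
          // Sort terms by their CollationKeys rather than by raw String order.
          public static List<String> sortWithCollator(List<String> terms, Locale locale) {
              Collator collator = Collator.getInstance(locale);
              List<String> sorted = new ArrayList<>(terms);
              sorted.sort((x, y) -> {
                  CollationKey kx = collator.getCollationKey(x);
                  CollationKey ky = collator.getCollationKey(y);
                  return kx.compareTo(ky);
              });
              return sorted;
          }

          public static void main(String[] args) {
              // Raw String order puts "étude" after "zebra"; French collation
              // places it between "apple" and "zebra".
              System.out.println(sortWithCollator(
                      Arrays.asList("zebra", "étude", "apple"), Locale.FRENCH));
          }
      }
      ```

      Encoding those key bytes with IndexableBinaryStringTools is what lets them be stored as ordinary index terms whose String order matches the key order.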

      1. LUCENE-1435.patch
        19 kB
        Steve Rowe
      2. LUCENE-1435.patch
        79 kB
        Steve Rowe
      3. LUCENE-1435.patch
        56 kB
        Steve Rowe
      4. LUCENE-1435.patch
        56 kB
        Steve Rowe

        Issue Links

          Activity

          Hoss Man added a comment -

          The one worry i have about an approach like this comes from the fine print of the CollationKey docs...

          You can only compare CollationKeys generated from the same Collator object.

          "same" tends to have a very specific meaning in Java documentation ... it's usually used to indicate reference equality (ie "==" not .equals) ...

          The equals method for class Object implements the most discriminating possible equivalence relation on objects; that is, for any non-null reference values x and y, this method returns true if and only if x and y refer to the same object (x == y has the value true).

          so the question becomes: did they really mean "same Collator" or did they mean "a Collator with the same rules"?

          is it safe to persist a CollationKey from a Collator A and then compare it with a CollationKey from another Collator B where A.equals(B) but A != B (because A and B are from different JVM instances)?
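          A quick in-JVM check (a hypothetical sketch, not from the patch) shows that two independently obtained Collator instances for the same Locale are equals() and produce byte-identical keys; note this says nothing about cross-JVM or cross-version stability, which is the open question above.

          ```java
          import java.text.Collator;
          import java.util.Arrays;
          import java.util.Locale;

          public class SameCollatorQuestion {
              // Returns true when two independently obtained Collators for the
              // same Locale are equals() and produce byte-identical
              // CollationKeys for s.
              public static boolean keysMatchAcrossInstances(String s, Locale locale) {
                  Collator a = Collator.getInstance(locale);
                  Collator b = Collator.getInstance(locale);
                  byte[] ka = a.getCollationKey(s).toByteArray();
                  byte[] kb = b.getCollationKey(s).toByteArray();
                  return a.equals(b) && Arrays.equals(ka, kb);
              }

              public static void main(String[] args) {
                  // Holds within one JVM; across JVM versions it may not.
                  System.out.println(keysMatchAcrossInstances("résumé", Locale.US));
              }
          }
          ```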

          Robert Muir added a comment -

          at least in ICU, it's not completely safe. If the different JVM instances are "different" in version (upgrade, etc) then it would be a shame to find your sorts all busted.

          When comparing keys, it is important to know that both keys were generated by the same algorithms and weightings. Otherwise, identical strings with keys generated on two different dates, for example, might compare as unequal. Sort keys can be affected by new versions of ICU or its data tables, new sort key formats, or changes to the Collator.

          http://www.icu-project.org/userguide/Collate_ServiceArchitecture.html

          Steve Rowe added a comment -

          Three problems I can think of off the top of my head with attempting an automatically managed solution to the problem of CollationKey comparability:

          1. There doesn't seem to be any way of ascertaining the RuleBasedCollator version, so one would have to store the exact JVM version and Locale used to generate the Collator, and the strength used, and then fail any range or sort operations if the indexed CollationKeys were produced with ones different from the current ones.
          2. Lucene doesn't have an index-level per-field place to store arbitrary information.
          3. Other implementations of java.text.Collator, besides RuleBasedCollator, are certainly possible.

          So, it seems to me, either the user of this functionality has to manage the versioning external to the Lucene index, or they can't use the functionality.

          Would strong warnings in the javadocs be enough to allow people to take appropriate precautions?

          Robert Muir added a comment -

          One alternative is that the ICU implementation has versioning specifically for this purpose.

          The version information of Collator is a 32-bit integer. If a new version of ICU has changes affecting the content of collation elements, the version information will be changed. In that case, to use the new version of ICU collator will require regenerating any saved or stored sort keys. However, since ICU 1.8.1. it is possible to build your program so that it uses more than one version of ICU. Therefore, you could use the current version for the features you need and use the older version for collation.

          Hoss Man added a comment -

          So, it seems to me, either the user of this functionality has to manage the versioning external to the Lucene index, or they can't use the functionality.

          Would strong warnings in the javadocs be enough to allow people to take appropriate precautions?

          I agree with you on both points ... this is really just an extension of warning people to use compatible analyzers when indexing/querying.

          (I only brought it up in my first comment because I know very little about the internals of any Collator implementations out there, and I wasn't sure if all implementations produce keys that are only comparable between "same" instances ... as long as there are some implementations of Collator that produce keys which can be compared between "equivalent" instances, then this feature certainly seems useful.)

          Steve Rowe added a comment -

          Robert Muir wrote:

          One alternative is that the ICU implementation has versioning specifically for this purpose.

          I'll look into using RegexQuery as a model here (it enables use of either java.util.regex or Jakarta Regexp, defaulting to java.util.regex), and try to add CollatorCapable/CollatorCapabilities, so that ICU's Collator implementation will be usable.

          Steve Rowe added a comment -

          Hoss wrote:

          So, it seems to me, either the user of this functionality has to manage the versioning external to the Lucene index, or they can't use the functionality.

          Would strong warnings in the javadocs be enough to allow people to take appropriate precautions?

          I agree with you on both points ... this is really just an extension of warning people to use compatible analyzers when indexing/querying.

          I will add warnings about this issue to the javadocs.

          Steve Rowe added a comment -

          Modifications in this patch:

          1. Added dependency on ICU4J 4.0
          2. Introduced ICUCollationKeyFilter, which uses ICU collation to produce the collation keys
          3. Added Analyzer versions of the Filters, creating IndexableBinaryStringTools-encoded collation keys from the single token produced by KeywordTokenizer.
          4. Centralized testing to a base class, which the four test classes extend, to avoid duplication
          5. Moved from contrib/analyzers/o/a/l/analysis/miscellaneous/ to a new contrib package: contrib/collation, because it doesn't make sense to add a dependency to the entire contrib/analyzers package just for ICUCollationKeyFilter/Analyzer

          The external ICU4J dependency, which should be checked into contrib/collation/lib/, can be downloaded here: http://download.icu-project.org/files/icu4j/4.0/icu4j-4_0.jar. The license for this jar is included in the patch at contrib/collation/lib/ICU-LICENSE.txt.

          Michael McCandless added a comment -

          Could we, alternatively, push this change into DocumentsWriter, such that on writing a segment it uses a per-field Collator (FieldInfo would be extended to record this) to sort the terms dict?

          I haven't fully thought through the tradeoffs... but it seems like this'd be simpler to use? Ie rather than putting a CollationKeyFilter in your analyzer chain, and then doing the reverse of this for all searches at search time, you simply set the Collator on the fields (at indexing & searching time, since I agree we should for now not try to serialize into the index which field has which Collator)?

          I guess there is a performance cost to using the Collator to do live binary search (during searching) and sorting (during indexing) vs doing unicode String comparisons but in practice at search time this is probably a tiny part of the net cost of searching?

          Steve Rowe added a comment -

          Hi Mike,

          Could we, alternatively, push this change into DocumentsWriter, such that on writing a segment it uses a per-field Collator (FieldInfo would be extended to record this) to sort the terms dict?

          Are you suggesting to not store collation keys in the index?

          I haven't fully thought through the tradeoffs... but it seems like this'd be simpler to use? Ie rather than putting a CollationKeyFilter in your analyzer chain, and then doing the reverse of this for all searches at search time, you simply set the Collator on the fields (at indexing & searching time, since I agree we should for now not try to serialize into the index which field has which Collator)?

          The query-time process in this patch is not the reverse - it is exactly the same. The String-encoded collation keys stored in the index are compared directly with those from query terms. Neither the String-encoding nor the CollationKey needs to be reversed.

          I guess there is a performance cost to using the Collator to do live binary search (during searching) and sorting (during indexing) vs doing unicode String comparisions but in practice at search time this is probably a tiny part of the net cost of searching?

          In the current code base, for range searching on a collated field, every single term has to be collated with the search term. This patch allows skipTo to function when using collation, potentially providing a significant speedup.

          Michael McCandless added a comment -

          Are you suggesting to not store collation keys in the index?

          Right, I'm proposing storing the original Strings, but sorted
          according to Collator.compare (for that one field), in the Terms dict.

          The query-time process in this patch is not the reverse - it is exactly the same.

          OK got it. Where/how would you implement the query time conversion of
          terms?

          And wouldn't there be times when you also want to reverse the
          encoding? EG if you enum all terms for presentation (maybe as part of
          faceted search for example)?

          In the current code base, for range searching on a collated field, every single term has to be collated with the search term. This patch allows skipTo to function when using collation, potentially providing a significant speedup.

          Both the original proposed approach (external-to-indexing) and this
          internal-to-indexing approach would solve this, right? Ie, in both
          cases the terms have been sorted according to the Collator, but in the
          internal-to-indexing case it's the original String term stored in the
          terms dict.

          Here are some pros of internal-to-indexing:

          • You don't have to convert every single term visited during
            analysis first to a CollationKey then ByteBuffer then encoded
            binary string. Indexing throughput should be faster? (Though,
            when writing the segment you do need to sort using
            Collator.compare, which I guess could be slow).
          • Real terms are stored in the index – tools like Luke can look at
            the index and see normal looking terms. Though... I don't have a
            sense of what the encoded term would look like – maybe it's not
            that different from the original in practice?
          • Querying would just work without term conversion

          And some cons:

          • It's obviously a more invasive change to Lucene (and probably
            should go after the flex-indexing changes). The
            external-to-indexing approach is nicely externalized.
          • Performance – the binary search of the terms index would be
            slower using Collator.compare instead of String.compareTo (though
            I would expect this to be minimal in practice).

          I'm sure there are many pros/cons I'm missing...

          Steve Rowe added a comment -

          And wouldn't there be times when you also want to reverse the encoding? EG if you enum all terms for presentation (maybe as part of faceted search for example)?

          AFAIK, CollationKey generation is a one-way operation. If the original terms are required for presentation, they can be stored, right?

          Here are some pros of internal-to-indexing:
          [...]

          • Real terms are stored in the index - tools like Luke can look at
            the index and see normal looking terms. Though... I don't have a
            sense of what the encoded term would look like - maybe it's not
            that different from the original in practice?

          IndexableBinaryStringTools (LUCENE-1434) implements a base-8000h encoding: the lower 15 bits of each character have 1-7/8 bytes packed into them. It's radically different from the original byte array, at least in terms of looking at it with a text viewer like Luke. And I don't think CollationKeys themselves are intended for human consumption.
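          A heavily simplified sketch of the packing idea (hypothetical, not Lucene's actual IndexableBinaryStringTools implementation): stream the input bits into the low 15 bits of each output char, left-aligning the final partial group so that byte order is preserved by plain String comparison.

          ```java
          public class FifteenBitPacker {
              // Pack the input bit stream 15 bits at a time into the low 15
              // bits of each output char; the final partial group is
              // left-aligned so byte-wise order survives String comparison.
              public static String encode(byte[] bytes) {
                  StringBuilder out = new StringBuilder();
                  int buffer = 0, bits = 0;
                  for (byte b : bytes) {
                      buffer = (buffer << 8) | (b & 0xFF);
                      bits += 8;
                      if (bits >= 15) {
                          bits -= 15;
                          out.append((char) ((buffer >>> bits) & 0x7FFF));
                      }
                  }
                  if (bits > 0) {
                      out.append((char) ((buffer << (15 - bits)) & 0x7FFF));
                  }
                  return out.toString();
              }
          }
          ```

          The output is valid UTF-16 but visually opaque, which is why tools like Luke would show gibberish for such terms.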

          In the current code base, for range searching on a collated field, every single term has to be collated with the search term. This patch allows skipTo to function when using collation, potentially providing a significant speedup.

          Both the original proposed approach (external-to-indexing) and this
          internal-to-indexing approach would solve this, right? Ie, in both
          cases the terms have been sorted according to the Collator, but in the
          internal-to-indexing case it's the original String term stored in the
          terms dict.

          Perhaps I'm missing something, but o.a.l.index.TermEnum.skipTo(Term) compares the target term using String.compareTo(), so regardless of the index term dictionary ordering, skipTo() won't necessarily stop at the correct location, right? From TermEnum.java:

            public boolean skipTo(Term target) throws IOException {
              do {
                if (!next())
                  return false;
              } while (target.compareTo(term()) > 0);
              return true;
            }
          

          and here's o.a.l.index.Term.compareTo(Term):

            public final int compareTo(Term other) {
              if (field == other.field)			  // fields are interned
                return text.compareTo(other.text);
              else
                return field.compareTo(other.field);
            }
          
          Michael McCandless added a comment -

          IndexableBinaryStringTools (LUCENE-1434) implements a base-8000h encoding: the lower 15 bits of each character have 1-7/8 bytes packed into them. It's radically different from the original byte array, at least in terms of looking at it with a text viewer like Luke. And I don't think CollationKeys themselves are intended for human consumption.

          Oh OK. So having done this term conversion, you can't really look at / use the resulting terms in the index for human consumption (you'd have to store stuff yourself).

          Perhaps I'm missing something, but o.a.l.index.TermEnum.skipTo(Term) compares the target term using String.compareTo(),

          But we could just fix that to pay attention to the Collator for that field, if it has one, right? (Or with flexible indexing I think the impl really should own this method, ie, it should be abstract in TermEnum).

          I think the external approach is fine for starters... I just think long-term it may make sense to have core Lucene respect the Collator, but it really is an invasive change. We should wait until we make progress on flexible indexing at which point such a change should be far less costly.

          Steve Rowe added a comment -

          Perhaps I'm missing something, but o.a.l.index.TermEnum.skipTo(Term) compares the target term using String.compareTo(),

          But we could just fix that to pay attention to the Collator for that field, if it has one, right? (Or with flexible indexing I think the impl really should own this method, ie, it should be abstract in TermEnum).

          Um, yes.

          I think the external approach is fine for starters... I just think long-term it may make sense to have core Lucene respect the Collator, but it really is an invasive change. We should wait until we make progress on flexible indexing at which point such a change should be far less costly.

          Now that I understand it, I too think the internal-to-indexing approach is cleaner/easier to use/better long-term. This patch is an attempt to improve on the performance of the range collation facilities introduced in LUCENE-1279. So I guess the question is whether it's worth putting in another less-than-optimal workaround.

          Michael McCandless added a comment -

          Another use-case for allowing per-field custom sorting of Terms would be simpler numeric RangeQuery. Ie, right now you have to zero-pad numbers to trick Lucene into sorting them numerically (which causes challenges for BigDecimal, being discussed now on java-user). But if you could have Lucene sort by the number then numeric range queries would be straightforward.

          Steve Rowe added a comment -

          Removed accidentally included IndexableBinaryString and its test from the patch (see LUCENE-1434 for these).

          Michael McCandless added a comment -

          I think we should commit this to contrib/collation as an "external" way to get faster range filters on fields that require custom Collator; at some future point we can consider allowing a given field to sort its terms in some custom way.

          Marvin: does KS/Lucy give control over sort order of the terms in a field?

          Michael McCandless added a comment -

          Steven, I'm hitting compilation errors, eg:

              [javac] /tango/mike/src/lucene.collation/contrib/collation/src/test/org/apache/lucene/collation/CollationTestBase.java:42: package org.apache.lucene.queryParser.analyzing does not exist
              [javac] import org.apache.lucene.queryParser.analyzing.AnalyzingQueryParser;
              [javac]                                               ^
              [javac] /tango/mike/src/lucene.collation/contrib/collation/src/test/org/apache/lucene/collation/CollationTestBase.java:89: cannot find symbol
          

          What is AnalyzingQueryParser?

          Steve Rowe added a comment -

          It's in contrib/miscellaneous/

          I used AnalyzingQueryParser in the tests to allow CollationKeyFilter to be applied to the terms in the range query - the standard QueryParser doesn't analyze range terms.

          From:

          http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/queryParser/analyzing/AnalyzingQueryParser.html

          Overrides Lucene's default QueryParser so that Fuzzy-, Prefix-, Range-, and WildcardQuerys are also passed through the given analyzer, but wild card characters (like *) don't get removed from the search terms.

          This is a (test-only) cross-contrib dependency. I'm not sure why I didn't have trouble with compilation - I haven't looked at this in months. I'll take a look later on tonight.
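The reason range terms must be analyzed is worth spelling out: if indexed terms are CollationKey-encoded but the range endpoints are left raw, the endpoints live in a different "alphabet" and the comparison is meaningless. Both sides must pass through the same encoding. A self-contained sketch of that invariant (helper names are hypothetical, not the AnalyzingQueryParser API):

```java
import java.text.Collator;

// Demonstrates that a range check only makes sense when the term AND both
// endpoints are encoded with the same collator. encode() is the same
// hex-encoding stand-in for IndexableBinaryStringTools used elsewhere.
public class RangeEndpointDemo {
    static String encode(Collator collator, String term) {
        StringBuilder sb = new StringBuilder();
        for (byte b : collator.getCollationKey(term).toByteArray()) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // true iff term falls in [lo, hi] under the collator's order,
    // using only String comparisons on the encoded forms
    public static boolean inRange(Collator c, String term, String lo, String hi) {
        String t = encode(c, term);
        return t.compareTo(encode(c, lo)) >= 0 && t.compareTo(encode(c, hi)) <= 0;
    }
}
```

Under a German collator, "äpfel" falls inside the range [a, b] even though its raw code point is outside that range, which is exactly the behavior a collated RangeQuery needs.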

          Michael McCandless added a comment -

          OK, thanks for the pointer – I learn something new every day!

          Steve Rowe added a comment -

          New patch that compiles.

          I'm not sure how this ever worked previously - I must somehow have had lucene-misc-X.jar on the classpath or something.

          Anyway, the build.xml in this patch, cribbing from contrib/benchmark/build.xml, first builds contrib/miscellaneous, then adds build/contrib/miscellaneous/classes/java/ to the classpath, so that AnalyzingQueryParser can be linked against.

          Everything now compiles, and all contrib tests pass.
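The build.xml approach described above might look roughly like the following Ant fragment (a sketch only; target names and property paths are illustrative, not copied from the patch):

```xml
<!-- Build contrib/miscellaneous first, then add its compiled classes to the
     compile classpath so AnalyzingQueryParser can be linked against. -->
<target name="build-misc">
  <ant dir="${common.dir}/contrib/miscellaneous" target="compile"/>
</target>

<path id="test.classpath">
  <pathelement path="${common.dir}/build/contrib/miscellaneous/classes/java"/>
</path>
```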

          Michael McCandless added a comment -

          Super, thanks Steven. I plan to commit soon.

          Michael McCandless added a comment -

          Thanks Steven!


  People

  • Assignee: Michael McCandless
  • Reporter: Steve Rowe