Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None

      Description

      In LUCENE-2551 collation support was changed to use byte[] keys.

      Previously it encoded sort keys with IndexableBinaryString into char[],
      but this is wasteful with regards to RAM and disk when terms can be byte.

      A better solution would be [ICU]CollationFieldTypes, as this would also allow locale-sensitive range queries.

      1. SOLR-2396.patch
        48 kB
        Robert Muir
      2. SOLR-2396.patch
        48 kB
        Robert Muir
      3. SOLR-2396.patch
        19 kB
        Robert Muir
      4. SOLR-2396.patch
        31 kB
        Robert Muir

        Activity

        Hide
        Robert Muir added a comment -

        patch using the Tokenizer factories, the factory returns KeywordTokenizer with the appropriate AttributeFactory.

        Show
        Robert Muir added a comment - patch using the Tokenizer factories, the factory returns KeywordTokenizer with the appropriate AttributeFactory.
        Hide
        Robert Muir added a comment -

        updated patch, instead adding CollatedField and ICUCollatedField.

        the trick was trying to get this thing to "use" my internal analyzer... setting TOKENIZED and changing SolrQueryParser to check isTokenized() instead of 'instanceof TextField' got things going.

        still needs unit tests but locale-sensitive range queries etc are working here.

        Show
        Robert Muir added a comment - updated patch, instead adding CollatedField and ICUCollatedField. the trick was trying to get this thing to "use" my internal analyzer... setting TOKENIZED and changing SolrQueryParser to check isTokenized() instead of 'instanceof TextField' got things going. still needs unit tests but locale-sensitive range queries etc are working here.
        Hide
        Robert Muir added a comment -

        FYI depending on how LUCENE-2943 is resolved, I'll adjust the patch for the ICU
        case to ensure all the operations are thread-safe... right now its not.

        Show
        Robert Muir added a comment - FYI depending on how LUCENE-2943 is resolved, I'll adjust the patch for the ICU case to ensure all the operations are thread-safe... right now its not.
        Hide
        Robert Muir added a comment -

        Updated patch, with unit tests for the CollationField and ICUCollationField.

        I think this is close.

        Show
        Robert Muir added a comment - Updated patch, with unit tests for the CollationField and ICUCollationField. I think this is close.
        Hide
        Robert Muir added a comment -

        editing summary/description to fit

        Show
        Robert Muir added a comment - editing summary/description to fit
        Hide
        Robert Muir added a comment -

        now that the lucene issues are resolved, I think this patch is really close.

        It could use a review though from someone who knows the fieldtype stuff better.

        Show
        Robert Muir added a comment - now that the lucene issues are resolved, I think this patch is really close. It could use a review though from someone who knows the fieldtype stuff better.
        Hide
        Toke Eskildsen added a comment -

        A rough idea: It seems that ICU Collator Keys are null-terminated. Would it be possible to allow for a key that contained the original String? Something like [collator-bytes][null][term-as-utf-8]? This embedding would make collator-ordered faceting with multiple terms/document much easier.

        Show
        Toke Eskildsen added a comment - A rough idea: It seems that ICU Collator Keys are null-terminated. Would it be possible to allow for a key that contained the original String? Something like [collator-bytes] [null] [term-as-utf-8] ? This embedding would make collator-ordered faceting with multiple terms/document much easier.
        Hide
        Robert Muir added a comment -

        A rough idea: It seems that ICU Collator Keys are null-terminated.

        This isn't always the case, at least at query-time for example if you are using a bound mode (http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CollationKey.BoundMode.html) I think for the UPPER_LONG case it does not exist.

        But, in any case I think we can't rely upon the fact that ICU might currently avoid zero bytes: this isn't really specified anywhere and just an optional impl detail (http://unicode.org/reports/tr10/#Avoiding_Zero_Bytes)

        Show
        Robert Muir added a comment - A rough idea: It seems that ICU Collator Keys are null-terminated. This isn't always the case, at least at query-time for example if you are using a bound mode ( http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CollationKey.BoundMode.html ) I think for the UPPER_LONG case it does not exist. But, in any case I think we can't rely upon the fact that ICU might currently avoid zero bytes: this isn't really specified anywhere and just an optional impl detail ( http://unicode.org/reports/tr10/#Avoiding_Zero_Bytes )
        Hide
        Toke Eskildsen added a comment -

        The JavaDoc for CollationKey is very explicit about the null-termination, but I do not know enough about the inner workings to judge whether a concatenation would work in all cases.

        Show
        Toke Eskildsen added a comment - The JavaDoc for CollationKey is very explicit about the null-termination, but I do not know enough about the inner workings to judge whether a concatenation would work in all cases.
        Hide
        Robert Muir added a comment -

        Well, its something we could be consider?, but on another issue really as its not solr-related... this one just exposes the lucene functionality from LUCENE-2551 and deprecates the old support.

        But really, this is going to be wasteful in most cases versus IDENTICAL strength (which will basically give someone the same functionality from a sort/range-query perspective).

        Show
        Robert Muir added a comment - Well, its something we could be consider?, but on another issue really as its not solr-related... this one just exposes the lucene functionality from LUCENE-2551 and deprecates the old support. But really, this is going to be wasteful in most cases versus IDENTICAL strength (which will basically give someone the same functionality from a sort/range-query perspective).
        Hide
        Robert Muir added a comment -

        I'd like to commit this in a few days if no one objects. The existing encoding is wasteful and I would like to cut solr over to this more efficient one (and enable locale-sensitive range queries).

        We could open future issues for any additional features such as specifying the icu locale as BCP47, etc, etc. (this just implements the lucene 3.1 functionality more efficiently)

        Show
        Robert Muir added a comment - I'd like to commit this in a few days if no one objects. The existing encoding is wasteful and I would like to cut solr over to this more efficient one (and enable locale-sensitive range queries). We could open future issues for any additional features such as specifying the icu locale as BCP47, etc, etc. (this just implements the lucene 3.1 functionality more efficiently)
        Hide
        Robert Muir added a comment -

        Committed revision 1086637.

        Show
        Robert Muir added a comment - Committed revision 1086637.

          People

          • Assignee:
            Robert Muir
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development