Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 3.1
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Since flexible indexing, terms are now represented as byte[], but for backwards compatibility reasons, they are not sorted as byte[], but instead as if they were char[].

      I think its time to look at sorting terms as byte[]... this would yield the following improvements:

      • terms are more opaque by default, they are byte[] and sort as byte[]. I think this would make lucene friendlier to customizations.
      • numerics and collation are then free to use their own encoding (full byte) rather than avoiding the use of certain bits to remain compatible with char[] sort order.
      • automaton gets simpler because as in LUCENE-2265, it uses byte[] too, and has special hacks because terms are sorted as char[]
      1. LUCENE-2426_automaton.patch
        7 kB
        Robert Muir
      2. LUCENE-2426.patch
        71 kB
        Michael McCandless
      3. LUCENE-2426.patch
        47 kB
        Michael McCandless

        Issue Links

          Activity

          Hide
          Robert Muir added a comment -

          by the way: as mentioned above, as far as numerics and collation goes,
          both of these today avoid any of the parts of unicode that are sensitive to such a sort order change.

          So these already "backwards compatible" in the sense that numeric fields or
          collated fields will sort the same way in either UTF-8/UTF-32 byte[] order or UTF-16 char[] order.

          Show
          Robert Muir added a comment - by the way: as mentioned above, as far as numerics and collation goes, both of these today avoid any of the parts of unicode that are sensitive to such a sort order change. So these already "backwards compatible" in the sense that numeric fields or collated fields will sort the same way in either UTF-8/UTF-32 byte[] order or UTF-16 char[] order.
          Hide
          Yonik Seeley added a comment -

          big +1
          the more we get to pure bytes, the better IMO.

          Show
          Yonik Seeley added a comment - big +1 the more we get to pure bytes, the better IMO.
          Hide
          Robert Muir added a comment -

          I think most apps will be unaffected by this change (if the prefix-flex index convertor can sort the terms in binary, too).

          But we need to lookout for some traps:

          • Things that use String.compareTo are dangerous, as it uses code unit order (e.g. i see a binary search w/ this in FieldCache)
          • In general assuming a term can be a String at all is problematic with using byte[] terms, if numeric wants to use full byte, etc.
            So we should think about changing Term, too.

          the best way to avoid problems is to stick with byte[] as much as possible and try to avoid using String for terms...

          Show
          Robert Muir added a comment - I think most apps will be unaffected by this change (if the prefix-flex index convertor can sort the terms in binary, too). But we need to lookout for some traps: Things that use String.compareTo are dangerous, as it uses code unit order (e.g. i see a binary search w/ this in FieldCache) In general assuming a term can be a String at all is problematic with using byte[] terms, if numeric wants to use full byte, etc. So we should think about changing Term, too. the best way to avoid problems is to stick with byte[] as much as possible and try to avoid using String for terms...
          Hide
          Michael McCandless added a comment -

          Big +1 too

          For FieldCache, we need to do LUCENE-2380 (creates a BytesRef field cache) and switch Lucene to use it – I'll add a dependency.

          Show
          Michael McCandless added a comment - Big +1 too For FieldCache, we need to do LUCENE-2380 (creates a BytesRef field cache) and switch Lucene to use it – I'll add a dependency.
          Hide
          Michael McCandless added a comment -

          Checkpointing my current state here... it should compile but tests are probably failing from the mods in preflex codec.

          Show
          Michael McCandless added a comment - Checkpointing my current state here... it should compile but tests are probably failing from the mods in preflex codec.
          Hide
          Michael McCandless added a comment -

          Attached patch, changing term sort order to unicode codepoint! All tests pass. I fixed preflex codec to seek around surrogates, and then back again, so that preflex indices also sort properly; it's rather hairy... I added a new randomized test that writes a preflex segment (just the terms dict) with random terms and then asserts the order.

          Show
          Michael McCandless added a comment - Attached patch, changing term sort order to unicode codepoint! All tests pass. I fixed preflex codec to seek around surrogates, and then back again, so that preflex indices also sort properly; it's rather hairy... I added a new randomized test that writes a preflex segment (just the terms dict) with random terms and then asserts the order.
          Hide
          Robert Muir added a comment -

          How to deal with Term?

          I really don't like that Term.compareTo uses String.compareTo, for example MultiTermQuery uses this in TopTermsBooleanQueryRewrite for comparing terms in its priority queue.

          I don't think it should block this patch either, but we should at least open a second issue to figure out what to do about this.
          Term needs to either go away, or use BytesRef w/ the codec's comparator in cases like this, or some things like FuzzyQuery will be technically wrong (i should add a test for this too, I think)

          Show
          Robert Muir added a comment - How to deal with Term? I really don't like that Term.compareTo uses String.compareTo, for example MultiTermQuery uses this in TopTermsBooleanQueryRewrite for comparing terms in its priority queue. I don't think it should block this patch either, but we should at least open a second issue to figure out what to do about this. Term needs to either go away, or use BytesRef w/ the codec's comparator in cases like this, or some things like FuzzyQuery will be technically wrong (i should add a test for this too, I think)
          Hide
          Michael McCandless added a comment -

          How to deal with Term?

          Maybe we should keep it, but do a hard cutover of its .text from String to BytesRef, and also change its .compareTo to compare text by unicode code point order?

          I agree we should do this as a followon issue; in fact I think another issue is already open.

          Note, though, that field names still sort by UTF16 (String.compareTo) order.

          Show
          Michael McCandless added a comment - How to deal with Term? Maybe we should keep it, but do a hard cutover of its .text from String to BytesRef, and also change its .compareTo to compare text by unicode code point order? I agree we should do this as a followon issue; in fact I think another issue is already open. Note, though, that field names still sort by UTF16 (String.compareTo) order.

            People

            • Assignee:
              Unassigned
              Reporter:
              Robert Muir
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development