Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 3.1
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Since flexible indexing, terms are now represented as byte[], but for backwards compatibility reasons, they are not sorted as byte[], but instead as if they were char[].

      I think its time to look at sorting terms as byte[]... this would yield the following improvements:

      • terms are more opaque by default, they are byte[] and sort as byte[]. I think this would make lucene friendlier to customizations.
      • numerics and collation are then free to use their own encoding (full byte) rather than avoiding the use of certain bits to remain compatible with char[] sort order.
      • automaton gets simpler because as in LUCENE-2265, it uses byte[] too, and has special hacks because terms are sorted as char[]
      1. LUCENE-2426.patch
        47 kB
        Michael McCandless
      2. LUCENE-2426_automaton.patch
        7 kB
        Robert Muir
      3. LUCENE-2426.patch
        71 kB
        Michael McCandless

        Issue Links

          Activity

          Transition Time In Source Status Execution Times Last Executer Last Execution Date
          Open Open Resolved Resolved
          52d 23h 17m 1 Michael McCandless 24/Jun/10 13:36
          Resolved Resolved Closed Closed
          1050d 21h 6m 1 Uwe Schindler 10/May/13 10:43
          Uwe Schindler made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Gavin made changes -
          Link This issue depends upon LUCENE-2442 [ LUCENE-2442 ]
          Gavin made changes -
          Link This issue depends on LUCENE-2442 [ LUCENE-2442 ]
          Gavin made changes -
          Link This issue depends upon LUCENE-2380 [ LUCENE-2380 ]
          Gavin made changes -
          Link This issue depends on LUCENE-2380 [ LUCENE-2380 ]
          Gavin made changes -
          Link This issue depends upon LUCENE-2265 [ LUCENE-2265 ]
          Gavin made changes -
          Link This issue depends on LUCENE-2265 [ LUCENE-2265 ]
          Mark Thomas made changes -
          Workflow Default workflow, editable Closed status [ 12564294 ] jira [ 12584829 ]
          Mark Thomas made changes -
          Workflow jira [ 12509850 ] Default workflow, editable Closed status [ 12564294 ]
          Michael McCandless made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Hide
          Michael McCandless added a comment -

          How to deal with Term?

          Maybe we should keep it, but do a hard cutover of its .text from String to BytesRef, and also change its .compareTo to compare text by unicode code point order?

          I agree we should do this as a followon issue; in fact I think another issue is already open.

          Note, though, that field names still sort by UTF16 (String.compareTo) order.

          Show
          Michael McCandless added a comment - How to deal with Term? Maybe we should keep it, but do a hard cutover of its .text from String to BytesRef, and also change its .compareTo to compare text by unicode code point order? I agree we should do this as a followon issue; in fact I think another issue is already open. Note, though, that field names still sort by UTF16 (String.compareTo) order.
          Hide
          Robert Muir added a comment -

          How to deal with Term?

          I really don't like that Term.compareTo uses String.compareTo, for example MultiTermQuery uses this in TopTermsBooleanQueryRewrite for comparing terms in its priority queue.

          I don't think it should block this patch either, but we should at least open a second issue to figure out what to do about this.
          Term needs to either go away, or use BytesRef w/ the codec's comparator in cases like this, or some things like FuzzyQuery will be technically wrong (i should add a test for this too, I think)

          Show
          Robert Muir added a comment - How to deal with Term? I really don't like that Term.compareTo uses String.compareTo, for example MultiTermQuery uses this in TopTermsBooleanQueryRewrite for comparing terms in its priority queue. I don't think it should block this patch either, but we should at least open a second issue to figure out what to do about this. Term needs to either go away, or use BytesRef w/ the codec's comparator in cases like this, or some things like FuzzyQuery will be technically wrong (i should add a test for this too, I think)
          Michael McCandless made changes -
          Attachment LUCENE-2426.patch [ 12447567 ]
          Hide
          Michael McCandless added a comment -

          Attached patch, changing term sort order to unicode codepoint! All tests pass. I fixed preflex codec to seek around surrogates, and then back again, so that preflex indices also sort properly; it's rather hairy... I added a new randomized test that writes a preflex segment (just the terms dict) with random terms and then asserts the order.

          Show
          Michael McCandless added a comment - Attached patch, changing term sort order to unicode codepoint! All tests pass. I fixed preflex codec to seek around surrogates, and then back again, so that preflex indices also sort properly; it's rather hairy... I added a new randomized test that writes a preflex segment (just the terms dict) with random terms and then asserts the order.
          Robert Muir made changes -
          Attachment LUCENE-2426_automaton.patch [ 12447476 ]
          Michael McCandless made changes -
          Attachment LUCENE-2426.patch [ 12447475 ]
          Hide
          Michael McCandless added a comment -

          Checkpointing my current state here... it should compile but tests are probably failing from the mods in preflex codec.

          Show
          Michael McCandless added a comment - Checkpointing my current state here... it should compile but tests are probably failing from the mods in preflex codec.
          Michael McCandless made changes -
          Link This issue is blocked by LUCENE-2378 [ LUCENE-2378 ]
          Uwe Schindler made changes -
          Fix Version/s 4.0 [ 12314025 ]
          Fix Version/s 3.1 [ 12314822 ]
          Robert Muir made changes -
          Link This issue depends on LUCENE-2442 [ LUCENE-2442 ]
          Michael McCandless made changes -
          Link This issue depends on LUCENE-2380 [ LUCENE-2380 ]
          Hide
          Michael McCandless added a comment -

          Big +1 too

          For FieldCache, we need to do LUCENE-2380 (creates a BytesRef field cache) and switch Lucene to use it – I'll add a dependency.

          Show
          Michael McCandless added a comment - Big +1 too For FieldCache, we need to do LUCENE-2380 (creates a BytesRef field cache) and switch Lucene to use it – I'll add a dependency.
          Hide
          Robert Muir added a comment -

          I think most apps will be unaffected by this change (if the prefix-flex index convertor can sort the terms in binary, too).

          But we need to lookout for some traps:

          • Things that use String.compareTo are dangerous, as it uses code unit order (e.g. i see a binary search w/ this in FieldCache)
          • In general assuming a term can be a String at all is problematic with using byte[] terms, if numeric wants to use full byte, etc.
            So we should think about changing Term, too.

          the best way to avoid problems is to stick with byte[] as much as possible and try to avoid using String for terms...

          Show
          Robert Muir added a comment - I think most apps will be unaffected by this change (if the prefix-flex index convertor can sort the terms in binary, too). But we need to lookout for some traps: Things that use String.compareTo are dangerous, as it uses code unit order (e.g. i see a binary search w/ this in FieldCache) In general assuming a term can be a String at all is problematic with using byte[] terms, if numeric wants to use full byte, etc. So we should think about changing Term, too. the best way to avoid problems is to stick with byte[] as much as possible and try to avoid using String for terms...
          Hide
          Yonik Seeley added a comment -

          big +1
          the more we get to pure bytes, the better IMO.

          Show
          Yonik Seeley added a comment - big +1 the more we get to pure bytes, the better IMO.
          Hide
          Robert Muir added a comment -

          by the way: as mentioned above, as far as numerics and collation goes,
          both of these today avoid any of the parts of unicode that are sensitive to such a sort order change.

          So these already "backwards compatible" in the sense that numeric fields or
          collated fields will sort the same way in either UTF-8/UTF-32 byte[] order or UTF-16 char[] order.

          Show
          Robert Muir added a comment - by the way: as mentioned above, as far as numerics and collation goes, both of these today avoid any of the parts of unicode that are sensitive to such a sort order change. So these already "backwards compatible" in the sense that numeric fields or collated fields will sort the same way in either UTF-8/UTF-32 byte[] order or UTF-16 char[] order.
          Robert Muir made changes -
          Field Original Value New Value
          Link This issue depends on LUCENE-2265 [ LUCENE-2265 ]
          Robert Muir created issue -

            People

            • Assignee:
              Unassigned
              Reporter:
              Robert Muir
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development