Details

    • Type: Task Task
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      in LUCENE-2426, the sort order was changed to codepoint order.

      unfortunately, Term is still using string internally, and more importantly its compareTo() uses the wrong order [utf-16].
      So MultiTermQuery, etc (especially its priority queues) are currently wrong.

      By changing Term to use bytes, we can also support terms encoded as bytes such as numerics, instead of using
      strange string encodings.

      1. LUCENE-2514.patch
        188 kB
        Robert Muir
      2. LUCENE-2514_collatedrange.patch
        151 kB
        Robert Muir
      3. LUCENE-2514_collatedrange.patch
        149 kB
        Robert Muir
      4. LUCENE-2514_collatedrange.patch
        49 kB
        Robert Muir
      5. LUCENE-2514_qp.patch
        22 kB
        Robert Muir
      6. LUCENE-2514.patch
        112 kB
        Uwe Schindler
      7. LUCENE-2514.patch
        112 kB
        Uwe Schindler
      8. LUCENE-2514.patch
        111 kB
        Robert Muir
      9. LUCENE-2514.patch
        62 kB
        Robert Muir
      10. LUCENE-2514.patch
        52 kB
        Uwe Schindler
      11. LUCENE-2514.patch
        49 kB
        Robert Muir
      12. LUCENE-2514.patch
        45 kB
        Robert Muir
      13. LUCENE-2514-MTQPagedBytes.patch
        12 kB
        Uwe Schindler
      14. LUCENE-2514-MTQPagedBytes.patch
        12 kB
        Uwe Schindler
      15. LUCENE-2514.patch
        54 kB
        Robert Muir
      16. LUCENE-2514-MTQPagedBytes.patch
        10 kB
        Uwe Schindler
      17. LUCENE-2514.patch
        41 kB
        Robert Muir
      18. LUCENE-2514.patch
        39 kB
        Uwe Schindler
      19. LUCENE-2514.patch
        30 kB
        Robert Muir
      20. LUCENE-2514.patch
        14 kB
        Robert Muir
      21. LUCENE-2514-surrogates-dance.patch
        2 kB
        Michael McCandless
      22. LUCENE-2514.patch
        13 kB
        Robert Muir
      23. LUCENE-2514.patch
        8 kB
        Robert Muir

        Issue Links

          Activity

          Hide
          Robert Muir added a comment -

          attached is one option: use bytesref behind the scenes but also support String ctors like we do today.

          i tried the 'hard cutover' mike suggested, but this is a massive change and I think typically users will just be using String.

          personally I don't see the harm in supporting Strings this way, any perf-sensitive stuff creating a lot of Term objects (e.g. MultiTermQuery) should be using BytesRef anyway.

          one test fails: the preflex TestSurrogates... does this test create terms with unpaired surrogates? If so this would explain the failures I think.

          Show
          Robert Muir added a comment - attached is one option: use bytesref behind the scenes but also support String ctors like we do today. i tried the 'hard cutover' mike suggested, but this is a massive change and I think typically users will just be using String. personally I don't see the harm in supporting Strings this way, any perf-sensitive stuff creating a lot of Term objects (e.g. MultiTermQuery) should be using BytesRef anyway. one test fails: the preflex TestSurrogates... does this test create terms with unpaired surrogates? If so this would explain the failures I think.
          Hide
          Michael McCandless added a comment -

          one test fails: the preflex TestSurrogates... does this test create terms with unpaired surrogates? If so this would explain the failures I think.

          Oh man not that one!!

          It is not supposed to create unpaired surrogates (it uses _TestUtil.makeRandomUnicodeString). I'll dig...

          Show
          Michael McCandless added a comment - one test fails: the preflex TestSurrogates... does this test create terms with unpaired surrogates? If so this would explain the failures I think. Oh man not that one!! It is not supposed to create unpaired surrogates (it uses _TestUtil.makeRandomUnicodeString). I'll dig...
          Hide
          Robert Muir added a comment -

          its my fault. preflex codec still uses Term.compareTo (and this is expected) in several places.

          so, i need to add back UTF8SortedAsUTF16Comparator to bytesref, and add a utf-16 compare to Term that uses it for the old behavior (or accomplish the equiv logic somewhere else)

          Show
          Robert Muir added a comment - its my fault. preflex codec still uses Term.compareTo (and this is expected) in several places. so, i need to add back UTF8SortedAsUTF16Comparator to bytesref, and add a utf-16 compare to Term that uses it for the old behavior (or accomplish the equiv logic somewhere else)
          Hide
          Robert Muir added a comment -

          i tried to fix the preflex problem like this, but it didnt work... maybe something (sort or other) is still relying on the old utf-16 compareTo() that i didnt find????

          Show
          Robert Muir added a comment - i tried to fix the preflex problem like this, but it didnt work... maybe something (sort or other) is still relying on the old utf-16 compareTo() that i didnt find????
          Hide
          Michael McCandless added a comment -

          Ahh sorry I do indeed seek to an unpaired high surrogate – easy to fix (pair it up w/ minimum low surrogate).

          But, there's another problem, which is that the pre-flex codec uses Term.compareTo, and it needs that to be based on UTF16. We can just fix the preflex codec to use its own (UTF8inUTF16order) comparator. I think once we fix those two the test should pass again (crossing fingers...).

          Show
          Michael McCandless added a comment - Ahh sorry I do indeed seek to an unpaired high surrogate – easy to fix (pair it up w/ minimum low surrogate). But, there's another problem, which is that the pre-flex codec uses Term.compareTo, and it needs that to be based on UTF16. We can just fix the preflex codec to use its own (UTF8inUTF16order) comparator. I think once we fix those two the test should pass again (crossing fingers...).
          Hide
          Michael McCandless added a comment -

          Attached patch to fix (I think) the unpaired surrogate passed to Term... not yet tested.

          Show
          Michael McCandless added a comment - Attached patch to fix (I think) the unpaired surrogate passed to Term... not yet tested.
          Hide
          Robert Muir added a comment -

          ok i combined my patch with yours, and fixed a typo (thats what i get for copy/paste)...

          i think it might be ok

          Show
          Robert Muir added a comment - ok i combined my patch with yours, and fixed a typo (thats what i get for copy/paste)... i think it might be ok
          Hide
          Yonik Seeley added a comment -

          Looks good!
          Might want to just add a doc on the Term constructor that the provided BytesRef is not copied, and should not be modified after construction (i.e. if you create a Term w/ the BytesRef provided by a TermsEnum or something, you're in trouble).

          Show
          Yonik Seeley added a comment - Looks good! Might want to just add a doc on the Term constructor that the provided BytesRef is not copied, and should not be modified after construction (i.e. if you create a Term w/ the BytesRef provided by a TermsEnum or something, you're in trouble).
          Hide
          Robert Muir added a comment -

          Might want to just add a doc on the Term constructor that the provided BytesRef is not copied, and should not be modified after construction (i.e. if you create a Term w/ the BytesRef provided by a TermsEnum or something, you're in trouble).

          Yeah, I agree. I'd also like to improve some of the perf sensitive spots to never create strings at all before committing (e.g. MultiTermQuery rewrites).

          Show
          Robert Muir added a comment - Might want to just add a doc on the Term constructor that the provided BytesRef is not copied, and should not be modified after construction (i.e. if you create a Term w/ the BytesRef provided by a TermsEnum or something, you're in trouble). Yeah, I agree. I'd also like to improve some of the perf sensitive spots to never create strings at all before committing (e.g. MultiTermQuery rewrites).
          Hide
          Robert Muir added a comment -

          here's an updated patch with yonik's suggested note (might need wording changes).

          Also i started converting some various code to use the bytes(), queries and such...

          Show
          Robert Muir added a comment - here's an updated patch with yonik's suggested note (might need wording changes). Also i started converting some various code to use the bytes(), queries and such...
          Hide
          Robert Muir added a comment -

          Mike, can you review some of the preflex changes to bytes() in the patch?
          I wonder if in some of these places we even need 'scratchBytesRef' at all now...

          Show
          Robert Muir added a comment - Mike, can you review some of the preflex changes to bytes() in the patch? I wonder if in some of these places we even need 'scratchBytesRef' at all now...
          Hide
          Michael McCandless added a comment -

          preflex changes look good.

          I think with this we can eliminate the scratchBytesRef entirely in PreFlexFields and instead just use termEnum.term().bytes()!

          Show
          Michael McCandless added a comment - preflex changes look good. I think with this we can eliminate the scratchBytesRef entirely in PreFlexFields and instead just use termEnum.term().bytes()!
          Hide
          Uwe Schindler added a comment -

          Robert: I can take MTQ tomorrow. I think we can remove the whole backwards stuff from MTQ and change completely to BytesRef (internally). This makes the steps TermsEnum (bytes) -> TermCollector -> TermQuery which converts all the time simplier. The collector abstract class in the MTQ rewrites will be much nicer.
          I can also remove the rest of pre-BoostAttribute stuff from TopTermsRewrite.

          I will go to sleep now, tomorrow more...

          Show
          Uwe Schindler added a comment - Robert: I can take MTQ tomorrow. I think we can remove the whole backwards stuff from MTQ and change completely to BytesRef (internally). This makes the steps TermsEnum (bytes) -> TermCollector -> TermQuery which converts all the time simplier. The collector abstract class in the MTQ rewrites will be much nicer. I can also remove the rest of pre-BoostAttribute stuff from TopTermsRewrite. I will go to sleep now, tomorrow more...
          Hide
          Robert Muir added a comment -

          Uwe thanks, I would prefer if you did MTQ too.

          I agree it should completely use BytesRef... i think it should also not create new Terms even until rewriting.

          For example, currently the priority queue in TopTerms does BytesRef -> String conversion and creates a new Term for each add, but this might be entirely useless as it could fall off the pq, so i think its ScoreTerm or whatever should not hold term at all but just bytesref.

          Show
          Robert Muir added a comment - Uwe thanks, I would prefer if you did MTQ too. I agree it should completely use BytesRef... i think it should also not create new Terms even until rewriting. For example, currently the priority queue in TopTerms does BytesRef -> String conversion and creates a new Term for each add, but this might be entirely useless as it could fall off the pq, so i think its ScoreTerm or whatever should not hold term at all but just bytesref.
          Hide
          Uwe Schindler added a comment -

          For example, currently the priority queue in TopTerms does BytesRef -> String conversion and creates a new Term for each add, but this might be entirely useless as it could fall off the pq, so i think its ScoreTerm or whatever should not hold term at all but just bytesref

          Exactly! We removed support for TermEnum (without s), so field name is never null. You can always take the field from the MTQ when building TermQueries. And for that we create the Term using new Term(field, BytesRef) or with the non-interning placeholder (see also below). This makes MTQ much simplier, I started to do it...

          By the way: we could remove all String interning for field names now? We don't compare fields anymore?

          Show
          Uwe Schindler added a comment - For example, currently the priority queue in TopTerms does BytesRef -> String conversion and creates a new Term for each add, but this might be entirely useless as it could fall off the pq, so i think its ScoreTerm or whatever should not hold term at all but just bytesref Exactly! We removed support for TermEnum (without s), so field name is never null. You can always take the field from the MTQ when building TermQueries. And for that we create the Term using new Term(field, BytesRef) or with the non-interning placeholder (see also below). This makes MTQ much simplier, I started to do it... By the way: we could remove all String interning for field names now? We don't compare fields anymore?
          Hide
          Yonik Seeley added a comment -

          By the way: we could remove all String interning for field names now? We don't compare fields anymore?

          Yeah, I noticed that too... I think so.

          Show
          Yonik Seeley added a comment - By the way: we could remove all String interning for field names now? We don't compare fields anymore? Yeah, I noticed that too... I think so.
          Hide
          Michael McCandless added a comment -

          We also need to fix FieldCache/TermRangeQuery, since they now take separate String upper/lower. We could just add corresponding BytesRef methods?

          By the way: we could remove all String interning for field names now? We don't compare fields anymore?

          We should be careful with this!

          Show
          Michael McCandless added a comment - We also need to fix FieldCache/TermRangeQuery, since they now take separate String upper/lower. We could just add corresponding BytesRef methods? By the way: we could remove all String interning for field names now? We don't compare fields anymore? We should be careful with this!
          Hide
          Robert Muir added a comment -

          We also need to fix FieldCache/TermRangeQuery, since they now take separate String upper/lower. We could just add corresponding BytesRef methods?

          I'll fix these. I didnt notice them since they don't use Term, but String directly. The patch is also generally incomplete in other ways... mainly I am searching on uses of Term.text() and such to ensure I do not introduce performance regressions.

          Show
          Robert Muir added a comment - We also need to fix FieldCache/TermRangeQuery, since they now take separate String upper/lower. We could just add corresponding BytesRef methods? I'll fix these. I didnt notice them since they don't use Term, but String directly. The patch is also generally incomplete in other ways... mainly I am searching on uses of Term.text() and such to ensure I do not introduce performance regressions.
          Hide
          Uwe Schindler added a comment -

          Here robert's patch with MTQ changed.

          It currently still uses placeholderTerms to not need to intern every time. If we remove string interning from Term, we can replace this by simple new Term() in MTQ.

          I delayed cloning of BytesRef until the BytesRef is put into a TermQuery or PQ or whenever it is set aside. But it no longer clones it e.g. if the term is never accepted by the PQ. Also the PQ reuses its ScoreTerm instances and so, the term bytes are simply copied over

          I also removed a Java 1.6 interface override - the Generics Policeman gives a ticket! I don't understand where those come from, Java 1.6 should also fail to compile as the ant build uses -source 1.5...?

          Show
          Uwe Schindler added a comment - Here robert's patch with MTQ changed. It currently still uses placeholderTerms to not need to intern every time. If we remove string interning from Term, we can replace this by simple new Term() in MTQ. I delayed cloning of BytesRef until the BytesRef is put into a TermQuery or PQ or whenever it is set aside. But it no longer clones it e.g. if the term is never accepted by the PQ. Also the PQ reuses its ScoreTerm instances and so, the term bytes are simply copied over I also removed a Java 1.6 interface override - the Generics Policeman gives a ticket! I don't understand where those come from, Java 1.6 should also fail to compile as the ant build uses -source 1.5...?
          Hide
          Robert Muir added a comment -

          I also removed a Java 1.6 interface override - the Generics Policeman gives a ticket! I don't understand where those come from, Java 1.6 should also fail to compile as the ant build uses -source 1.5...?

          Sorry, i am on a mac right now and i dont think i configured it correctly... (though ant test never complained... this is wierd).
          Normallly my IDE does not generate this... but at the same time it is something we should fix the build for, as i think Eclipse
          will generate these by default if configured for Java 6, which solr uses.

          Show
          Robert Muir added a comment - I also removed a Java 1.6 interface override - the Generics Policeman gives a ticket! I don't understand where those come from, Java 1.6 should also fail to compile as the ant build uses -source 1.5...? Sorry, i am on a mac right now and i dont think i configured it correctly... (though ant test never complained... this is wierd). Normallly my IDE does not generate this... but at the same time it is something we should fix the build for, as i think Eclipse will generate these by default if configured for Java 6, which solr uses.
          Hide
          Uwe Schindler added a comment -

          mainly I am searching on uses of Term.text() and such to ensure I do not introduce performance regressions.

          I would temporary remove all String methods from Term and try to compile core. If this works you should find all perf regressions in both directions.

          Show
          Uwe Schindler added a comment - mainly I am searching on uses of Term.text() and such to ensure I do not introduce performance regressions. I would temporary remove all String methods from Term and try to compile core. If this works you should find all perf regressions in both directions.
          Hide
          Robert Muir added a comment -

          attached is an updated patch, with uwe's changes, plus some additional conversions like TermsFilter and FieldCacheTermsFilter

          The range ones are a bit tricky, mainly because they work with collators with makes no sense with byte[]. but if collator is null then byte[] makes sense.

          the collator stuff is silly in a way, if we switch collation to byte[] it will use less ram than even the original String in lucene 3.x, and sort much faster.

          one option might be to split the collating range stuff into its own classes or something, i think its a bit confusing how collation is mixed in with 'binary' order... it tricks you into thinking the 'default' is UCA or default locale or something, but is neither.

          Show
          Robert Muir added a comment - attached is an updated patch, with uwe's changes, plus some additional conversions like TermsFilter and FieldCacheTermsFilter The range ones are a bit tricky, mainly because they work with collators with makes no sense with byte[]. but if collator is null then byte[] makes sense. the collator stuff is silly in a way, if we switch collation to byte[] it will use less ram than even the original String in lucene 3.x, and sort much faster. one option might be to split the collating range stuff into its own classes or something, i think its a bit confusing how collation is mixed in with 'binary' order... it tricks you into thinking the 'default' is UCA or default locale or something, but is neither.
          Hide
          Robert Muir added a comment -

          I would temporary remove all String methods from Term and try to compile core. If this works you should find all perf regressions in both directions.

          i did this, but there is a lot of unrelated stuff that uses Term string methods and is perfectly fine. especially many tests and things like queryparser.

          Show
          Robert Muir added a comment - I would temporary remove all String methods from Term and try to compile core. If this works you should find all perf regressions in both directions. i did this, but there is a lot of unrelated stuff that uses Term string methods and is perfectly fine. especially many tests and things like queryparser.
          Hide
          Uwe Schindler added a comment -

          I would temporary remove all String methods from Term and try to compile core. If this works you should find all perf regressions in both directions.

          i did this, but there is a lot of unrelated stuff that uses Term string methods and is perfectly fine. especially many tests and things like queryparser.

          Yeah tests can use string methods, i meant only core classes should compile without Term's String methods.

          one option might be to split the collating range stuff into its own classes or something, i think its a bit confusing how collation is mixed in with 'binary' order... it tricks you into thinking the 'default' is UCA or default locale or something, but is neither.

          I was always thinking about factoring out collation stuff from TermRangeQuery. I would like to have a pure TermRangeQuery without any collations things and maybe a CollationTermRangeQuery or whatever...

          Show
          Uwe Schindler added a comment - I would temporary remove all String methods from Term and try to compile core. If this works you should find all perf regressions in both directions. i did this, but there is a lot of unrelated stuff that uses Term string methods and is perfectly fine. especially many tests and things like queryparser. Yeah tests can use string methods, i meant only core classes should compile without Term's String methods. one option might be to split the collating range stuff into its own classes or something, i think its a bit confusing how collation is mixed in with 'binary' order... it tricks you into thinking the 'default' is UCA or default locale or something, but is neither. I was always thinking about factoring out collation stuff from TermRangeQuery. I would like to have a pure TermRangeQuery without any collations things and maybe a CollationTermRangeQuery or whatever...
          Hide
          Robert Muir added a comment -

          Yeah tests can use string methods, i meant only core classes should compile without Term's String methods.

          Ok, i will have another look at this (ignoring tests and queryparser and such). Maybe i will find something interesting.

          I was always thinking about factoring out collation stuff from TermRangeQuery. I would like to have a pure TermRangeQuery without any collations things and maybe a CollationTermRangeQuery or whatever...

          yeah, i think this would be better in the future too. but for now, i can work with what we have. it just means lots of things like String lowerBound; /* only used when collator != null */ and such

          Show
          Robert Muir added a comment - Yeah tests can use string methods, i meant only core classes should compile without Term's String methods. Ok, i will have another look at this (ignoring tests and queryparser and such). Maybe i will find something interesting. I was always thinking about factoring out collation stuff from TermRangeQuery. I would like to have a pure TermRangeQuery without any collations things and maybe a CollationTermRangeQuery or whatever... yeah, i think this would be better in the future too. but for now, i can work with what we have. it just means lots of things like String lowerBound; /* only used when collator != null */ and such
          Hide
          Uwe Schindler added a comment -

          Here an improvement of the MTQ only patch:

          The auto MTQ rewrite mode now collects all terms into a PagedBytes until cutoff. This maybe better memory-wise, not sure if this is really needed. For me it was just some usage training and the number of objects is lower, especially for large cutoff numbers.

          Mike?

          Show
          Uwe Schindler added a comment - Here an improvement of the MTQ only patch: The auto MTQ rewrite mode now collects all terms into a PagedBytes until cutoff. This maybe better memory-wise, not sure if this is really needed. For me it was just some usage training and the number of objects is lower, especially for large cutoff numbers. Mike?
          Hide
          Robert Muir added a comment - - edited

          edit: i added newlines and reformatted so the issue is readable. sorry.

          We also need to fix FieldCache/TermRangeQuery, since they now take separate String upper/lower. We could just add corresponding BytesRef methods?

          This actually makes the api ugly and problematic for ctors, because of open-ended values (null). it would be nice to avoid requiring users to cast here...

           
          reference to TermRangeQuery is ambiguous, both method 
          TermRangeQuery(java.lang.String,java.lang.String,java.lang.String,boolean,boolean) 
          in org.apache.lucene.search.TermRangeQuery and method 
          TermRangeQuery(java.lang.String,org.apache.lucene.util.BytesRef,
          org.apache.lucene.util.BytesRef,boolean,boolean) 
          in org.apache.lucene.search.TermRangeQuery match
              [javac]     TermRangeQuery query = new TermRangeQuery("content", null, null, true, true);
          

          attached is my updates to TermRangeQuery etc before realizing this. it also includes updated FieldCacheRangeQuery complete with generics violations.

          maybe when the policeman wakes up he will have ideas. i don't want the api to be ugly.

          Show
          Robert Muir added a comment - - edited edit: i added newlines and reformatted so the issue is readable. sorry. We also need to fix FieldCache/TermRangeQuery, since they now take separate String upper/lower. We could just add corresponding BytesRef methods? This actually makes the api ugly and problematic for ctors, because of open-ended values (null). it would be nice to avoid requiring users to cast here... reference to TermRangeQuery is ambiguous, both method TermRangeQuery(java.lang.String,java.lang.String,java.lang.String,boolean,boolean) in org.apache.lucene.search.TermRangeQuery and method TermRangeQuery(java.lang.String,org.apache.lucene.util.BytesRef, org.apache.lucene.util.BytesRef,boolean,boolean) in org.apache.lucene.search.TermRangeQuery match [javac] TermRangeQuery query = new TermRangeQuery("content", null, null, true, true); attached is my updates to TermRangeQuery etc before realizing this. it also includes updated FieldCacheRangeQuery complete with generics violations. maybe when the policeman wakes up he will have ideas. i don't want the api to be ugly.
          Hide
          Uwe Schindler added a comment -

          Improved version of PagedBytes MTQ cut off collector. It adds a method to PagedBytes.Reader to sequentially read all BytesRefs without a separate offset array.

          Mike, if you are fine with that, we should add this to the global patch for this issue.

          We should maybe also fix PagedBytes.freeze(boolean), as the parameter is currently unused. For the use case here, reallocating the last block is not really needed, it can stay as is. Maybe we should readd support for this parameter.

          Show
          Uwe Schindler added a comment - Improved version of PagedBytes MTQ cut off collector. It adds a method to PagedBytes.Reader to sequentially read all BytesRefs without a separate offset array. Mike, if you are fine with that, we should add this to the global patch for this issue. We should maybe also fix PagedBytes.freeze(boolean), as the parameter is currently unused. For the use case here, reallocating the last block is not really needed, it can stay as is. Maybe we should readd support for this parameter.
          Hide
          Uwe Schindler added a comment -

          This patch only adds a correct trim to PagesBytes.freeze(boolean). The impl was missing, but is important here for performance.

          Show
          Uwe Schindler added a comment - This patch only adds a correct trim to PagesBytes.freeze(boolean). The impl was missing, but is important here for performance.
          Hide
          Robert Muir added a comment -

          Uwe, thanks for the MTQ updates. i will merge them with my previous patch (not my latest broken one!)

          After looking at RangeQuery/Filter, i think i would prefer to make a subtask to refactor the collation part out.
          For example, we could make CollatedRangeQuery.

          In my opinion this is really very different from an ordinary binary-ordered range query and is confusing to both
          users and the code to be mixed in. I think we should consider fixing the API by splitting these and adding
          documentation to the migrate.txt

          Show
          Robert Muir added a comment - Uwe, thanks for the MTQ updates. i will merge them with my previous patch (not my latest broken one!) After looking at RangeQuery/Filter, i think i would prefer to make a subtask to refactor the collation part out. For example, we could make CollatedRangeQuery. In my opinion this is really very different from an ordinary binary-ordered range query and is confusing to both users and the code to be mixed in. I think we should consider fixing the API by splitting these and adding documentation to the migrate.txt
          Hide
          Uwe Schindler added a comment -

          +1

          Show
          Uwe Schindler added a comment - +1
          Hide
          Robert Muir added a comment -

          updated patch:

          • dropped the range query/filter changes so we can refactor those separately.
          • removed the BytesRef.EMPTY, uwe pointed out this is very dangerous since we cant prevent someone from modifying it.
          • tried to optimize preflex codec, removing more conversions and removing scratchbytesref copying... i think more can be done here.
          • merged in uwe's latest multitermquery changes.
          Show
          Robert Muir added a comment - updated patch: dropped the range query/filter changes so we can refactor those separately. removed the BytesRef.EMPTY, uwe pointed out this is very dangerous since we cant prevent someone from modifying it. tried to optimize preflex codec, removing more conversions and removing scratchbytesref copying... i think more can be done here. merged in uwe's latest multitermquery changes.
          Hide
          Robert Muir added a comment -

          ok i added a couple more optimizations to preflex, to avoid creating so many strings.

          i still don't like all the conversions for dealing with terms: is the problem just the shared prefixes and surrogate dance? I wonder if we can do something tricky to avoid this.

          Show
          Robert Muir added a comment - ok i added a couple more optimizations to preflex, to avoid creating so many strings. i still don't like all the conversions for dealing with terms: is the problem just the shared prefixes and surrogate dance? I wonder if we can do something tricky to avoid this.
          Hide
          Uwe Schindler added a comment -

          Converted the rest of TestSurrogates (removed FieldAnText class, now direktly uses Term, as Term now supports BytesRef directly). Test respects verbose param.

          Show
          Uwe Schindler added a comment - Converted the rest of TestSurrogates (removed FieldAnText class, now direktly uses Term, as Term now supports BytesRef directly). Test respects verbose param.
          Hide
          Robert Muir added a comment -

          Thanks Uwe. I added some additional changes:

          • naive switch of MemoryIndex to byte terms (this one can use pagedbytes maybe too)
          • removed some more unnecessary byte->string in instantiated, uninvertedfield
          Show
          Robert Muir added a comment - Thanks Uwe. I added some additional changes: naive switch of MemoryIndex to byte terms (this one can use pagedbytes maybe too) removed some more unnecessary byte->string in instantiated, uninvertedfield
          Hide
          Robert Muir added a comment -

          the term vectors api should be in sync with Term, but still uses String.
          this patch switches them to bytesref too.

          Show
          Robert Muir added a comment - the term vectors api should be in sync with Term, but still uses String. this patch switches them to bytesref too.
          Hide
          Uwe Schindler added a comment -

          Updated patch with recent trunk commits (TestTV, TestSurrogates)

          Show
          Uwe Schindler added a comment - Updated patch with recent trunk commits (TestTV, TestSurrogates)
          Hide
          Uwe Schindler added a comment -

          We should commit this soon and solve the rest separately?

          Robert? Mike?

          Show
          Uwe Schindler added a comment - We should commit this soon and solve the rest separately? Robert? Mike?
          Hide
          Michael McCandless added a comment -

          +1

          Show
          Michael McCandless added a comment - +1
          Hide
          Robert Muir added a comment -

          I don't have a real computer for a few days, so take it if you want!

          Show
          Robert Muir added a comment - I don't have a real computer for a few days, so take it if you want!
          Hide
          Uwe Schindler added a comment -

          I take it and will commit it tomorrow.

          Show
          Uwe Schindler added a comment - I take it and will commit it tomorrow.
          Hide
          Uwe Schindler added a comment -

          Committed this patch revision: 960484

          I keep this open, as more improvements may be added (e.g. TermRangeQuery)

          Show
          Uwe Schindler added a comment - Committed this patch revision: 960484 I keep this open, as more improvements may be added (e.g. TermRangeQuery)
          Hide
          Robert Muir added a comment -

          With Term as byte, and tokenstreams can encode terms to byte however they want with TermToBytesRefAttribute, it makes sense for queryparsers to consume bytes like the indexer, and build terms without an intermediate String.

          This way non-unicode terms (e.g. collation) work as expected.

          This patch updates the queryparsers, except for contrib/queryparser (which will be more serious and cause API changes), and the range query building AnalyzingQueryParser (we need to fix TermRangeQuery first).

          All tests pass.

          Show
          Robert Muir added a comment - With Term as byte, and tokenstreams can encode terms to byte however they want with TermToBytesRefAttribute, it makes sense for queryparsers to consume bytes like the indexer, and build terms without an intermediate String. This way non-unicode terms (e.g. collation) work as expected. This patch updates the queryparsers, except for contrib/queryparser (which will be more serious and cause API changes), and the range query building AnalyzingQueryParser (we need to fix TermRangeQuery first). All tests pass.
          Hide
          Robert Muir added a comment -

          I'd like to commit this queryparser patch tomorrow if no one objects. Then I think we should look at range query, etc.

          Show
          Robert Muir added a comment - I'd like to commit this queryparser patch tomorrow if no one objects. Then I think we should look at range query, etc.
          Hide
          Michael McCandless added a comment -

          +1 to commit

          This would also mean the BOCU-1 encoding could be used drop-in w/ QueryParser for basic (Term, Phrase) queries right?

          Show
          Michael McCandless added a comment - +1 to commit This would also mean the BOCU-1 encoding could be used drop-in w/ QueryParser for basic (Term, Phrase) queries right?
          Hide
          Robert Muir added a comment -

          This would also mean the BOCU-1 encoding could be used drop-in w/ QueryParser for basic (Term, Phrase) queries right?

          Yes, they should then work (or there is a bug!)

          Show
          Robert Muir added a comment - This would also mean the BOCU-1 encoding could be used drop-in w/ QueryParser for basic (Term, Phrase) queries right? Yes, they should then work (or there is a bug!)
          Hide
          Robert Muir added a comment -

          Committed LUCENE-2514_qp.patch revision 966254

          Show
          Robert Muir added a comment - Committed LUCENE-2514 _qp.patch revision 966254
          Hide
          Robert Muir added a comment -

          In order to move forward with collation-keys-as-byte and other improvements, we need to fix TermRangeQuery.
          But this is difficult when the String-only Collation support exists mixed with the byte-order TermRangeQuery...

          As discussed previously on this issue, here is a patch that splits this into a separate CollatedTermRangeQuery/Filter

          Show
          Robert Muir added a comment - In order to move forward with collation-keys-as-byte and other improvements, we need to fix TermRangeQuery. But this is difficult when the String-only Collation support exists mixed with the byte-order TermRangeQuery... As discussed previously on this issue, here is a patch that splits this into a separate CollatedTermRangeQuery/Filter
          Hide
          Robert Muir added a comment -

          by the way, i was thinking it would be nice to really move this slow collatedtermrangequery stuff either out of lucene alltogether or at least into contrib/queries.

          we could make things even better by removing queryparser's get/setRangeCollator method.
          instead in its place, it could have something like a boolean 'analyzeRangeQueries' ?
          it could then analyze the endpoints (producing byte collation keys) and use a regular fast term range query.

          I think its good to support collation order for people who want it, but we should make it easy to do things the fast way,
          right now we make it easy to do things the slow way and hard to do it fast.

          Show
          Robert Muir added a comment - by the way, i was thinking it would be nice to really move this slow collatedtermrangequery stuff either out of lucene alltogether or at least into contrib/queries. we could make things even better by removing queryparser's get/setRangeCollator method. instead in its place, it could have something like a boolean 'analyzeRangeQueries' ? it could then analyze the endpoints (producing byte collation keys) and use a regular fast term range query. I think its good to support collation order for people who want it, but we should make it easy to do things the fast way, right now we make it easy to do things the slow way and hard to do it fast.
          Hide
          Michael McCandless added a comment -

          by the way, i was thinking it would be nice to really move this slow collatedtermrangequery stuff either out of lucene alltogether or at least into contrib/queries.

          +1

          I agree we have it backwards now. The "obvious" approach should be the performant one.

          Show
          Michael McCandless added a comment - by the way, i was thinking it would be nice to really move this slow collatedtermrangequery stuff either out of lucene alltogether or at least into contrib/queries. +1 I agree we have it backwards now. The "obvious" approach should be the performant one.
          Hide
          Uwe Schindler added a comment -

          by the way, i was thinking it would be nice to really move this slow collatedtermrangequery stuff either out of lucene alltogether or at least into contrib/queries.

          +1

          By the way, the problem BytesRef vs. String is not yet solved for core TRQ. I would prefer to do it like for NRQ/FCRF with static factory methods. Then its also consistent accross all RQ parts.

          Show
          Uwe Schindler added a comment - by the way, i was thinking it would be nice to really move this slow collatedtermrangequery stuff either out of lucene alltogether or at least into contrib/queries. +1 By the way, the problem BytesRef vs. String is not yet solved for core TRQ. I would prefer to do it like for NRQ/FCRF with static factory methods. Then its also consistent accross all RQ parts.
          Hide
          Steve Rowe added a comment -

          by the way, i was thinking it would be nice to really move this slow collatedtermrangequery stuff either out of lucene alltogether or at least into contrib/queries.

          +1

          Show
          Steve Rowe added a comment - by the way, i was thinking it would be nice to really move this slow collatedtermrangequery stuff either out of lucene alltogether or at least into contrib/queries. +1
          Hide
          Robert Muir added a comment -

          By the way, the problem BytesRef vs. String is not yet solved for core TRQ. I would prefer to do it like for NRQ/FCRF with static factory methods. Then its also consistent accross all RQ parts.

          Yes I know! I was leaving this for you, but if you have no time, I can take care of it.
          When that too is done, finally I think then I can commit LUCENE-2551 !

          Show
          Robert Muir added a comment - By the way, the problem BytesRef vs. String is not yet solved for core TRQ. I would prefer to do it like for NRQ/FCRF with static factory methods. Then its also consistent accross all RQ parts. Yes I know! I was leaving this for you, but if you have no time, I can take care of it. When that too is done, finally I think then I can commit LUCENE-2551 !
          Hide
          Robert Muir added a comment -

          just checkpointing progress, here's my latest patch.

          Here i moved the slow functionality (range,sort) out of core and into contrib/queries.
          So TermRangeQuery just does byte comparison, nothing fancy.
          Additionally TermRangeQuery's API is changed to be more like NumericRangeQuery's, with newStringRange and newByteRange.

          TODO:

          • QP's newRangeQuery args should be changed to BytesRef, and newRangeQuery should build ranges with newByteRange
          • contrib/qp needs a new attribute and some other work, and some of the other queryparsers need more changes too.
          • need to add tests (the ones i removed from core) for SlowCollatedRangeQuery and friends.
          Show
          Robert Muir added a comment - just checkpointing progress, here's my latest patch. Here i moved the slow functionality (range,sort) out of core and into contrib/queries. So TermRangeQuery just does byte comparison, nothing fancy. Additionally TermRangeQuery's API is changed to be more like NumericRangeQuery's, with newStringRange and newByteRange. TODO: QP's newRangeQuery args should be changed to BytesRef, and newRangeQuery should build ranges with newByteRange contrib/qp needs a new attribute and some other work, and some of the other queryparsers need more changes too. need to add tests (the ones i removed from core) for SlowCollatedRangeQuery and friends.
          Hide
          Michael McCandless added a comment -

          Patch looks good Robert!

          I like the TermRangeQuery/Filter.newStringRange static factory. You need to add the Slow prefix to the class names in MIGRATE.txt, and also mention the slow collated comparator has also moved.

          Show
          Michael McCandless added a comment - Patch looks good Robert! I like the TermRangeQuery/Filter.newStringRange static factory. You need to add the Slow prefix to the class names in MIGRATE.txt, and also mention the slow collated comparator has also moved.
          Hide
          Uwe Schindler added a comment - - edited

          Yah, the factory makes it easy for new users to create string ranges (as one exspects, like with NRQ, NRF, FCRF), but as the query itsself works on BytesRef, its ctor takes BytesRef. With the static factory, compile errors when you use null as bounds are not happening.

          Do we need a LessString() or should we completely remove the useless static factories named Less in the filter?

          Show
          Uwe Schindler added a comment - - edited Yah, the factory makes it easy for new users to create string ranges (as one exspects, like with NRQ, NRF, FCRF), but as the query itsself works on BytesRef, its ctor takes BytesRef. With the static factory, compile errors when you use null as bounds are not happening. Do we need a LessString() or should we completely remove the useless static factories named Less in the filter?
          Hide
          Robert Muir added a comment -

          I like the TermRangeQuery/Filter.newStringRange static factory. You need to add the Slow prefix to the class names in MIGRATE.txt, and also mention the slow collated comparator has also moved.

          Yes, i forgot this from my TODO list!

          or should we completely remove the useless static factories named Less in the filter?

          +1. I didnt see these used anywhere. If we aren't going to remove them, then we should at least deprecate IMO

          Show
          Robert Muir added a comment - I like the TermRangeQuery/Filter.newStringRange static factory. You need to add the Slow prefix to the class names in MIGRATE.txt, and also mention the slow collated comparator has also moved. Yes, i forgot this from my TODO list! or should we completely remove the useless static factories named Less in the filter? +1. I didnt see these used anywhere. If we aren't going to remove them, then we should at least deprecate IMO
          Hide
          Robert Muir added a comment -

          I brought the patch up to trunk, it was very out of date.
          Will try to make it committable soon!

          Show
          Robert Muir added a comment - I brought the patch up to trunk, it was very out of date. Will try to make it committable soon!
          Hide
          Uwe Schindler added a comment -

          Robert, can you commit your updates if its time to do it? You can assign to youself, if you like!

          Show
          Uwe Schindler added a comment - Robert, can you commit your updates if its time to do it? You can assign to youself, if you like!
          Hide
          Robert Muir added a comment -

          Uwe, thanks for the reminder. I will try to actually wrap this one up today... needs some tests etc.

          If you get a chance, can you take a quick glance at the Attributes-code in LUCENE-2551 ?
          I would like to resolve both issues together so that no functionality is lost.

          Show
          Robert Muir added a comment - Uwe, thanks for the reminder. I will try to actually wrap this one up today... needs some tests etc. If you get a chance, can you take a quick glance at the Attributes-code in LUCENE-2551 ? I would like to resolve both issues together so that no functionality is lost.
          Hide
          Robert Muir added a comment -

          attached is an updated patch. Its ready to go, I plan on committing soon.

          Unrelated: i wrote a nasty random test for the SlowCollation stuff, but it fails if you get an executor service (some unrelated synchronization thing). This is some existing problem, I'll open another issue for this.

          Show
          Robert Muir added a comment - attached is an updated patch. Its ready to go, I plan on committing soon. Unrelated: i wrote a nasty random test for the SlowCollation stuff, but it fails if you get an executor service (some unrelated synchronization thing). This is some existing problem, I'll open another issue for this.
          Hide
          Uwe Schindler added a comment -

          I think this patch is fine! Will go to bed now, I just wanted you to know that I am really fine with it! Great work!

          Show
          Uwe Schindler added a comment - I think this patch is fine! Will go to bed now, I just wanted you to know that I am really fine with it! Great work!
          Hide
          Robert Muir added a comment -

          Committed revision 1075210.

          Show
          Robert Muir added a comment - Committed revision 1075210.

            People

            • Assignee:
              Robert Muir
              Reporter:
              Robert Muir
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development