Lucene - Core
  1. Lucene - Core
  2. LUCENE-4511

TermsFilter might return wrong results if a field is not indexed or not present in the index

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0, 4.1, 6.0
    • Fix Version/s: 4.1, 6.0
    • Component/s: modules/other
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      TermsFilter returns if a term returns null from AIR#terms(term) while it should just continue. I will upload a test & fix shortly

      1. LUCENE-4511.patch
        28 kB
        Simon Willnauer
      2. LUCENE-4511.patch
        27 kB
        Simon Willnauer
      3. LUCENE-4511.patch
        17 kB
        Simon Willnauer
      4. LUCENE-4511.patch
        16 kB
        Simon Willnauer
      5. LUCENE-4511.patch
        3 kB
        Simon Willnauer

        Issue Links

          Activity

          Hide
          Simon Willnauer added a comment -

          here is a patch & testcase

          Show
          Simon Willnauer added a comment - here is a patch & testcase
          Hide
          Michael McCandless added a comment -

          Nice catch!

          Hmm should we set lastField = field (and termsEnum = null) before continue (so we don't keep calling fields.terms() on the non-existent field), and then change that bogus if to check if termsEnum != null?

          Show
          Michael McCandless added a comment - Nice catch! Hmm should we set lastField = field (and termsEnum = null) before continue (so we don't keep calling fields.terms() on the non-existent field), and then change that bogus if to check if termsEnum != null?
          Hide
          Simon Willnauer added a comment -

          here is a new patch basically rewriting the TermsFilter. I switch to a Term array rather than a treemap and made the filter immutable. I really don't like if you can cache this filter somehow and its mutable. I also moved the most of the work (figuring out when to reset the terms enum and when a field changes to the ctor rather than doing it for each segment. I also added some more tests

          Show
          Simon Willnauer added a comment - here is a new patch basically rewriting the TermsFilter. I switch to a Term array rather than a treemap and made the filter immutable. I really don't like if you can cache this filter somehow and its mutable. I also moved the most of the work (figuring out when to reset the terms enum and when a field changes to the ctor rather than doing it for each segment. I also added some more tests
          Hide
          Michael McCandless added a comment -

          Wow, this looks good! We could also make an outer array w/ one entry (holding field name & array of terms I guess) per field, instead of the array of booleans marking the transition.

          Hmm, but, you are calling terms.iterator once per term in each field? Can we call that only once per field instead?

          At some point/density it may be worth union-ing the terms into an A and using Terms.intersect ... we've talked about doing that before ... but we should do that separately.

          Show
          Michael McCandless added a comment - Wow, this looks good! We could also make an outer array w/ one entry (holding field name & array of terms I guess) per field, instead of the array of booleans marking the transition. Hmm, but, you are calling terms.iterator once per term in each field? Can we call that only once per field instead? At some point/density it may be worth union-ing the terms into an A and using Terms.intersect ... we've talked about doing that before ... but we should do that separately.
          Hide
          Simon Willnauer added a comment -

          new patch - moving the terms.iterator call into the block that checks if the field changed. I also remove copying the bytes which doesn't make sense here really.

          I don't think we should keep the fields around because then comparison always needs the previous and that makes the loop more complex.

          I also added a comment regarding the automaton idea.

          this is ready to go IMO.

          Show
          Simon Willnauer added a comment - new patch - moving the terms.iterator call into the block that checks if the field changed. I also remove copying the bytes which doesn't make sense here really. I don't think we should keep the fields around because then comparison always needs the previous and that makes the loop more complex. I also added a comment regarding the automaton idea. this is ready to go IMO.
          Hide
          Simon Willnauer added a comment -

          here is a new patch making this entire thing a bit more cache friendly but more complex. I changed the internal representation to serialized bytes and parallel arrays optimized for single field multiple terms case. This filters i often used with a large number of terms in a single field and usually cached for best performance. I also changed the class to be final and added a lot more tests.

          Show
          Simon Willnauer added a comment - here is a new patch making this entire thing a bit more cache friendly but more complex. I changed the internal representation to serialized bytes and parallel arrays optimized for single field multiple terms case. This filters i often used with a large number of terms in a single field and usually cached for best performance. I also changed the class to be final and added a lot more tests.
          Hide
          Simon Willnauer added a comment -

          if nobody objects I will commit the current patch tomorrow.

          Show
          Simon Willnauer added a comment - if nobody objects I will commit the current patch tomorrow.
          Hide
          Michael McCandless added a comment -

          Do we need to check for the "no terms provided" case (throw IAE)? Else we seem to make a TermsAndField w/ null field? Or is that harmless (matches no docs)...? Maybe we need a test for it ...

          I think the ArrayUtil.grow can be a < not a <=? Should we shrink the byte[] down in the end?

          Can we rename .terms -> .termBytes?

          Typo: "don't use case we could pollute the cache here easily" --> "don't use cache since we could pollute the cache here easily"

          Typo?: "no freq if we don't need them" -> "no freq since we don't need them"

          Maybe equals should also compare the hashCode first (since we compute/cache it up front)?

          Should currentTermsAndField be renamed to lastTermsAndField? It's always the last "completed" field right?

          Hmm I suddenly realized: I think this code is doing the same thing that FrozenBufferedDeletes does (see PrefixCodedTerms ... which takes even fewer bytes since it shares prefixes). Maybe we should just use that?

          Show
          Michael McCandless added a comment - Do we need to check for the "no terms provided" case (throw IAE)? Else we seem to make a TermsAndField w/ null field? Or is that harmless (matches no docs)...? Maybe we need a test for it ... I think the ArrayUtil.grow can be a < not a <=? Should we shrink the byte[] down in the end? Can we rename .terms -> .termBytes? Typo: "don't use case we could pollute the cache here easily" --> "don't use cache since we could pollute the cache here easily" Typo?: "no freq if we don't need them" -> "no freq since we don't need them" Maybe equals should also compare the hashCode first (since we compute/cache it up front)? Should currentTermsAndField be renamed to lastTermsAndField? It's always the last "completed" field right? Hmm I suddenly realized: I think this code is doing the same thing that FrozenBufferedDeletes does (see PrefixCodedTerms ... which takes even fewer bytes since it shares prefixes). Maybe we should just use that?
          Hide
          Chris Male added a comment -

          +1 to these improvements.

          Another typo: "to optimize for this case and to be fitler-cache friendly we " -> "filter-cache"

          Show
          Chris Male added a comment - +1 to these improvements. Another typo: "to optimize for this case and to be fitler-cache friendly we " -> "filter-cache"
          Hide
          Simon Willnauer added a comment -

          here is a new patch fixing the issues reported by mike & chris! thanks for reviewing.

          Regarding PrefixCodedTerms I don't think this buys us much here since usecases are not likely to share prefixes? For IDs (delete terms) this is more likely though).

          I think its ready, I will commit later today or tomorrow if nobody objects.

          Show
          Simon Willnauer added a comment - here is a new patch fixing the issues reported by mike & chris! thanks for reviewing. Regarding PrefixCodedTerms I don't think this buys us much here since usecases are not likely to share prefixes? For IDs (delete terms) this is more likely though). I think its ready, I will commit later today or tomorrow if nobody objects.
          Hide
          Michael McCandless added a comment -

          Regarding PrefixCodedTerms I don't think this buys us much here since usecases are not likely to share prefixes?

          Well I suspect TermsFilter is often used with many terms, at which
          point prefix coding will usually reduce memory required.

          Do you have a sense of how many terms typical ElasticSearch usage
          uses? Seems like it must be highish since we're compacting terms into
          single byte[] in the first place.

          It would also be nice to reusing existing same code instead of
          inventing yet another way to pack terms into bytes (hrm:
          FieldCache/DocValues is yet another place where we do this).

          But I agree we don't need to improve that now ... we can refactor
          later ... progress not perfection.

          Hmm maybe add an explicit test for the "no terms provided" case?
          (Maybe I missed it ...). Also: maybe this should not be IAE but
          rather just return a filter accepting nothing? (I think this is
          what current one does today). Ie, just don't add lastTermsAndField
          if previousField is null in the ctor).

          Otherwise +1 to the new patch. Thanks Simon!

          Show
          Michael McCandless added a comment - Regarding PrefixCodedTerms I don't think this buys us much here since usecases are not likely to share prefixes? Well I suspect TermsFilter is often used with many terms, at which point prefix coding will usually reduce memory required. Do you have a sense of how many terms typical ElasticSearch usage uses? Seems like it must be highish since we're compacting terms into single byte[] in the first place. It would also be nice to reusing existing same code instead of inventing yet another way to pack terms into bytes (hrm: FieldCache/DocValues is yet another place where we do this). But I agree we don't need to improve that now ... we can refactor later ... progress not perfection. Hmm maybe add an explicit test for the "no terms provided" case? (Maybe I missed it ...). Also: maybe this should not be IAE but rather just return a filter accepting nothing? (I think this is what current one does today). Ie, just don't add lastTermsAndField if previousField is null in the ctor). Otherwise +1 to the new patch. Thanks Simon!
          Hide
          Simon Willnauer added a comment -

          Well I suspect TermsFilter is often used with many terms, at which

          point prefix coding will usually reduce memory required.

          the main point here is reducing # of objects really. In lucene we often focus on reducing the memory footprint but even if we don't safe much here we are still friendly in terms of GC which is my main concern. so that is also why I don't care too much about the prefix coded stuff. Yet we should consolidate this. I will open another issue.

          Hmm maybe add an explicit test for the "no terms provided" case?

          I will add one before I commit. I don't think we should be smart here. Its likely a bug if nothing is provided.

          Show
          Simon Willnauer added a comment - Well I suspect TermsFilter is often used with many terms, at which point prefix coding will usually reduce memory required. the main point here is reducing # of objects really. In lucene we often focus on reducing the memory footprint but even if we don't safe much here we are still friendly in terms of GC which is my main concern. so that is also why I don't care too much about the prefix coded stuff. Yet we should consolidate this. I will open another issue. Hmm maybe add an explicit test for the "no terms provided" case? I will add one before I commit. I don't think we should be smart here. Its likely a bug if nothing is provided.
          Hide
          Uwe Schindler added a comment -

          Wasn't there another patch available that uses AutomatonTermsEnum with MTQ to provide the same functionality? The Automaton was this Dahizuk-Mihov-thingie. Maybe we can make a MultiTermQuery out of this one (the Filter is then incuded by the rewrite mode, too)?

          Show
          Uwe Schindler added a comment - Wasn't there another patch available that uses AutomatonTermsEnum with MTQ to provide the same functionality? The Automaton was this Dahizuk-Mihov-thingie. Maybe we can make a MultiTermQuery out of this one (the Filter is then incuded by the rewrite mode, too)?
          Hide
          Simon Willnauer added a comment -

          Ah uwe, too late. I already hit enter on the backport. you are referring to LUCENE-3893 I think this is useful lets see where it goes.

          Show
          Simon Willnauer added a comment - Ah uwe, too late. I already hit enter on the backport. you are referring to LUCENE-3893 I think this is useful lets see where it goes.
          Hide
          Uwe Schindler added a comment -

          It is not a problem, I was on vacation, so I only followed the mailing list on my mobile phone... We should in all cases port this automaton from the tests into core/query-module. I think the main issue is: Some tests in corre use it, but maybe we can move those tests to the module. Or we move TermsFilter to core...

          Show
          Uwe Schindler added a comment - It is not a problem, I was on vacation, so I only followed the mailing list on my mobile phone... We should in all cases port this automaton from the tests into core/query-module. I think the main issue is: Some tests in corre use it, but maybe we can move those tests to the module. Or we move TermsFilter to core...
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Simon Willnauer
          http://svn.apache.org/viewvc?view=revision&revision=1404132

          LUCENE-4511: TermsFilter might return wrong results if a field is not indexed or not present in the index

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Simon Willnauer http://svn.apache.org/viewvc?view=revision&revision=1404132 LUCENE-4511 : TermsFilter might return wrong results if a field is not indexed or not present in the index

            People

            • Assignee:
              Simon Willnauer
              Reporter:
              Simon Willnauer
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development