Lucene - Core
  1. Lucene - Core
  2. LUCENE-1603

Changes for TrieRange in FilteredTermEnum and MultiTermQuery improvement

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 2.4, 2.9
    • Fix Version/s: 2.9
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      This is a patch, that is needed for the MultiTermQuery-rewrite of TrieRange (LUCENE-1602):

      • Make the private members protected, to have access to them from the very special TrieRangeTermEnum
      • Fix a small inconsistency (docFreq() now only returns a value, if a valid term is existing)
      • Improvement of MultiTermFilter.getDocIdSet to return DocIdSet.EMPTY_DOCIDSET, if the TermEnum is empty (less memory usage) and faster.
      • Add the getLastNumberOfTerms() to MultiTermQuery for statistics on different multi term queries and how may terms they affect, using this new functionality, the improvement of TrieRange can be shown (extract from test case there, 10000 docs index, long values):
        [junit] Average number of terms during random search on 'field8':
        [junit]  Trie query: 244.2
        [junit]  Classical query: 3136.94
        [junit] Average number of terms during random search on 'field4':
        [junit]  Trie query: 38.3
        [junit]  Classical query: 3018.68
        [junit] Average number of terms during random search on 'field2':
        [junit]  Trie query: 18.04
        [junit]  Classical query: 3539.42
        

      All core tests pass.

      1. LUCENE-1603.patch
        16 kB
        Uwe Schindler
      2. LUCENE-1603.patch
        7 kB
        Uwe Schindler
      3. LUCENE-1603.patch
        7 kB
        Uwe Schindler

        Issue Links

          Activity

          Hide
          Uwe Schindler added a comment -

          The patch is here. Maybe the javadocs of getLastNumberOfTerms could be improved...

          Show
          Uwe Schindler added a comment - The patch is here. Maybe the javadocs of getLastNumberOfTerms could be improved...
          Hide
          Michael McCandless added a comment -

          Patch looks good. Should we allow lastNumberOfTerms to be the sum of all invocations? (Instead of clearing it per segment)? And maybe add a resetLastNumberOfTerms, in case one wants to re-use a MultiTermQuery and recheck that count.

          Show
          Michael McCandless added a comment - Patch looks good. Should we allow lastNumberOfTerms to be the sum of all invocations? (Instead of clearing it per segment)? And maybe add a resetLastNumberOfTerms, in case one wants to re-use a MultiTermQuery and recheck that count.
          Hide
          Uwe Schindler added a comment -

          This is also another idea. The sum is interesting, too, but not so helpful, if the segments share the same terms. But for the number of seeks in TermDocs its interesting. Even if it sums up, the numbers in the Trie tests are the same, because index is optimized. So the note could be: For real numbers to analyze your query, use a optimized index, for real seek counts on a complex index/query structure, do not optimize.

          By the way: Are queries also rewritten per segment with the new Searchers? If not, one could use the BooleanQuery variant, if he wants to have real term numbers on unoptimized index.

          I am +/- 0 about the change. In case of a change, the method should be called getCurrentNumberOfTerms() or something like that – naming is the hardest one.

          Show
          Uwe Schindler added a comment - This is also another idea. The sum is interesting, too, but not so helpful, if the segments share the same terms. But for the number of seeks in TermDocs its interesting. Even if it sums up, the numbers in the Trie tests are the same, because index is optimized. So the note could be: For real numbers to analyze your query, use a optimized index, for real seek counts on a complex index/query structure, do not optimize. By the way: Are queries also rewritten per segment with the new Searchers? If not, one could use the BooleanQuery variant, if he wants to have real term numbers on unoptimized index. I am +/- 0 about the change. In case of a change, the method should be called getCurrentNumberOfTerms() or something like that – naming is the hardest one.
          Hide
          Michael McCandless added a comment -

          I was thinking that this count is a good way to measure how much net work was done, hence the switch to sum. EG you could compare that count vs the count you get after having optimized the index to get a sense of how much you gained by optimizing.

          Whereas now, with the count only showing the # terms from the last segment searched, is not really useful at all.

          Are queries also rewritten per segment with the new Searchers? If not, one could use the BooleanQuery variant, if he wants to have real term numbers on unoptimized index.

          They are rewritten at the MultiReader level, so you're right one could use that to get "number of unique terms" vs "amount of work (seeks) done".

          If we do change it, ow about "get/clearTotalNumberOfTerms()"?

          Show
          Michael McCandless added a comment - I was thinking that this count is a good way to measure how much net work was done, hence the switch to sum. EG you could compare that count vs the count you get after having optimized the index to get a sense of how much you gained by optimizing. Whereas now, with the count only showing the # terms from the last segment searched, is not really useful at all. Are queries also rewritten per segment with the new Searchers? If not, one could use the BooleanQuery variant, if he wants to have real term numbers on unoptimized index. They are rewritten at the MultiReader level, so you're right one could use that to get "number of unique terms" vs "amount of work (seeks) done". If we do change it, ow about "get/clearTotalNumberOfTerms()"?
          Hide
          Uwe Schindler added a comment -

          Maybe declare the term number as transient, it should not be serialized...

          Show
          Uwe Schindler added a comment - Maybe declare the term number as transient, it should not be serialized...
          Hide
          Uwe Schindler added a comment -

          Here the patch with the suggested changes.
          TrieRange test changes and so on come later.

          Show
          Uwe Schindler added a comment - Here the patch with the suggested changes. TrieRange test changes and so on come later.
          Hide
          Michael McCandless added a comment -

          Looks good Uwe, thanks. I plan to commit shortly. I changed one sentence in the javadoc to:

          * If you re-use the same query instance for another
          * search, be sure to first reset the term counter
          * with {@link #clearTotalNumberOfTerms}.
          

          (It was missing the "reuse" part, making it sound like you always must call clear first).

          Show
          Michael McCandless added a comment - Looks good Uwe, thanks. I plan to commit shortly. I changed one sentence in the javadoc to: * If you re-use the same query instance for another * search, be sure to first reset the term counter * with {@link #clearTotalNumberOfTerms}. (It was missing the "reuse" part, making it sound like you always must call clear first).
          Hide
          Michael McCandless added a comment -

          Thanks Uwe!

          Show
          Michael McCandless added a comment - Thanks Uwe!
          Hide
          Uwe Schindler added a comment -

          Just for completeness: Should we add the number of terms methods also to the *Filter counterparts of the MultiTermQueries (as the new methods are only automatically appear in subclasses, but not in related pass-to-query-only classes)?
          In trie-range I have these pass-to-query methods.

          Show
          Uwe Schindler added a comment - Just for completeness: Should we add the number of terms methods also to the *Filter counterparts of the MultiTermQueries (as the new methods are only automatically appear in subclasses, but not in related pass-to-query-only classes)? In trie-range I have these pass-to-query methods.
          Hide
          Uwe Schindler added a comment -

          An idea how to achive this, would be to make the *Filter counterparts subclasses of a new superclass MutiTermFilter, that just passes all methods to the corresponding query. This would make the filter classes simplier and one must only add the new methods to this superclass not to every filter class manually.
          Should I add a patch?

          Show
          Uwe Schindler added a comment - An idea how to achive this, would be to make the *Filter counterparts subclasses of a new superclass MutiTermFilter, that just passes all methods to the corresponding query. This would make the filter classes simplier and one must only add the new methods to this superclass not to every filter class manually. Should I add a patch?
          Hide
          Michael McCandless added a comment -

          An idea how to achive this, would be to make the *Filter counterparts subclasses of a new superclass MutiTermFilter, that just passes all methods to the corresponding query.

          That'd be great – can you reopen this & attach patch?

          Show
          Michael McCandless added a comment - An idea how to achive this, would be to make the *Filter counterparts subclasses of a new superclass MutiTermFilter, that just passes all methods to the corresponding query. That'd be great – can you reopen this & attach patch?
          Hide
          Uwe Schindler added a comment -

          To also expose getTotalTermCount to the Filter counterparts, I provide a new supercalss for those filters MultiTermQueryWrapperFilter (name to be discussed).

          Show
          Uwe Schindler added a comment - To also expose getTotalTermCount to the Filter counterparts, I provide a new supercalss for those filters MultiTermQueryWrapperFilter (name to be discussed).
          Hide
          Uwe Schindler added a comment -

          Patch that provides that filter.

          Mike: Can you please test drop-in-backwards-compatibility with this patch applied, with Windows, the checkout through Ant does not work correctly? I also set the native line ending svn property of the new file in the patch.

          The update of the TrieRange follows after this is committed, code not affected, compiles still fine.

          Show
          Uwe Schindler added a comment - Patch that provides that filter. Mike: Can you please test drop-in-backwards-compatibility with this patch applied, with Windows, the checkout through Ant does not work correctly? I also set the native line ending svn property of the new file in the patch. The update of the TrieRange follows after this is committed, code not affected, compiles still fine.
          Hide
          Michael McCandless added a comment -

          Patch looks good – I'll commit shortly. Thanks Uwe!

          Show
          Michael McCandless added a comment - Patch looks good – I'll commit shortly. Thanks Uwe!
          Hide
          Uwe Schindler added a comment -

          Do you think the name is good? MultiTermQueryWrapperFilter or simplier MultiTermFilter? Its not really one of both, its a mix between wrapper and the real filter: It wraps the query, but does the getDocIdSet and TermEnums himself.

          Show
          Uwe Schindler added a comment - Do you think the name is good? MultiTermQueryWrapperFilter or simplier MultiTermFilter? Its not really one of both, its a mix between wrapper and the real filter: It wraps the query, but does the getDocIdSet and TermEnums himself.
          Hide
          Michael McCandless added a comment -

          I think the name is good, so it's clear you have to provide a MultiTermQuery yourself (via subclass) to use it.

          Show
          Michael McCandless added a comment - I think the name is good, so it's clear you have to provide a MultiTermQuery yourself (via subclass) to use it.

            People

            • Assignee:
              Michael McCandless
              Reporter:
              Uwe Schindler
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development