Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 5.1
    • Fix Version/s: 5.3, 6.0
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This patch factors out common indexing/search API used by the recently introduced NRTSuggester
      The motivation is to provide a query interface for FST-based fields (SuggestField and ContextSuggestField)
      to enable suggestion scoring and more powerful automaton queries.

      Previously, only prefix ‘queries’ with index-time weights were supported but we can also support:

      • Prefix queries expressed as regular expressions: get suggestions that match multiple prefixes
        • Example: star[wa|tr] matches starwars and startrek
      • Fuzzy Prefix queries supporting scoring: get typo tolerant suggestions scored by how close they are to the query prefix
        • Example: querying for seper will score separate higher then superstitious
      • Context Queries: get suggestions boosted and/or filtered based on their indexed contexts (meta data)
        • Boost example: get typo tolerant suggestions on song names with prefix like a roling boosting songs with
          genre rock and indie
        • Filter example: get suggestion on all file names starting with finan only for user1 and user2

      Suggest API

      SuggestIndexSearcher searcher = new SuggestIndexSearcher(reader);
      CompletionQuery query = ...
      TopSuggestDocs suggest = searcher.suggest(query, num);
      

      CompletionQuery

      CompletionQuery is used to query SuggestField and ContextSuggestField. A CompletionQuery produces a CompletionWeight,
      which allows CompletionQuery implementations to pass in an automaton that will be intersected with a FST and allows boosting and
      meta data extraction from the intersected partial paths. A CompletionWeight produces a CompletionScorer. A CompletionScorer
      executes a Top N search against the FST with the provided automaton, scoring and filtering all matched paths.

      PrefixCompletionQuery

      Return documents with values that match the prefix of an analyzed term text
      Documents are sorted according to their suggest field weight.

      PrefixCompletionQuery(Analyzer analyzer, Term term)
      

      RegexCompletionQuery

      Return documents with values that match the prefix of a regular expression
      Documents are sorted according to their suggest field weight.

      RegexCompletionQuery(Term term)
      

      FuzzyCompletionQuery

      Return documents with values that has prefixes within a specified edit distance of an analyzed term text.
      Documents are ‘boosted’ by the number of matching prefix letters of the suggestion with respect to the original term text.

      FuzzyCompletionQuery(Analyzer analyzer, Term term)
      
      Scoring

      suggestion_weight * boost
      where suggestion_weight and boost are all integers.
      boost = # of prefix characters matched

      ContextQuery

      Return documents that match a CompletionQuery filtered and/or boosted by provided context(s).

      ContextQuery(CompletionQuery query)
      contextQuery.addContext(CharSequence context, int boost, boolean exact)
      

      NOTE: ContextQuery should be used with ContextSuggestField to query suggestions boosted and/or filtered by contexts.
      Running ContextQuery against a SuggestField will error out.

      Scoring

      suggestion_weight * context_boost
      where suggestion_weight and context_boost are all integers

      When used with FuzzyCompletionQuery,
      suggestion_weight * (context_boost + fuzzy_boost)

      Context Suggest Field

      To use ContextQuery, use ContextSuggestField instead of SuggestField. Any CompletionQuery can be used with
      ContextSuggestField, the default behaviour is to return suggestions from all contexts. Context for every completion hit
      can be accessed through SuggestScoreDoc#context.

      ContextSuggestField(String name, Collection<CharSequence> contexts, String value, int weight) 
      
      1. LUCENE-6459.patch
        227 kB
        Areek Zillur
      2. LUCENE-6459.patch
        227 kB
        Areek Zillur
      3. LUCENE-6459.patch
        227 kB
        Areek Zillur
      4. LUCENE-6459.patch
        227 kB
        Areek Zillur
      5. LUCENE-6459.patch
        241 kB
        Areek Zillur
      6. LUCENE-6459.patch
        241 kB
        Areek Zillur
      7. LUCENE-6459.patch
        172 kB
        Areek Zillur
      8. LUCENE-6459.patch
        171 kB
        Areek Zillur
      9. LUCENE-6459.patch
        170 kB
        Areek Zillur
      10. LUCENE-6459.patch
        168 kB
        Areek Zillur
      11. LUCENE-6459.patch
        167 kB
        Areek Zillur
      12. LUCENE-6459.patch
        166 kB
        Areek Zillur
      13. LUCENE-6459.patch
        161 kB
        Areek Zillur
      14. LUCENE-6459.patch
        154 kB
        Areek Zillur

        Activity

        Hide
        Areek Zillur added a comment -

        Initial patch. It would be awesome to get some feedback on this!

        Show
        Areek Zillur added a comment - Initial patch. It would be awesome to get some feedback on this!
        Hide
        Areek Zillur added a comment -

        Updated Patch:

        • added documentation
        • minor cleanup
        • added more tests
        Show
        Areek Zillur added a comment - Updated Patch: added documentation minor cleanup added more tests
        Hide
        Areek Zillur added a comment -

        Updated Patch:

        • minor cleanup
        • added tests
        Show
        Areek Zillur added a comment - Updated Patch: minor cleanup added tests
        Hide
        Michael McCandless added a comment -

        Thanks Areek Zillur, this is a big patch! I'm trying to wrap my brain around it... it will
        take some iterations

        The filled-out javadocs are nice.

        It seems like the overall idea is to make a generic index-time and
        search-time API that other suggesters could use, but for now it's just
        NRTSuggester using it? Do we expect the non-document based suggesters
        to also eventually be able to use this API?

        One thing we are struggling with on AnalyzingInfixSuggester is how to
        allow arbitrary context filters (LUCENE-6464) ... I wonder if this API
        would make that easier.

        This patch also adds new capabilities to NRTSuggester, like fuzzy and
        regexp suggestions? What other new functions are exposed? What
        use-cases do you see for RegexCompletionQuery?

        Should FSTPath.toString also include the context?

        The search side largely mirrors/subclasses Lucene's normal search
        classes (CompletionQuery, CompletionWeight, SuggestIndexSearcher,
        etc.) but there are some differences, e.g. we pass an Analyzer to the
        completion queries (so they can build the automaton).

        If you try to use ContextQuery against a field that you had not
        indexed contexts with (using ContextSuggestField) do you see any
        error? Maybe this is too hard.

        Are you allowed to mix ContextSuggestField and SuggestField even for
        the same field name, within one suggester?

        Show
        Michael McCandless added a comment - Thanks Areek Zillur , this is a big patch! I'm trying to wrap my brain around it... it will take some iterations The filled-out javadocs are nice. It seems like the overall idea is to make a generic index-time and search-time API that other suggesters could use, but for now it's just NRTSuggester using it? Do we expect the non-document based suggesters to also eventually be able to use this API? One thing we are struggling with on AnalyzingInfixSuggester is how to allow arbitrary context filters ( LUCENE-6464 ) ... I wonder if this API would make that easier. This patch also adds new capabilities to NRTSuggester, like fuzzy and regexp suggestions? What other new functions are exposed? What use-cases do you see for RegexCompletionQuery? Should FSTPath.toString also include the context? The search side largely mirrors/subclasses Lucene's normal search classes (CompletionQuery, CompletionWeight, SuggestIndexSearcher, etc.) but there are some differences, e.g. we pass an Analyzer to the completion queries (so they can build the automaton). If you try to use ContextQuery against a field that you had not indexed contexts with (using ContextSuggestField) do you see any error? Maybe this is too hard. Are you allowed to mix ContextSuggestField and SuggestField even for the same field name, within one suggester?
        Hide
        Areek Zillur added a comment - - edited

        Thanks Michael McCandless for taking a look!

        It seems like the overall idea is to make a generic index-time and
        search-time API that other suggesters could use, but for now it's just
        NRTSuggester using it? Do we expect the non-document based suggesters
        to also eventually be able to use this API?

        At the moment, only NRTSuggester is using it. We should use this API for other suggesters,
        but maybe in a separate issue . For LUCENE-6464, the existing ContextQuery needs
        to add support for BooleanClause.Occur.

        This patch also adds new capabilities to NRTSuggester, like fuzzy and
        regexp suggestions? What other new functions are exposed? What
        use-cases do you see for RegexCompletionQuery?

        In terms of new functionality, fuzzy, regex and context queries are added. One thing to
        note, for fuzzy and context queries, the suggestion scores are influenced by their common
        prefix length (w.r.t. query term) and query-time context boosts respectively,
        along with index-time suggestion weights. IMO, the way a suggestion is scored currently
        needs some thought. As we only allow integer index-time weights, one possibility would
        be to use index-weight + (Int.MAX * boost) instead of using MaxWeight of suggestions?

        IMO RegexCompletionQuery can be used to query multiple prefixes at one go. This can allow
        for simpler query analyzer, but still give the power to query for synonyms, domain-specific typos etc.
        In the future, we can also add boosting (like in ContextQuery) where query-time boosts can be
        specified for some matched prefixes of the regex pattern.

        Should FSTPath.toString also include the context?

        It should, will change.

        but there are some differences, e.g. we pass an Analyzer to the
        completion queries (so they can build the automaton)

        open to suggestions on improving this

        If you try to use ContextQuery against a field that you had not
        indexed contexts with (using ContextSuggestField) do you see any
        error? Maybe this is too hard.

        There should not be any error. A ContextQuery will never be run on a SuggestField,
        CompletionQuery rewrites appropriately given the type of the field (context-enabled or not).
        This also makes non-context queries work as expected when run against ContextSuggestField
        (as in the query is wrapped as a ContextQuery with no context filtering/boosting).

        If a ContextSuggestField is indexed with no context, then a null context is extracted at query
        time for the entry. Fields with no context will only be returned, if a wildcard context '*' is
        specified (default behaviour of ContextQuery).

        Are you allowed to mix ContextSuggestField and SuggestField even for
        the same field name, within one suggester?

        No you are not. If mixed, CompletionFieldsConsumer will throw IllegalArgumentException
        upon indexing.

        Show
        Areek Zillur added a comment - - edited Thanks Michael McCandless for taking a look! It seems like the overall idea is to make a generic index-time and search-time API that other suggesters could use, but for now it's just NRTSuggester using it? Do we expect the non-document based suggesters to also eventually be able to use this API? At the moment, only NRTSuggester is using it. We should use this API for other suggesters, but maybe in a separate issue . For LUCENE-6464 , the existing ContextQuery needs to add support for BooleanClause.Occur . This patch also adds new capabilities to NRTSuggester, like fuzzy and regexp suggestions? What other new functions are exposed? What use-cases do you see for RegexCompletionQuery? In terms of new functionality, fuzzy, regex and context queries are added. One thing to note, for fuzzy and context queries, the suggestion scores are influenced by their common prefix length (w.r.t. query term) and query-time context boosts respectively, along with index-time suggestion weights. IMO, the way a suggestion is scored currently needs some thought. As we only allow integer index-time weights, one possibility would be to use index-weight + (Int.MAX * boost) instead of using MaxWeight of suggestions? IMO RegexCompletionQuery can be used to query multiple prefixes at one go. This can allow for simpler query analyzer, but still give the power to query for synonyms, domain-specific typos etc. In the future, we can also add boosting (like in ContextQuery) where query-time boosts can be specified for some matched prefixes of the regex pattern. Should FSTPath.toString also include the context? It should, will change. but there are some differences, e.g. we pass an Analyzer to the completion queries (so they can build the automaton) open to suggestions on improving this If you try to use ContextQuery against a field that you had not indexed contexts with (using ContextSuggestField) do you see any error? Maybe this is too hard. There should not be any error. A ContextQuery will never be run on a SuggestField, CompletionQuery rewrites appropriately given the type of the field (context-enabled or not). This also makes non-context queries work as expected when run against ContextSuggestField (as in the query is wrapped as a ContextQuery with no context filtering/boosting). If a ContextSuggestField is indexed with no context, then a null context is extracted at query time for the entry. Fields with no context will only be returned, if a wildcard context '*' is specified (default behaviour of ContextQuery). Are you allowed to mix ContextSuggestField and SuggestField even for the same field name, within one suggester? No you are not. If mixed, CompletionFieldsConsumer will throw IllegalArgumentException upon indexing.
        Hide
        Areek Zillur added a comment -

        Updated patch:

        • added test querying empty contexts with boosts
        • add context to FSTPath.toString
        Show
        Areek Zillur added a comment - Updated patch: added test querying empty contexts with boosts add context to FSTPath.toString
        Hide
        Areek Zillur added a comment -

        Added test for mixing field types for same field.

        Show
        Areek Zillur added a comment - Added test for mixing field types for same field.
        Hide
        Areek Zillur added a comment -

        Updated Patch:

        • don't allow zero length suggestion values
        • improve docs
        • added tests
        Show
        Areek Zillur added a comment - Updated Patch: don't allow zero length suggestion values improve docs added tests
        Hide
        Michael McCandless added a comment -

        We should use this API for other suggesters, but maybe in a separate issue

        Yes, definitely separate!

        In terms of new functionality, fuzzy, regex and context queries are added.

        OK, cool, these are nice additions.

        query-time context boosts

        Cool: so you boost some contexts more than others, using ContextQuery.

        As we only allow integer index-time weights

        I thought we accept long (not int) as index-time weight? (But, I
        think that really is overkill... maybe they should just floats, like
        per-field boosting at index time). But we can worry about this
        later...

        one possibility would be to use index-weight + (Int.MAX * boost) instead of using MaxWeight of suggestions

        Sorry I don't understand the idea here?

        If you try to use ContextQuery against a field that you had not indexed contexts with (using ContextSuggestField) do you see any error? Maybe this is too hard.

        There should not be any error. A ContextQuery will never be run on a SuggestField

        It seems like we could detect this mis-use, since CompletionTerms seems to know whether the field was indexed with contexts or not? I.e, if I accidentally try to run a ContextQuery against a field indexed with only SuggestField, it seems like I should get an exception saying I screwed up ... (similar to trying to run a PhraseQuery on a field that did not index positions)? Maybe add a simple test case?

        A ContextQuery will never be run on a SuggestField,
        CompletionQuery rewrites appropriately given the type of the field (context-enabled or not).

        OK maybe at that rewrite is the time to throw the exc?

        This also makes non-context queries work as expected when run against ContextSuggestField
        (as in the query is wrapped as a ContextQuery with no context filtering/boosting).

        OK, for that direction it makes sense allow ... good.

        Are you allowed to mix ContextSuggestField and SuggestField even for the same field name, within one suggester?

        No you are not. If mixed, CompletionFieldsConsumer will throw IllegalArgumentException upon indexing.

        OK, excellent.

        Can we rename TopSuggestDocsCollector.num() to maybe .getCountToCollect or something a bit more verbose?

        Net/net this is a nice change, thanks Areek Zillur!

        Show
        Michael McCandless added a comment - We should use this API for other suggesters, but maybe in a separate issue Yes, definitely separate! In terms of new functionality, fuzzy, regex and context queries are added. OK, cool, these are nice additions. query-time context boosts Cool: so you boost some contexts more than others, using ContextQuery. As we only allow integer index-time weights I thought we accept long (not int) as index-time weight? (But, I think that really is overkill... maybe they should just floats, like per-field boosting at index time). But we can worry about this later... one possibility would be to use index-weight + (Int.MAX * boost) instead of using MaxWeight of suggestions Sorry I don't understand the idea here? If you try to use ContextQuery against a field that you had not indexed contexts with (using ContextSuggestField) do you see any error? Maybe this is too hard. There should not be any error. A ContextQuery will never be run on a SuggestField It seems like we could detect this mis-use, since CompletionTerms seems to know whether the field was indexed with contexts or not? I.e, if I accidentally try to run a ContextQuery against a field indexed with only SuggestField, it seems like I should get an exception saying I screwed up ... (similar to trying to run a PhraseQuery on a field that did not index positions)? Maybe add a simple test case? A ContextQuery will never be run on a SuggestField, CompletionQuery rewrites appropriately given the type of the field (context-enabled or not). OK maybe at that rewrite is the time to throw the exc? This also makes non-context queries work as expected when run against ContextSuggestField (as in the query is wrapped as a ContextQuery with no context filtering/boosting). OK, for that direction it makes sense allow ... good. Are you allowed to mix ContextSuggestField and SuggestField even for the same field name, within one suggester? No you are not. If mixed, CompletionFieldsConsumer will throw IllegalArgumentException upon indexing. OK, excellent. Can we rename TopSuggestDocsCollector.num() to maybe .getCountToCollect or something a bit more verbose? Net/net this is a nice change, thanks Areek Zillur !
        Hide
        Areek Zillur added a comment -

        Thanks Michael McCandless for the feedback

        I thought we accept long (not int) as index-time weight? (But, I
        think that really is overkill... maybe they should just floats, like
        per-field boosting at index time)

        IMO, a suggestion weight is just an index-time boost for the
        associated entry.

        one possibility would be to use index-weight + (Int.MAX * boost) instead of using MaxWeight of suggestions

        Sorry I don't understand the idea here?

        After a query automaton has been intersected with the FST in NRTSuggester,
        boosts and/or context is computed/extracted from each of the partial matched paths
        by the CompletionWeight before performing a TopN search.
        For example, FuzzyCompletionWeight would count the number of prefix characters
        a matched input has w.r.t. the analyzed query prefix and set the boost for it to the
        number of common prefix length.
        Calculating a suggestion score of weight + (maxWeight * boost) makes sure that
        entries with a higher boost (longer common prefix w.r.t. query prefix) will always be
        scored higher regardless of the index-time weight of suggestion entries. The
        segment-level maxWeight is stored in CompletionPostingsFormat (CompletionIndex),
        and the maxWeight is computed across all segments at query-time.
        Since, the maximum weight for any suggestion entry will be <= Integer.MAX_VALUE, we
        can just replace the maxWeight for a suggestField with Integer.MAX_VALUE? One problem
        might be the loss of precision when converting the long score to a float?

        It seems like we could detect this mis-use, since CompletionTerms seems to know whether the field was indexed with contexts or not? I.e, if I accidentally try to run a ContextQuery against a field indexed with only SuggestField, it seems like I should get an exception saying I screwed up ... (similar to trying to run a PhraseQuery on a field that did not index positions)? Maybe add a simple test case?

        I updated the patch to error out when using a ContextQuery against a SuggestField at rewrite with test.

        Can we rename TopSuggestDocsCollector.num() to maybe .getCountToCollect or something a bit more verbose?

        Changed TopSuggestDocsCollector.num() to getCountToCollect()

        Show
        Areek Zillur added a comment - Thanks Michael McCandless for the feedback I thought we accept long (not int) as index-time weight? (But, I think that really is overkill... maybe they should just floats, like per-field boosting at index time) IMO, a suggestion weight is just an index-time boost for the associated entry. one possibility would be to use index-weight + (Int.MAX * boost) instead of using MaxWeight of suggestions Sorry I don't understand the idea here? After a query automaton has been intersected with the FST in NRTSuggester , boosts and/or context is computed/extracted from each of the partial matched paths by the CompletionWeight before performing a TopN search. For example, FuzzyCompletionWeight would count the number of prefix characters a matched input has w.r.t. the analyzed query prefix and set the boost for it to the number of common prefix length. Calculating a suggestion score of weight + (maxWeight * boost) makes sure that entries with a higher boost (longer common prefix w.r.t. query prefix) will always be scored higher regardless of the index-time weight of suggestion entries. The segment-level maxWeight is stored in CompletionPostingsFormat (CompletionIndex), and the maxWeight is computed across all segments at query-time. Since, the maximum weight for any suggestion entry will be <= Integer.MAX_VALUE, we can just replace the maxWeight for a suggestField with Integer.MAX_VALUE? One problem might be the loss of precision when converting the long score to a float? It seems like we could detect this mis-use, since CompletionTerms seems to know whether the field was indexed with contexts or not? I.e, if I accidentally try to run a ContextQuery against a field indexed with only SuggestField, it seems like I should get an exception saying I screwed up ... (similar to trying to run a PhraseQuery on a field that did not index positions)? Maybe add a simple test case? I updated the patch to error out when using a ContextQuery against a SuggestField at rewrite with test. Can we rename TopSuggestDocsCollector.num() to maybe .getCountToCollect or something a bit more verbose? Changed TopSuggestDocsCollector.num() to getCountToCollect()
        Hide
        Areek Zillur added a comment -

        Updated Patch:

        • pre-sort context length in ContextQuery
        • added test to ensure longer context takes precedence for boosting
        Show
        Areek Zillur added a comment - Updated Patch: pre-sort context length in ContextQuery added test to ensure longer context takes precedence for boosting
        Hide
        Michael McCandless added a comment -

        IMO, a suggestion weight is just an index-time boost for the associated entry.

        +1

        Shouldn't we switch to float (later, separate issue!)?

        Calculating a suggestion score of weight + (maxWeight * boost) makes sure that
        entries with a higher boost (longer common prefix w.r.t. query prefix) will always be
        scored higher regardless of the index-time weight of suggestion entries. The
        segment-level maxWeight is stored in CompletionPostingsFormat (CompletionIndex),
        and the maxWeight is computed across all segments at query-time.
        Since, the maximum weight for any suggestion entry will be <= Integer.MAX_VALUE, we
        can just replace the maxWeight for a suggestField with Integer.MAX_VALUE? One problem
        might be the loss of precision when converting the long score to a float?

        OK I think I understand! It's a scoring model that guarantees that
        the search-time score ranking comes first and index time ranking is
        used only for tie-breaking the search-time score.

        But, I'm not sure that's a good goal? E.g I may not want a very low
        index-time boosted suggestion that has more shared prefix to score
        higher than a highly index-time boosted suggestion that matched a bit
        smaller prefix?

        E.g. when I type "pyh" into google, the top suggestion is "python"
        (prefix=2) and after that is "pyhs2" (prefix=3).

        Anyway, I think scoring will forever be a challenging topic and we
        just have to keep it pluggable (CompletionScorer.score).

        Hmm ant precommit is failing for me:

        -documentation-lint:
             [echo] checking for broken html...
            [jtidy] Checking for broken html (such as invalid tags)...
           [delete] Deleting directory /l/areek/lucene/build/jtidy_tmp
             [echo] Checking for broken links...
             [exec] 
             [exec] Crawl/parse...
             [exec] 
             [exec] Verify...
             [exec] 
             [exec] file:///build/docs/suggest/org/apache/lucene/search/suggest/document/CompletionTerms.html
             [exec]   BROKEN LINK: file:///build/docs/core/org/apache/lucene/search/suggest.document.CompletionFieldsProducer.CompletionsTermsReader.html
             [exec]   BROKEN LINK: file:///build/docs/core/org/apache/lucene/search/suggest.document.NRTSuggester.html
             [exec]   BROKEN LINK: file:///build/docs/core/org/apache/lucene/search/suggest.document.NRTSuggester.html
             [exec]   BROKEN LINK: file:///build/docs/core/org/apache/lucene/search/suggest.document.CompletionFieldsProducer.CompletionsTermsReader.html
             [exec]   BROKEN LINK: file:///build/docs/core/org/apache/lucene/search/suggest.document.NRTSuggester.html
             [exec]   BROKEN LINK: file:///build/docs/core/org/apache/lucene/search/suggest.document.NRTSuggester.html
             [exec] 
             [exec] file:///build/docs/suggest/org/apache/lucene/search/suggest/document/SuggestField.html
             [exec]   BROKEN LINK: file:///build/docs/core/org/apache/lucene/search/suggest.document.CompletionTokenStream.html
             [exec]   BROKEN LINK: file:///build/docs/core/org/apache/lucene/search/suggest.document.CompletionTokenStream.html
             [exec] 
             [exec] file:///build/docs/suggest/org/apache/lucene/search/suggest/document/class-use/CompletionWeight.html
             [exec]   BROKEN LINK: file:///build/docs/core/org/apache/lucene/search/suggest.document.NRTSuggester.html
             [exec] 
             [exec] file:///build/docs/suggest/org/apache/lucene/search/suggest/document/CompletionScorer.html
             [exec]   BROKEN LINK: file:///build/docs/core/org/apache/lucene/search/suggest.document.NRTSuggester.html
             [exec]   BROKEN LINK: file:///build/docs/core/org/apache/lucene/search/suggest.document.NRTSuggester.html
             [exec] 
             [exec] file:///build/docs/suggest/org/apache/lucene/search/suggest/document/ContextSuggestField.html
             [exec]   BROKEN LINK: file:///build/docs/core/org/apache/lucene/search/suggest.document.CompletionTokenStream.html
             [exec]   BROKEN LINK: file:///build/docs/core/org/apache/lucene/search/suggest.document.CompletionTokenStream.html
             [exec] 
             [exec] Broken javadocs links were found!
        
        Show
        Michael McCandless added a comment - IMO, a suggestion weight is just an index-time boost for the associated entry. +1 Shouldn't we switch to float (later, separate issue!)? Calculating a suggestion score of weight + (maxWeight * boost) makes sure that entries with a higher boost (longer common prefix w.r.t. query prefix) will always be scored higher regardless of the index-time weight of suggestion entries. The segment-level maxWeight is stored in CompletionPostingsFormat (CompletionIndex), and the maxWeight is computed across all segments at query-time. Since, the maximum weight for any suggestion entry will be <= Integer.MAX_VALUE, we can just replace the maxWeight for a suggestField with Integer.MAX_VALUE? One problem might be the loss of precision when converting the long score to a float? OK I think I understand! It's a scoring model that guarantees that the search-time score ranking comes first and index time ranking is used only for tie-breaking the search-time score. But, I'm not sure that's a good goal? E.g I may not want a very low index-time boosted suggestion that has more shared prefix to score higher than a highly index-time boosted suggestion that matched a bit smaller prefix? E.g. when I type "pyh" into google, the top suggestion is "python" (prefix=2) and after that is "pyhs2" (prefix=3). Anyway, I think scoring will forever be a challenging topic and we just have to keep it pluggable (CompletionScorer.score). Hmm ant precommit is failing for me: -documentation-lint: [echo] checking for broken html... [jtidy] Checking for broken html (such as invalid tags)... [delete] Deleting directory /l/areek/lucene/build/jtidy_tmp [echo] Checking for broken links... [exec] [exec] Crawl/parse... [exec] [exec] Verify... [exec] [exec] file:///build/docs/suggest/org/apache/lucene/search/suggest/document/CompletionTerms.html [exec] BROKEN LINK: file:///build/docs/core/org/apache/lucene/search/suggest.document.CompletionFieldsProducer.CompletionsTermsReader.html [exec] BROKEN LINK: file:///build/docs/core/org/apache/lucene/search/suggest.document.NRTSuggester.html [exec] BROKEN LINK: file:///build/docs/core/org/apache/lucene/search/suggest.document.NRTSuggester.html [exec] BROKEN LINK: file:///build/docs/core/org/apache/lucene/search/suggest.document.CompletionFieldsProducer.CompletionsTermsReader.html [exec] BROKEN LINK: file:///build/docs/core/org/apache/lucene/search/suggest.document.NRTSuggester.html [exec] BROKEN LINK: file:///build/docs/core/org/apache/lucene/search/suggest.document.NRTSuggester.html [exec] [exec] file:///build/docs/suggest/org/apache/lucene/search/suggest/document/SuggestField.html [exec] BROKEN LINK: file:///build/docs/core/org/apache/lucene/search/suggest.document.CompletionTokenStream.html [exec] BROKEN LINK: file:///build/docs/core/org/apache/lucene/search/suggest.document.CompletionTokenStream.html [exec] [exec] file:///build/docs/suggest/org/apache/lucene/search/suggest/document/class-use/CompletionWeight.html [exec] BROKEN LINK: file:///build/docs/core/org/apache/lucene/search/suggest.document.NRTSuggester.html [exec] [exec] file:///build/docs/suggest/org/apache/lucene/search/suggest/document/CompletionScorer.html [exec] BROKEN LINK: file:///build/docs/core/org/apache/lucene/search/suggest.document.NRTSuggester.html [exec] BROKEN LINK: file:///build/docs/core/org/apache/lucene/search/suggest.document.NRTSuggester.html [exec] [exec] file:///build/docs/suggest/org/apache/lucene/search/suggest/document/ContextSuggestField.html [exec] BROKEN LINK: file:///build/docs/core/org/apache/lucene/search/suggest.document.CompletionTokenStream.html [exec] BROKEN LINK: file:///build/docs/core/org/apache/lucene/search/suggest.document.CompletionTokenStream.html [exec] [exec] Broken javadocs links were found!
        Hide
        Areek Zillur added a comment - - edited

        Thanks Michael McCandless for the feedback!

        Anyway, I think scoring will forever be a challenging topic and we
        just have to keep it pluggable (CompletionScorer.score).

        I have made the CompletionScorer.score pluggable by extracting a ScoreMode interface, which
        computes a score for a suggestion using its weight and boost (along with min/max weight stats).

        public interface ScoreMode {
          float score(float weight, float boost, float minWeight, float maxWeight);
        }
        

        Currently there are three implementations:

        • BOOST (weight * boost),
        • IGNORE_BOOST (weight)
        • BOOST_FIRST (weight + (maxWeight * boost))

        Now any CompletionQuery can plug in one of the score modes (or custom implementation)
        through CompletionQuery.setScoreMode.
        Default score mode for existing queries:

        • PrefixCompletionQuery & RegexCompletionQuery - IGNORE_BOOST
        • FuzzyCompletionQuery - BOOST_FIRST
        • ContextQuery - BOOST

        ScoreMode is also used in the comparator for the TopNSearcher queue.

        It would be good to know your thoughts on how the TopNSearcher queue size is calculated here.
        Now that every intersected prefix path might have a different boost and this boost influences the top N,
        the topN for the queue has been changed to n * num_prefix_paths from just n to allow for n path
        expansions per intersected prefix path. In reality, search execution will early terminate as soon as the
        original topN results have been collected.

        Updated Patch:

        • Added ScoreMode for pluggable scoring
        • Increase test coverage
        • Added dedicated tests for PrefixCompletionQuery, RegexCompletionQuery,
          FuzzyCompletionQuery and ContextQuery
        • Fixed javadocs, making ant precommit happy
        Show
        Areek Zillur added a comment - - edited Thanks Michael McCandless for the feedback! Anyway, I think scoring will forever be a challenging topic and we just have to keep it pluggable (CompletionScorer.score). I have made the CompletionScorer.score pluggable by extracting a ScoreMode interface, which computes a score for a suggestion using its weight and boost (along with min/max weight stats). public interface ScoreMode { float score( float weight, float boost, float minWeight, float maxWeight); } Currently there are three implementations: BOOST (weight * boost), IGNORE_BOOST (weight) BOOST_FIRST (weight + (maxWeight * boost)) Now any CompletionQuery can plug in one of the score modes (or custom implementation) through CompletionQuery.setScoreMode . Default score mode for existing queries: PrefixCompletionQuery & RegexCompletionQuery - IGNORE_BOOST FuzzyCompletionQuery - BOOST_FIRST ContextQuery - BOOST ScoreMode is also used in the comparator for the TopNSearcher queue. It would be good to know your thoughts on how the TopNSearcher queue size is calculated here. Now that every intersected prefix path might have a different boost and this boost influences the top N, the topN for the queue has been changed to n * num_prefix_paths from just n to allow for n path expansions per intersected prefix path. In reality, search execution will early terminate as soon as the original topN results have been collected. Updated Patch: Added ScoreMode for pluggable scoring Increase test coverage Added dedicated tests for PrefixCompletionQuery, RegexCompletionQuery, FuzzyCompletionQuery and ContextQuery Fixed javadocs, making ant precommit happy
        Hide
        Areek Zillur added a comment -

        Updated patch:

        • removed ScoreMode interface (will add it in a separate issue)
        • added more tests
        Show
        Areek Zillur added a comment - Updated patch: removed ScoreMode interface (will add it in a separate issue) added more tests
        Hide
        Michael McCandless added a comment -

        Thanks Areek Zillur, new patch looks great, passes ant precommit ... I'll commit soon.

        Show
        Michael McCandless added a comment - Thanks Areek Zillur , new patch looks great, passes ant precommit ... I'll commit soon.
        Hide
        Areek Zillur added a comment -

        Added changes entry

        Show
        Areek Zillur added a comment - Added changes entry
        Hide
        Areek Zillur added a comment -

        Added changes entry

        Show
        Areek Zillur added a comment - Added changes entry
        Hide
        Areek Zillur added a comment -

        minor cleanup (don't throw redundant IOException in CompletionQuery ctor)

        Show
        Areek Zillur added a comment - minor cleanup (don't throw redundant IOException in CompletionQuery ctor)
        Hide
        ASF subversion and git services added a comment -

        Commit 1682158 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1682158 ]

        LUCENE-6459: add common suggest API for document based NRT suggester

        Show
        ASF subversion and git services added a comment - Commit 1682158 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1682158 ] LUCENE-6459 : add common suggest API for document based NRT suggester
        Hide
        ASF subversion and git services added a comment -

        Commit 1682170 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1682170 ]

        LUCENE-6459: add common suggest API for document based NRT suggester

        Show
        ASF subversion and git services added a comment - Commit 1682170 from Michael McCandless in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1682170 ] LUCENE-6459 : add common suggest API for document based NRT suggester
        Hide
        ASF subversion and git services added a comment -

        Commit 1682172 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1682172 ]

        LUCENE-6459: carry forward changes from backport

        Show
        ASF subversion and git services added a comment - Commit 1682172 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1682172 ] LUCENE-6459 : carry forward changes from backport
        Hide
        Michael McCandless added a comment -

        Thanks Areek Zillur, I committed the last patch!

        I ran PMD linter and it spotted a few issues:

        lucene/suggest/src/java/org/apache/lucene/search/suggest/document/CompletionWeight.java:52:	Avoid unused private fields such as 'maxWeight'.
        lucene/suggest/src/java/org/apache/lucene/search/suggest/document/CompletionWeight.java:53:	Avoid unused private fields such as 'minWeight'.
        
        lucene/suggest/src/java/org/apache/lucene/search/suggest/document/NRTSuggester.java:97:	Avoid unused private fields such as 'endByte'.
        lucene/suggest/src/java/org/apache/lucene/search/suggest/document/NRTSuggester.java:173:	Avoid unused local variables such as 'search'.
        

        Can you look into them?

        Also, the API to add ContextSuggestField is somewhat annnoying with
        Java 7, e.g.:

            document.add(new ContextSuggestField("context_suggest_field", Collections.<CharSequence>singletonList("type1"), "suggestion1", 4));
        

        Maybe we can improve this to take a varargs?

        Show
        Michael McCandless added a comment - Thanks Areek Zillur , I committed the last patch! I ran PMD linter and it spotted a few issues: lucene/suggest/src/java/org/apache/lucene/search/suggest/document/CompletionWeight.java:52: Avoid unused private fields such as 'maxWeight'. lucene/suggest/src/java/org/apache/lucene/search/suggest/document/CompletionWeight.java:53: Avoid unused private fields such as 'minWeight'. lucene/suggest/src/java/org/apache/lucene/search/suggest/document/NRTSuggester.java:97: Avoid unused private fields such as 'endByte'. lucene/suggest/src/java/org/apache/lucene/search/suggest/document/NRTSuggester.java:173: Avoid unused local variables such as 'search'. Can you look into them? Also, the API to add ContextSuggestField is somewhat annnoying with Java 7, e.g.: document.add(new ContextSuggestField("context_suggest_field", Collections.<CharSequence>singletonList("type1"), "suggestion1", 4)); Maybe we can improve this to take a varargs?
        Hide
        Shalin Shekhar Mangar added a comment -

        Bulk close for 5.3.0 release

        Show
        Shalin Shekhar Mangar added a comment - Bulk close for 5.3.0 release

          People

          • Assignee:
            Areek Zillur
            Reporter:
            Areek Zillur
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development