Lucene - Core
LUCENE-5354

Blended score in AnalyzingInfixSuggester

    Details

    • Lucene Fields:
      New

      Description

      I'm working on a custom suggester derived from AnalyzingInfixSuggester. I need what is called a "blended score" (see the //TODO at line 399 in AnalyzingInfixSuggester) to transform the suggestion weights depending on the position of the searched term(s) in the text.

      Right now, I'm using an easy solution:
      If I want 10 suggestions, I search the current ordered index for the first 100 results and transform each weight:

      a) by using the term position in the text (found with TermVector and DocsAndPositionsEnum)

      or

      b) by multiplying the weight by the score of a SpanQuery that I add when searching

      and then return the 10 suggestions with the highest updated weights.

      Since we usually don't need to suggest that many results, the overhead of the larger search plus rescoring is not significant, but I agree that this is not the most elegant solution.
      We could instead include this factor (here, the position of the term) directly in the index.
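
      A minimal sketch of this oversample-and-rescore approach, using the reciprocal blend to stand in for either strategy; the helper firstPositionOfQueryTerm and the surrounding names are hypothetical, not code from any patch:

      import java.io.IOException;
      import java.util.ArrayList;
      import java.util.Collections;
      import java.util.Comparator;
      import java.util.List;

      import org.apache.lucene.search.suggest.Lookup;

      // Oversample by 10x, rescale each weight by a position-based coefficient,
      // then keep only the `num` suggestions with the highest blended weights.
      List<Lookup.LookupResult> blendedLookup(Lookup suggester, CharSequence key, int num)
          throws IOException {
        List<Lookup.LookupResult> raw = suggester.lookup(key, false, num * 10);
        List<Lookup.LookupResult> rescored = new ArrayList<Lookup.LookupResult>();
        for (Lookup.LookupResult r : raw) {
          int position = firstPositionOfQueryTerm(r.key); // hypothetical: found via TermVector
          long blended = (long) (r.value * (1.0 / (1 + position))); // reciprocal blend
          rescored.add(new Lookup.LookupResult(r.key, blended));
        }
        Collections.sort(rescored, new Comparator<Lookup.LookupResult>() {
          @Override
          public int compare(Lookup.LookupResult a, Lookup.LookupResult b) {
            return Long.compare(b.value, a.value); // highest blended weight first
          }
        });
        return rescored.subList(0, Math.min(num, rescored.size()));
      }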

      So, I can contribute this if you think it's worth adding.

      Do you think I should tweak AnalyzingInfixSuggester, subclass it, or create a dedicated class?

      1. LUCENE-5354_2.patch
        22 kB
        Remi Melisson
      2. LUCENE-5354_3.patch
        23 kB
        Remi Melisson
      3. LUCENE-5354_4.patch
        2 kB
        Remi Melisson
      4. LUCENE-5354.patch
        23 kB
        Remi Melisson

        Activity

        Michael McCandless added a comment -

        This sounds very useful!

        I think a subclass could work well, if we open up the necessary methods (which Query to run, how to do the search / resort the results)?

        We could make the index-time sorting optional as well? This way you'd build an "ordinary" index, run an "ordinary" query, so you have full flexibility (but at more search-time cost).

        Remi Melisson added a comment -

        I attached a first patch which adds a blended score based on the position of the searched term. It only provides strategy (a), with two options:

        • Linear (-10% for each position): blended_weight = weight * (1 - 0.10 * position)
        • Reciprocal: blended_weight = weight / (1 + position)

        I would also like to add the second strategy (b) by directly using the score, but here is a first attempt.
        Any advice/remarks welcome!

        Michael McCandless added a comment -

        Thanks Remi, patch looks great!

        Can you move that boolean finished inside the if (lastToken != null)? (If there was no lastToken then we should not be calling offsetEnd.endOffset).

        Can we leave AnalyzingInfixSuggester with DOCS_ONLY? I.e., open up a method (maybe getTextFieldType?) that the subclass would override and set to DOCS_AND_FREQS_AND_POSITIONS.
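
        A minimal sketch of that split, assuming the proposed getTextFieldType hook; the subclass name is illustrative, and the constructor mirrors the existing AnalyzingInfixSuggester(Version, File, Analyzer) one:

        import java.io.File;
        import java.io.IOException;

        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.document.FieldType;
        import org.apache.lucene.document.TextField;
        import org.apache.lucene.index.FieldInfo.IndexOptions;
        import org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester;
        import org.apache.lucene.util.Version;

        public class PositionBlendedSuggester extends AnalyzingInfixSuggester {

          public PositionBlendedSuggester(Version matchVersion, File indexPath, Analyzer analyzer)
              throws IOException {
            super(matchVersion, indexPath, analyzer);
          }

          // The base class stays DOCS_ONLY; this subclass asks for positions (and
          // term vectors) so the blended score can see where each term occurs.
          @Override
          protected FieldType getTextFieldType() {
            FieldType ft = new FieldType(TextField.TYPE_NOT_STORED);
            ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
            ft.setStoreTermVectors(true);
            ft.setStoreTermVectorPositions(true);
            return ft;
          }
        }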

        In createCoefficient, instead of splitting the incoming key on space, I think you should ask the analyzer to do so? In fact, since the lookup (in super) already did that (break into tokens, figure out if last token is a "prefix" or not), maybe we can just pass that down to createResult?

        If the query has more than one term, it looks like you only use the first? Maybe instead we should visit all the terms and record which one has the lowest position?

        Have you done any performance testing? Visiting term vectors for each hit can be costly. It should be more performant to pull a DocsAndPositionsEnum up front and then do .advance to each (sorted) docID to get the position ... but this is likely more complex (it inverts the "stride", so you'd do term by term on the outer loop, then docs on the inner loop, vs the opposite that you have now).
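
        A rough sketch of that inverted stride against the 4.x postings API (the method and variable names are illustrative):

        import java.io.IOException;
        import java.util.Arrays;

        import org.apache.lucene.index.AtomicReader;
        import org.apache.lucene.index.DocsAndPositionsEnum;
        import org.apache.lucene.index.Terms;
        import org.apache.lucene.index.TermsEnum;
        import org.apache.lucene.search.DocIdSetIterator;
        import org.apache.lucene.util.BytesRef;

        // For one query term, walk its postings once and advance() through the
        // docID-sorted top hits, recording the first position in each match.
        static int[] firstPositions(AtomicReader reader, String field, BytesRef term,
                                    int[] sortedDocIDs) throws IOException {
          int[] positions = new int[sortedDocIDs.length];
          Arrays.fill(positions, -1); // -1 = term absent from this hit
          Terms terms = reader.terms(field);
          if (terms == null) return positions;
          TermsEnum termsEnum = terms.iterator(null);
          if (!termsEnum.seekExact(term)) return positions;
          DocsAndPositionsEnum postings = termsEnum.docsAndPositions(null, null);
          if (postings == null) return positions; // field indexed without positions
          int doc = -1;
          for (int i = 0; i < sortedDocIDs.length; i++) {
            if (doc < sortedDocIDs[i]) {
              doc = postings.advance(sortedDocIDs[i]);
              if (doc == DocIdSetIterator.NO_MORE_DOCS) break;
            }
            if (doc == sortedDocIDs[i]) {
              positions[i] = postings.nextPosition(); // smallest position in this doc
            }
          }
          return positions;
        }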

        key.toString() can be pulled out of the while loop and done once up front.

        Why do you use key.toString().contains(docTerm) for the finished case? Won't that result in false positives, e.g. if key is "foobar" and docTerm is "oba"?

        Can you rewrite the embedded ternary operator in the LookUpComparator to just use simple if statements? I think that's more readable...

        Remi Melisson added a comment -

        Hey Michael, thanks for the in-depth code review!
        I attached another patch which makes things simpler and fixes what you suggested.

        The remaining items are:

        Have you done any performance testing?

        Not really. I saw that you did some for the infix suggester, but I couldn't find the code. Is there something already, or should I test the performance my own way?

        Visiting term vectors for each hit can be costly. It should be more performant to pull a DocsAndPositionsEnum up front and then do .advance to each (sorted) docID to get the position ... but this is likely more complex (it inverts the "stride", so you'd do term by term on the outer loop, then docs on the inner loop, vs the opposite that you have now).

        For now, the only way I know to access the DocsAndPositionsEnum is by getting it from the TermsEnum, which implies iterating over the term vector (the doc says "Get DocsAndPositionsEnum for the current term").

        Remi Melisson added a comment -

        Hi, any news about this feature?
        Could I do anything else?

        Michael McCandless added a comment -

        Woops, sorry, this fell below the event horizon of my TODO list. I'll look at your new patch soon.

        There is an existing performance test, LookupBenchmarkTest, but it's a bit tricky to run. See the comment on LUCENE-5030: https://issues.apache.org/jira/browse/LUCENE-5030?focusedCommentId=13689155&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13689155

        Michael McCandless added a comment -

        New patch looks great, thanks Remi!

        I'm worried about how costly iterating over term vectors is going to be ... are you planning to run the performance test? If not, I can.

        It might be better to open up a protected method to convert the smallest position to the coefficient? The default impl can do the switch based on the BlenderType enum... but apps may want to control how the score is "boosted" by position.
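
        A minimal sketch of such a hook, reusing the two formulas from the earlier patch (the enum and method names are only illustrative):

        // The two blending strategies from the patch, behind an overridable hook;
        // apps wanting a different position-to-boost mapping would override it.
        enum BlenderType { POSITION_LINEAR, POSITION_RECIPROCAL }

        protected double calculateCoefficient(BlenderType type, int position) {
          switch (type) {
            case POSITION_LINEAR:
              return 1 - 0.10 * position; // -10% per position (goes negative past position 9; clamping is an option)
            case POSITION_RECIPROCAL:
              return 1.0 / (1 + position); // weight / (1 + position)
            default:
              throw new IllegalArgumentException("unknown blender type: " + type);
          }
        }

        // Blended weight: long blended = (long) (weight * calculateCoefficient(type, position));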

        Remi Melisson added a comment - edited

        Hi!
        Here is a new patch including your comment about the coefficient calculation (I guess a lambda function would be perfect here!).

        I ran the performance test on my laptop; here are the results compared to AnalyzingInfixSuggester:
        – construction time
        AnalyzingInfixSuggester input: 50001, time[ms]: 1780 [+- 367.58]
        BlendedInfixSuggester input: 50001, time[ms]: 6507 [+- 2106.52]
        – prefixes: 2-4, num: 7, onlyMorePopular: false
        AnalyzingInfixSuggester queries: 50001, time[ms]: 6804 [+- 1403.13], ~kQPS: 7
        BlendedInfixSuggester queries: 50001, time[ms]: 26503 [+- 2624.41], ~kQPS: 2
        – prefixes: 6-9, num: 7, onlyMorePopular: false
        AnalyzingInfixSuggester queries: 50001, time[ms]: 3995 [+- 551.20], ~kQPS: 13
        BlendedInfixSuggester queries: 50001, time[ms]: 5355 [+- 1295.41], ~kQPS: 9
        – prefixes: 100-200, num: 7, onlyMorePopular: false
        AnalyzingInfixSuggester queries: 50001, time[ms]: 2626 [+- 588.43], ~kQPS: 19
        BlendedInfixSuggester queries: 50001, time[ms]: 1980 [+- 574.16], ~kQPS: 25
        – RAM consumption
        AnalyzingInfixSuggester size[B]: 1,430,920
        BlendedInfixSuggester size[B]: 1,630,488

        If you have any ideas on how we could improve the performance, let me know (see my comment above about your earlier suggestion to avoid visiting term vectors).

        Michael McCandless added a comment -

        Thanks Remi, the performance seems fine? But I realized this is not the best benchmark, since all suggestions are just a single token.

        New patch looks great; I think we should commit this approach, and performance improvements can come later if necessary.

        see above my comment for your previous suggestion to avoid visiting term vectors

        Oh, the idea I had was to not use term vectors at all: you can get a TermsEnum for the normal inverted index, and then visit each term from the query, and then .advance to each doc from the top N results. But we can do this later ... I'll commit this patch (I'll make some small code style improvements, e.g. adding { } around all ifs).

        Michael McCandless added a comment -

        Thanks Remi!

        I committed with the wrong issue LUCENE-5345 by accident...

        Remi Melisson added a comment -

        Great, glad to contribute!
        In terms of performance, I'm using it on my laptop with 30K terms, and the mean lookup time is 5ms for 5 results and 45ms for 50 results (with a factor of 10, i.e. I retrieve 50 / 500 items and then reduce to 5 / 50). I'm not following a proper testing methodology, so this is just roughly what I observed.
        I will do more extensive performance testing, and yeah, we can tackle that later on.

        Remi Melisson added a comment - edited

        Woops, I introduced a bug when refactoring the comparator.
        I submitted another patch to fix this. I also updated the test case accordingly.

        Michael McCandless added a comment -

        OK, no problem, I'll have a look! Thanks Remi.

        ASF subversion and git services added a comment -

        Commit 1558100 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1558100 ]

        LUCENE-5354: BlendedInfixSuggester: fix wrong return (0 instead of -1) from the LookupResult comparator

        ASF subversion and git services added a comment -

        Commit 1558102 from Michael McCandless in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1558102 ]

        LUCENE-5354: BlendedInfixSuggester: fix wrong return (0 instead of -1) from the LookupResult comparator


          People

          • Assignee:
            Unassigned
          • Reporter:
            Remi Melisson
          • Votes: 1
          • Watchers: 4