Lucene - Core
  1. Lucene - Core
  2. LUCENE-2557

FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches

    Details

    • Type: Bug Bug
    • Status: Reopened
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.2
    • Fix Version/s: None
    • Component/s: core/query/scoring
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The FuzzyQuery often causes misspellings to be ranked higher than the exact match, which seems to be an undesirable property generally.

      For example, in an index of surnames, if I search using a FuzzyQuery for "smith", the misspellings such as "smiith", or "smiht" would appear near the top of the search results ahead of documents that match "smith".

      1. LUCENE-2557.patch
        7 kB
        Jingkei Ly
      2. idf-scoring-test-case.patch
        3 kB
        Jingkei Ly

        Activity

        Hide
        Jingkei Ly added a comment -

        I've attached a test case which demonstrates some of the scoring issues (the patch applies to the existing TestFuzzyQuery class). With the default FuzzyQuery, the fuzzy terms "joness" and "smiith" get promoted to the top of the search results because they have higher IDFs than the exact matches.

        If you modify the test so that the FuzzyQuerys use TopTermsBoostOnlyBooleanQueryRewrite, i.e. uncomment these lines in the test case:

        smithQuery.setRewriteMethod(new MultiTermQuery.TopTermsBoostOnlyBooleanQueryRewrite());
        jonesQuery.setRewriteMethod(new MultiTermQuery.TopTermsBoostOnlyBooleanQueryRewrite());
        

        The fuzzy terms are correctly relegated to the bottom of the search results but, because IDF is ignored, "jones" appears more highly scored than "smith" even though "smith" is the rarer term.

        Ideally the solution should solve both these issues.

        Show
        Jingkei Ly added a comment - I've attached a test case which demonstrates some of the scoring issues (the patch applies to the existing TestFuzzyQuery class). With the default FuzzyQuery, the fuzzy terms "joness" and "smiith" get promoted to the top of the search results because they have higher IDFs than the exact matches. If you modify the test so that the FuzzyQuerys use TopTermsBoostOnlyBooleanQueryRewrite, i.e. uncomment these lines in the test case: smithQuery.setRewriteMethod( new MultiTermQuery.TopTermsBoostOnlyBooleanQueryRewrite()); jonesQuery.setRewriteMethod( new MultiTermQuery.TopTermsBoostOnlyBooleanQueryRewrite()); The fuzzy terms are correctly relegated to the bottom of the search results but, because IDF is ignored, "jones" appears more highly scored than "smith" even though "smith" is the rarer term. Ideally the solution should solve both these issues.
        Hide
        Robert Muir added a comment -

        Duplicate of LUCENE-124, which added this new rewrite method in trunk and 3.x:

        * LUCENE-124: Add a TopTermsBoostOnlyBooleanQueryRewrite to MultiTermQuery.
          This rewrite method is similar to TopTermsScoringBooleanQueryRewrite, but
          only scores terms by their boost values. For example, this can be used 
          with FuzzyQuery to ensure that exact matches are always scored higher, 
          because only the boost will be used in scoring. 
        
        Show
        Robert Muir added a comment - Duplicate of LUCENE-124 , which added this new rewrite method in trunk and 3.x: * LUCENE-124: Add a TopTermsBoostOnlyBooleanQueryRewrite to MultiTermQuery. This rewrite method is similar to TopTermsScoringBooleanQueryRewrite, but only scores terms by their boost values. For example, this can be used with FuzzyQuery to ensure that exact matches are always scored higher, because only the boost will be used in scoring.
        Hide
        Jingkei Ly added a comment -

        Robert,

        I posted a comment just before your one (apparently in the same minute) - I made an additional point that TopTermsBoostOnlyBooleanQueryRewrite ignores the IDF undesirably - does that make the issue valid again?

        Show
        Jingkei Ly added a comment - Robert, I posted a comment just before your one (apparently in the same minute) - I made an additional point that TopTermsBoostOnlyBooleanQueryRewrite ignores the IDF undesirably - does that make the issue valid again?
        Hide
        Jingkei Ly added a comment -

        I've had a crack at implementing a fix, based on suggestions in LUCENE-329. It takes the IDF of the term used in the FuzzyQuery if it exists in the index and uses that as the IDF. If the term is not in the index it uses the average IDF of all the terms.

        It is implemented as a rewrite method similar to TopTermsBoostOnlyBooleanQueryRewrite from LUCENE-124, although it required modifying TopTermsBooleanQueryRewrite a little bit.

        Show
        Jingkei Ly added a comment - I've had a crack at implementing a fix, based on suggestions in LUCENE-329 . It takes the IDF of the term used in the FuzzyQuery if it exists in the index and uses that as the IDF. If the term is not in the index it uses the average IDF of all the terms. It is implemented as a rewrite method similar to TopTermsBoostOnlyBooleanQueryRewrite from LUCENE-124 , although it required modifying TopTermsBooleanQueryRewrite a little bit.
        Hide
        Robert Muir added a comment -

        I dont understand why we need to average any idfs? this seems really costly and i think in general the idea of fuzzy is to find misspellings.

        furthermore i dont understand why its important if the idf if the query term exists in the index or not, because the query itself could be misspelled.

        Show
        Robert Muir added a comment - I dont understand why we need to average any idfs? this seems really costly and i think in general the idea of fuzzy is to find misspellings. furthermore i dont understand why its important if the idf if the query term exists in the index or not, because the query itself could be misspelled.
        Hide
        Jingkei Ly added a comment -

        I dont understand why we need to average any idfs? this seems really costly and i think in general the idea of fuzzy is to find misspellings.

        I agree that fuzzy is to find misspellings, but I don't think it should favour misspellings above an exact match. I think the reasoning behind the average IDFs (I based that on comments in LUCENE-329), is that in the absence of an IDF from the exact match it's better than nothing to have an average of the terms you do know. Perhaps, there is a better heuristic for that case, though.

        furthermore i dont understand why its important if the idf if the query term exists in the index or not, because the query itself could be misspelled.

        I think it's a fair assumption that users are searching for specific terms (+fore:john +sur:smith), so are unlikely that they would have a misspelling in the original query. If they did misspell it and got erroneous results, it seems it's immediately clear that the cause is a misspelt query.

        Show
        Jingkei Ly added a comment - I dont understand why we need to average any idfs? this seems really costly and i think in general the idea of fuzzy is to find misspellings. I agree that fuzzy is to find misspellings, but I don't think it should favour misspellings above an exact match. I think the reasoning behind the average IDFs (I based that on comments in LUCENE-329 ), is that in the absence of an IDF from the exact match it's better than nothing to have an average of the terms you do know. Perhaps, there is a better heuristic for that case, though. furthermore i dont understand why its important if the idf if the query term exists in the index or not, because the query itself could be misspelled. I think it's a fair assumption that users are searching for specific terms (+fore:john +sur:smith), so are unlikely that they would have a misspelling in the original query. If they did misspell it and got erroneous results, it seems it's immediately clear that the cause is a misspelt query.
        Hide
        Robert Muir added a comment -

        I agree that fuzzy is to find misspellings, but I don't think it should favour misspellings above an exact match.

        So what is the problem with TopTermsBoostOnlyBooleanQueryRewrite? it will never do this.

        While I agree this is a really simple solution to the problem, It seemed to me from the comments in LUCENE-329 that there were differing opinions on how one might want to combine the factors of edit distance boost, tf, idf, etc... it seems it will depend on the application.

        So I definitely don't think this is any bug in fuzzyquery. Personally, I am not against adding new alternative rewrite methods like the one you added here, so that more choices are available. But this just seems to be the same issue as LUCENE-329 to me.

        My personal preference would be to take this code and bring LUCENE-329 up to speed, e.g. creating an alternative in contrib/queries or something that uses Mark Harwoods "smart fuzzy" logic which is currently limited to FuzzyLikeThis.

        Show
        Robert Muir added a comment - I agree that fuzzy is to find misspellings, but I don't think it should favour misspellings above an exact match. So what is the problem with TopTermsBoostOnlyBooleanQueryRewrite? it will never do this. While I agree this is a really simple solution to the problem, It seemed to me from the comments in LUCENE-329 that there were differing opinions on how one might want to combine the factors of edit distance boost, tf, idf, etc... it seems it will depend on the application. So I definitely don't think this is any bug in fuzzyquery. Personally, I am not against adding new alternative rewrite methods like the one you added here, so that more choices are available. But this just seems to be the same issue as LUCENE-329 to me. My personal preference would be to take this code and bring LUCENE-329 up to speed, e.g. creating an alternative in contrib/queries or something that uses Mark Harwoods "smart fuzzy" logic which is currently limited to FuzzyLikeThis.
        Hide
        Mark Harwood added a comment -

        I think we're agreed that the effects of IDF are troublesome when ranking variant term matches but I question that the default solution should be to remove IDF from the equation completely.

        Doing that reminds me of the time my mother thought the shadow in a photograph was annoying and cut it out with a pair of scissors leaving a big hole in its place.
        What we're proposing here instead is the equivalent of some "photoshopping" to retain some of the original information but suitably blurred to provide a more natural balance to the overall picture.

        Some degree of IDF can be usefully retained from a FuzzyQuery in order to acheive balance with all the other (potentially non-fuzzy) optional clauses that may exist in a BooleanQuery.
        The proposal is that the most natural blending of IDF scores within a FuzzyQuery is to use only the IDF of the input term (which defines the user's original intent) and use this to score a match on any suggested variant . If the input term does not exist the average IDF of all variants is used as the next best alternative for scoring each variant.

        This approach has exactly the same ranking effect as the existing "remove IDF" policy within a single FuzzyQuery but has the added advantage of sitting better with the other optional clauses that may exist in a containing query.

        The question over core vs contrib comes down to what is considered the more natural/expected behaviour.

        Show
        Mark Harwood added a comment - I think we're agreed that the effects of IDF are troublesome when ranking variant term matches but I question that the default solution should be to remove IDF from the equation completely. Doing that reminds me of the time my mother thought the shadow in a photograph was annoying and cut it out with a pair of scissors leaving a big hole in its place. What we're proposing here instead is the equivalent of some "photoshopping" to retain some of the original information but suitably blurred to provide a more natural balance to the overall picture. Some degree of IDF can be usefully retained from a FuzzyQuery in order to acheive balance with all the other (potentially non-fuzzy) optional clauses that may exist in a BooleanQuery. The proposal is that the most natural blending of IDF scores within a FuzzyQuery is to use only the IDF of the input term (which defines the user's original intent) and use this to score a match on any suggested variant . If the input term does not exist the average IDF of all variants is used as the next best alternative for scoring each variant. This approach has exactly the same ranking effect as the existing "remove IDF" policy within a single FuzzyQuery but has the added advantage of sitting better with the other optional clauses that may exist in a containing query. The question over core vs contrib comes down to what is considered the more natural/expected behaviour.
        Hide
        Robert Muir added a comment -

        I think we're agreed that the effects of IDF are troublesome when ranking variant term matches but I question that the default solution should be to remove IDF from the equation completely.

        Mark, just fyi, the boost-only rewrite method isnt the default (its just a simple option, but no runtime behavior has changed, it still uses "normal" boolean expansion as default).

        I basically agree with your idea that it would be nicer to add a smarter rewrite method, and if so, probably change the defaults. But my concerns are:

        • this input term / does not exist thing seems a little wierd to me as mentioned.
        • doing a lot of docfreq/idf calls seems expensive? this is no problem with FuzzyLikeThis i think though, doesnt it uses a more reasonable PQ size? (50 or something)
        Show
        Robert Muir added a comment - I think we're agreed that the effects of IDF are troublesome when ranking variant term matches but I question that the default solution should be to remove IDF from the equation completely. Mark, just fyi, the boost-only rewrite method isnt the default (its just a simple option, but no runtime behavior has changed, it still uses "normal" boolean expansion as default). I basically agree with your idea that it would be nicer to add a smarter rewrite method, and if so, probably change the defaults. But my concerns are: this input term / does not exist thing seems a little wierd to me as mentioned. doing a lot of docfreq/idf calls seems expensive? this is no problem with FuzzyLikeThis i think though, doesnt it uses a more reasonable PQ size? (50 or something)
        Hide
        Robert Muir added a comment -

        so here is an option for this issue. we could reword the whole issue as 'improve FuzzyQuery defaults'.

        If we were to do this, i would suggest the following at the minimum:

        • instead of a default distance of 0.5 (from queryparser), if distance isnt provided (~0.6 etc), calculate one that will perform well and never brute-force compare all the terms.
        • instead of a default max expansions of booleanquery max clause count (1024), use a more reasonable # of expansions by default (such as 50)
        • instead of the current rewrite, use a rewrite similar to FuzzyLikeThis. maybe we dont need to average docfreq across all 50 terms even, maybe the top-5 or so is sufficient.

        If we were to do something like this, maybe we could improve performance and behavior instead of making tradeoffs.

        Show
        Robert Muir added a comment - so here is an option for this issue. we could reword the whole issue as 'improve FuzzyQuery defaults'. If we were to do this, i would suggest the following at the minimum: instead of a default distance of 0.5 (from queryparser), if distance isnt provided (~0.6 etc), calculate one that will perform well and never brute-force compare all the terms. instead of a default max expansions of booleanquery max clause count (1024), use a more reasonable # of expansions by default (such as 50) instead of the current rewrite, use a rewrite similar to FuzzyLikeThis. maybe we dont need to average docfreq across all 50 terms even, maybe the top-5 or so is sufficient. If we were to do something like this, maybe we could improve performance and behavior instead of making tradeoffs.
        Hide
        Mark Harwood added a comment -

        I dont understand why we need to average any idfs? this seems really costly

        IDF lookups and averaging etc should only be calculated for the top "n" terms that finally make it into the query. "Top" in this case being some edit distance threshold or synonymity measure. All the required doc frequency info for IDF is available in RAM on TermEnum which is iterated across anyway and so shouldn't incur any extra disk seeks. So given a query that expands to 1,000 terms the cost of computing the average IDF for that set of terms is surely lost in the cost of 1,000 disk seeks on the TermDocs as part of query evaluation? I need to review the code to remind myself of how it is processed but it feels like it should be cheap.

        average docfreq across all 50 terms even, maybe the top-5 or so is sufficient.

        That could work. The IDF score simply has to be a value that is used as a constant for all the expanded terms in a fuzzy query and, as an added bonus, represents a value that can be usefully contrasted with other query clauses. The averaging policy is just a fall-back position in the rarer situations when a user's original input term has no associated IDF value we can use. If this policy is a performance concern then we could reduce the number of terms as you suggest or just ignore IDF entirely in this case but I'm not sure the averaging costs represent any kind of real performance concern given the IO costs of accessing TermDocs.

        Show
        Mark Harwood added a comment - I dont understand why we need to average any idfs? this seems really costly IDF lookups and averaging etc should only be calculated for the top "n" terms that finally make it into the query. "Top" in this case being some edit distance threshold or synonymity measure. All the required doc frequency info for IDF is available in RAM on TermEnum which is iterated across anyway and so shouldn't incur any extra disk seeks. So given a query that expands to 1,000 terms the cost of computing the average IDF for that set of terms is surely lost in the cost of 1,000 disk seeks on the TermDocs as part of query evaluation? I need to review the code to remind myself of how it is processed but it feels like it should be cheap. average docfreq across all 50 terms even, maybe the top-5 or so is sufficient. That could work. The IDF score simply has to be a value that is used as a constant for all the expanded terms in a fuzzy query and, as an added bonus, represents a value that can be usefully contrasted with other query clauses. The averaging policy is just a fall-back position in the rarer situations when a user's original input term has no associated IDF value we can use. If this policy is a performance concern then we could reduce the number of terms as you suggest or just ignore IDF entirely in this case but I'm not sure the averaging costs represent any kind of real performance concern given the IO costs of accessing TermDocs.
        Hide
        Robert Muir added a comment -

        If this policy is a performance concern then we could reduce the number of terms as you suggest or just ignore IDF entirely in this case but I'm not sure the averaging costs represent any kind of real performance concern given the IO costs of accessing TermDocs.

        I suggested reducing the number of terms (for the averaging), but also the number of default expansions.
        I think in general expanding to 1024 is obscene...

        But also, if we reduce this number, FuzzyTermsEnum itself gets faster, too.
        FuzzyTermsEnum is aware (via an attribute) when the priority queue is filled, and it knows the minimal score to be competitive.
        When a certain edit distance is no longer competitive, it optimizes itself by swapping in a more efficient Automaton.
        This is safe because the pq's comparator is score, then the term's compareTo (lexicographic order).

        Simple example: lets say you ask for a max of 1 expansions, but with a fuzzy query of max 1 edit distance.
        as soon as the enum finds a term of ed=1, terms of ed=1 are no longer competitive, so it will then try to seek
        to an exact match (swapping in an ed=0 automaton) and exit, instead of wasting time seeking to useless terms.

        its a bit more complicated since the boost value is really not just edit distance but also string length, but I think this illustration works,
        its one reason why I think we should try to 'improve the defaults'.

        Show
        Robert Muir added a comment - If this policy is a performance concern then we could reduce the number of terms as you suggest or just ignore IDF entirely in this case but I'm not sure the averaging costs represent any kind of real performance concern given the IO costs of accessing TermDocs. I suggested reducing the number of terms (for the averaging), but also the number of default expansions. I think in general expanding to 1024 is obscene... But also, if we reduce this number, FuzzyTermsEnum itself gets faster, too. FuzzyTermsEnum is aware (via an attribute) when the priority queue is filled, and it knows the minimal score to be competitive. When a certain edit distance is no longer competitive, it optimizes itself by swapping in a more efficient Automaton. This is safe because the pq's comparator is score, then the term's compareTo (lexicographic order). Simple example: lets say you ask for a max of 1 expansions, but with a fuzzy query of max 1 edit distance. as soon as the enum finds a term of ed=1, terms of ed=1 are no longer competitive, so it will then try to seek to an exact match (swapping in an ed=0 automaton) and exit, instead of wasting time seeking to useless terms. its a bit more complicated since the boost value is really not just edit distance but also string length, but I think this illustration works, its one reason why I think we should try to 'improve the defaults'.
        Hide
        Eks Dev added a comment -

        It looks like we have one invariant:
        IDF(QueryTerm) >= IDF(Expansion Term) // Preventing better scoring documents with ET then Documents with exact match on QT.

        Fixing all expansions to IDF(QT) would remove dynamics of the score, making the contribution to the score for all expansions identical. Maybe proportionally scaling IDF of all expansions to preserve mutual IDF dynamics, (relative to IDF(QT) to keep-up with invariant) would work better?

        In case when there is no matching QueryTerm, why not simply preserving expansion Term IDF, what is averaging good for, performance?

        Show
        Eks Dev added a comment - It looks like we have one invariant: IDF(QueryTerm) >= IDF(Expansion Term) // Preventing better scoring documents with ET then Documents with exact match on QT. Fixing all expansions to IDF(QT) would remove dynamics of the score, making the contribution to the score for all expansions identical. Maybe proportionally scaling IDF of all expansions to preserve mutual IDF dynamics, (relative to IDF(QT) to keep-up with invariant) would work better? In case when there is no matching QueryTerm, why not simply preserving expansion Term IDF, what is averaging good for, performance?
        Hide
        Mark Harwood added a comment -

        Fixing all expansions to IDF(QT) would remove dynamics of the score, making the contribution to the score for all expansions identical.

        The "boost" property is used by fuzzy/synonyms etc to express the preference for one term variant over another. The effects of this boost setting are demonstrably wiped out when unfiltered IDF of term variants is used (see the attached Junit)

        , why not simply preserving expansion Term IDF,

        See above. The objective is for all variants in an expanded query to share the same IDF setting in order for the boost setting to work as required.

        Show
        Mark Harwood added a comment - Fixing all expansions to IDF(QT) would remove dynamics of the score, making the contribution to the score for all expansions identical. The "boost" property is used by fuzzy/synonyms etc to express the preference for one term variant over another. The effects of this boost setting are demonstrably wiped out when unfiltered IDF of term variants is used (see the attached Junit) , why not simply preserving expansion Term IDF, See above. The objective is for all variants in an expanded query to share the same IDF setting in order for the boost setting to work as required.
        Hide
        Paul taylor added a comment - - edited
        Show
        Paul taylor added a comment - - edited This post explains why this would be useful for Prefix queries as well http://stackoverflow.com/questions/9632602/there-is-a-mismatch-between-the-score-for-a-wildcard-match-and-an-exact-match
        Hide
        Uwe Schindler added a comment -

        In general for PrefixQueries (which are contant score), the trick to match exact matches higher is to add the exact match query as an additional SHOULD clause: "exact exact*". This way an exact match scores always higher.

        Show
        Uwe Schindler added a comment - In general for PrefixQueries (which are contant score), the trick to match exact matches higher is to add the exact match query as an additional SHOULD clause: "exact exact*". This way an exact match scores always higher.
        Hide
        Paul taylor added a comment -

        Ive simplified my usecase for this example, the exact match and prefix match are both in the query but Im creating a disjunction query so it takes the max value rather than sum of matches. This is required because the user query actually get searched over multiple fields and I just want the best match I dont need a search term to match more than one field.

        Show
        Paul taylor added a comment - Ive simplified my usecase for this example, the exact match and prefix match are both in the query but Im creating a disjunction query so it takes the max value rather than sum of matches. This is required because the user query actually get searched over multiple fields and I just want the best match I dont need a search term to match more than one field.
        Hide
        Uwe Schindler added a comment -

        Maybe put exact and prefix in one separate BQ and then add this BQ to the DisjMax.

        Show
        Uwe Schindler added a comment - Maybe put exact and prefix in one separate BQ and then add this BQ to the DisjMax.
        Hide
        Paul taylor added a comment -

        This would prevent prefix matches from being higher then exact matches but doesn't work for the opposite problem, the differential between exact score match and prefix match only is too great when using constant score for prefix queries, and unreliable when prefix queries are scored. BUt I think my link explains it much more clearly.

        Show
        Paul taylor added a comment - This would prevent prefix matches from being higher then exact matches but doesn't work for the opposite problem, the differential between exact score match and prefix match only is too great when using constant score for prefix queries, and unreliable when prefix queries are scored. BUt I think my link explains it much more clearly.
        Hide
        Robert Muir added a comment -

        In the example of the link, I'm still lost as to why its really useful.

        If all queries are prefix queries, what is the use case for that? Maybe really what is needed is
        suggester, stemming, decompounding, or some combination of those?

        If its really some unique use case that isnt one of the above, and every query must be a prefix query,
        at least bake it in with edge-ngrams filter at index time for better performance. (and just use termquery)

        Show
        Robert Muir added a comment - In the example of the link, I'm still lost as to why its really useful. If all queries are prefix queries, what is the use case for that? Maybe really what is needed is suggester, stemming, decompounding, or some combination of those? If its really some unique use case that isnt one of the above, and every query must be a prefix query, at least bake it in with edge-ngrams filter at index time for better performance. (and just use termquery)
        Hide
        Paul taylor added a comment -

        I think you are missing point here whilst there may be some way I can achieve my aim by changing indexing (heres an only slight outof date example using the actual code http://svn.musicbrainz.org/search_server/trunk/servlet/src/main/java/org/musicbrainz/search/servlet/DismaxQueryParser.java) the fact is that it doesn't make sense to fully score fuzzy queries if your query contains exact and fuzzy matches (the current default), and it doesnt make sense to use a constant score either (your suggestion), a default that does score but uses the idf of the search term would give much better default results.

        Show
        Paul taylor added a comment - I think you are missing point here whilst there may be some way I can achieve my aim by changing indexing (heres an only slight outof date example using the actual code http://svn.musicbrainz.org/search_server/trunk/servlet/src/main/java/org/musicbrainz/search/servlet/DismaxQueryParser.java ) the fact is that it doesn't make sense to fully score fuzzy queries if your query contains exact and fuzzy matches (the current default), and it doesnt make sense to use a constant score either (your suggestion), a default that does score but uses the idf of the search term would give much better default results.
        Hide
        Robert Muir added a comment -

        the fact is that it doesn't make sense to fully score fuzzy queries if your query contains exact and fuzzy matches (the current default), and it doesnt make sense to use a constant score either (your suggestion), a default that does score but uses the idf of the search term would give much better default results.

        IDF is an implementation detail of the similarity in trunk. Some scoring systems don't even use IDF.

        If you want what you describe: what is the problem? use TopTermsBoostOnlyRewrite and apply the "IDF" or whatever you want of the search-term yourself as the query boost.

        Show
        Robert Muir added a comment - the fact is that it doesn't make sense to fully score fuzzy queries if your query contains exact and fuzzy matches (the current default), and it doesnt make sense to use a constant score either (your suggestion), a default that does score but uses the idf of the search term would give much better default results. IDF is an implementation detail of the similarity in trunk. Some scoring systems don't even use IDF. If you want what you describe: what is the problem? use TopTermsBoostOnlyRewrite and apply the "IDF" or whatever you want of the search-term yourself as the query boost.
        Hide
        Paul taylor added a comment -

        Sure, but idf is used by default, my current problem is I dont know how to do code it to use the tf of the search term. But I was really commenting on this query to agree with the op that the default (as of 3.5) does not work well

        Show
        Paul taylor added a comment - Sure, but idf is used by default, my current problem is I dont know how to do code it to use the tf of the search term. But I was really commenting on this query to agree with the op that the default (as of 3.5) does not work well
        Hide
        Paul taylor added a comment -

        .. and the patch doesnt seme to work when applies against the version specified of 3.0.2

        Show
        Paul taylor added a comment - .. and the patch doesnt seme to work when applies against the version specified of 3.0.2

          People

          • Assignee:
            Unassigned
            Reporter:
            Jingkei Ly
          • Votes:
            1 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:

              Development