Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 1.2
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: core/search
    • Labels:
      None
    • Environment:

      Operating System: All
      Platform: All

      Description

      Queries which automatically produce multiple terms (wildcard, range, prefix,
      fuzzy etc)currently suffer from two problems:

      1) Scores for matching documents are significantly smaller than term queries
      because of the volume of terms introduced (A match on query Foo~ is 0.1
      whereas a match on query Foo is 1).
      2) The rarer forms of expanded terms are favoured over those of more common
      forms because of the IDF. When using Fuzzy queries for example, rare mis-
      spellings typically appear in results before the more common correct spellings.

      I will attach a patch that corrects the issues identified above by
      1) Overriding Similarity.coord to counteract the downplaying of scores
      introduced by expanding terms.
      2) Taking the IDF factor of the most common form of expanded terms as the
      basis of scoring all other expanded terms.

        Issue Links

          Activity

          Hide
          Paul taylor added a comment -

          Why has this been closed ?

          Show
          Paul taylor added a comment - Why has this been closed ?
          Hide
          Grant Ingersoll added a comment -

          Bulk close for 3.1

          Show
          Grant Ingersoll added a comment - Bulk close for 3.1
          Hide
          Mark Harwood added a comment -

          I think you can safely implement a RewriteMethod to do whatever you want?

          Yep, I've got workarounds using FuzzyLikeThis that work for me but have long had a general unease about the "out of the box" experience for others.

          However things are certainly better than they were when this issue was first raised and the main concerns have been addressed.

          So FuzzyQuery behaves now more as one would expect

          Is it worth explicitly stating those expectations? Mine would be based on these principles:
          1) IDF is commonly accepted as useful when ranking partial matches of queries with multiple optional clauses
          2) IDF doesn't stop being useful if one of those clauses just happens to be a term flagged as "fuzzy".

          So given a query: rareWord~ OR commonWord~
          I would expect an exact match on "rareWord" to rank higher than an exact match on "commonWord".
          I don't think the current implementation respects this.

          Show
          Mark Harwood added a comment - I think you can safely implement a RewriteMethod to do whatever you want? Yep, I've got workarounds using FuzzyLikeThis that work for me but have long had a general unease about the "out of the box" experience for others. However things are certainly better than they were when this issue was first raised and the main concerns have been addressed. So FuzzyQuery behaves now more as one would expect Is it worth explicitly stating those expectations? Mine would be based on these principles: 1) IDF is commonly accepted as useful when ranking partial matches of queries with multiple optional clauses 2) IDF doesn't stop being useful if one of those clauses just happens to be a term flagged as "fuzzy". So given a query: rareWord~ OR commonWord~ I would expect an exact match on "rareWord" to rank higher than an exact match on "commonWord". I don't think the current implementation respects this.
          Hide
          Robert Muir added a comment -

          Mark, I tend to agree, but at the same time I think you can safely implement
          a RewriteMethod to do whatever you want? (e.g. apply the logic of FuzzyLikeThis)

          Doing something special with IDF is really specific to certain Similarities, for example
          your Similarity might not use the traditional IDF at all, but something involving
          totalTermFreq and sumOfTotalTermFreq (like language modelling).

          So I am concerned about doing tricky things with the scoring system by default
          for this query... we provide the simple options in core (Scoring, BoostOnly, etc) though.

          An idea would be to factor the logic out of FuzzyLikeThisQuery into a FuzzyLikeThisRewriteMethod,
          so you could just call .setRewriteMethod on your fuzzy query and use it.

          Show
          Robert Muir added a comment - Mark, I tend to agree, but at the same time I think you can safely implement a RewriteMethod to do whatever you want? (e.g. apply the logic of FuzzyLikeThis) Doing something special with IDF is really specific to certain Similarities, for example your Similarity might not use the traditional IDF at all, but something involving totalTermFreq and sumOfTotalTermFreq (like language modelling). So I am concerned about doing tricky things with the scoring system by default for this query... we provide the simple options in core (Scoring, BoostOnly, etc) though. An idea would be to factor the logic out of FuzzyLikeThisQuery into a FuzzyLikeThisRewriteMethod, so you could just call .setRewriteMethod on your fuzzy query and use it.
          Hide
          Uwe Schindler added a comment -

          I agree with your complaints, but this issue was more about MTQ queries at all and strange scoring. So FuzzyQuery behaves now more as one would expect. We already have different variants of this query like FuzzyLikeThis that also solve this issue.

          This is why I closed the issue.

          Show
          Uwe Schindler added a comment - I agree with your complaints, but this issue was more about MTQ queries at all and strange scoring. So FuzzyQuery behaves now more as one would expect. We already have different variants of this query like FuzzyLikeThis that also solve this issue. This is why I closed the issue.
          Hide
          Mark Harwood added a comment -

          So term's idf is not used at all. This should solve this problem

          A sub-optimal for the reasons I outlined earlier:

          The problem with ignoring IDF completely is that it doesn't help balance partial matches where there is >1 fuzzy element in the query e.g.in a query for John~ Patitucci~ I'm probably more interested in a partial match on the rarer surname than a partial match on the common forename. Obliterating IDF completely as a factor would lose this feature (available in FuzzyLikeThisQuery)

          Show
          Mark Harwood added a comment - So term's idf is not used at all. This should solve this problem A sub-optimal for the reasons I outlined earlier: The problem with ignoring IDF completely is that it doesn't help balance partial matches where there is >1 fuzzy element in the query e.g.in a query for John~ Patitucci~ I'm probably more interested in a partial match on the rarer surname than a partial match on the common forename. Obliterating IDF completely as a factor would lose this feature (available in FuzzyLikeThisQuery)
          Hide
          Uwe Schindler added a comment -

          Recent 3.x versions contain a new FuzzyQuery rewrite method (MultiTermQuery.TopTermsBoostOnlyBooleanQueryRewrite) that uses a BooleanQuery with clauses of ConstantScoreQuery(TermQuery) usig the fuzzy boost. So term's idf is not used at all.

          This should solve this problem.

          Show
          Uwe Schindler added a comment - Recent 3.x versions contain a new FuzzyQuery rewrite method (MultiTermQuery.TopTermsBoostOnlyBooleanQueryRewrite) that uses a BooleanQuery with clauses of ConstantScoreQuery(TermQuery) usig the fuzzy boost. So term's idf is not used at all. This should solve this problem.
          Hide
          Mark Harwood added a comment - - edited

          consider simpler case

          OK - but we need to remember that it is important to achieve balance across different fuzzy queries as well as terms within the same fuzzy query.
          Let's stick to the terms within a single fuzzy query for now:

          I guess you would like to score the second term higher, meaning Lower frequency

          No, variant's frequency is not a deciding factor - only edit distance. "Johana" is similarity 0.6 while "Joahn" is 0.2 so I would favour result one (although the this difference seems a little off in this case)
          The basic assumption is that user's input is valid and not a typo (deriving spelling suggestions etc are a different topic and one we shouldnt try cover here).
          Fuzzy matching can drag in all sorts of unqualified variants with massively different frequencies. Because we cannot control these discrepancies we should reward all these alternatives using the known factors we have to hand - the IDF of the user's supposedly valid input and the similarity measure of each variant compared to the input.
          We could get fancy about probability of variants given the other input terms in the query but that feels like its straying into spell checker territory and ngrams etc.

          Show
          Mark Harwood added a comment - - edited consider simpler case OK - but we need to remember that it is important to achieve balance across different fuzzy queries as well as terms within the same fuzzy query. Let's stick to the terms within a single fuzzy query for now: I guess you would like to score the second term higher, meaning Lower frequency No, variant's frequency is not a deciding factor - only edit distance. "Johana" is similarity 0.6 while "Joahn" is 0.2 so I would favour result one (although the this difference seems a little off in this case) The basic assumption is that user's input is valid and not a typo (deriving spelling suggestions etc are a different topic and one we shouldnt try cover here). Fuzzy matching can drag in all sorts of unqualified variants with massively different frequencies. Because we cannot control these discrepancies we should reward all these alternatives using the known factors we have to hand - the IDF of the user's supposedly valid input and the similarity measure of each variant compared to the input. We could get fancy about probability of variants given the other input terms in the query but that feels like its straying into spell checker territory and ngrams etc.
          Hide
          Eks Dev added a comment -

          query for John~ Patitucci~ I'm probably more interested in a partial match on the rarer surname than a partial match on the common forename.

          as a matter of fact, we have not only one frequency to consider, rather two Term frequencies!

          consider simpler case
          Query term: "Johan" //would be High frequency term
          gives:
          Fuzzy Expanded term1 "Johana" // High frequency
          Fuzzy Expanded term2 "Joahn" // Low Freq

          I guess you would like to score the second term higher, meaning Lower frequency (higher IDF)... So far so good.

          Now turn it upside down and search for LF typo "Joahn"... in that case you would preffer HF Term "Johan" from expanded list to score higher...

          Point being, this situation here is just not "complete" without taking both frequencies into consideration (Query Term and Expanded term). In my experience, some simple nonlinear hints based on these two freqs bring some easy precision points (HF-LF Pairs are much more likely to be typos that two HF-HF... ).

          Show
          Eks Dev added a comment - query for John~ Patitucci~ I'm probably more interested in a partial match on the rarer surname than a partial match on the common forename. as a matter of fact, we have not only one frequency to consider, rather two Term frequencies! consider simpler case Query term: "Johan" //would be High frequency term gives: Fuzzy Expanded term1 "Johana" // High frequency Fuzzy Expanded term2 "Joahn" // Low Freq I guess you would like to score the second term higher, meaning Lower frequency (higher IDF)... So far so good. Now turn it upside down and search for LF typo "Joahn"... in that case you would preffer HF Term "Johan" from expanded list to score higher... Point being, this situation here is just not "complete" without taking both frequencies into consideration (Query Term and Expanded term). In my experience, some simple nonlinear hints based on these two freqs bring some easy precision points (HF-LF Pairs are much more likely to be typos that two HF-HF... ).
          Hide
          Robert Muir added a comment -

          My "best-practice" suggestion isn't as simple as offering a choice between preserving IDF for all terms or not.

          Mark, right, my mistake. I will move this patch to LUCENE-124 so there is a simple alternative, you can proceed here with a smarter method... sorry i got confused amongst the different issues

          Show
          Robert Muir added a comment - My "best-practice" suggestion isn't as simple as offering a choice between preserving IDF for all terms or not. Mark, right, my mistake. I will move this patch to LUCENE-124 so there is a simple alternative, you can proceed here with a smarter method... sorry i got confused amongst the different issues
          Hide
          Mark Harwood added a comment -

          My "best-practice" suggestion isn't as simple as offering a choice between preserving IDF for all terms or not.

          Instead, it is a proposal that we should use the input term's IDF for scoring all variants of the same root term (or taking an average of variants where the root term does not exist).

          This I feel preserves the benefits of keeping IDF as a factor (as in my John~ Patitucci~ balancing example) but also eliminating the side effects we see where a rare mis-spelling beats exact matches.

          Show
          Mark Harwood added a comment - My "best-practice" suggestion isn't as simple as offering a choice between preserving IDF for all terms or not. Instead, it is a proposal that we should use the input term's IDF for scoring all variants of the same root term (or taking an average of variants where the root term does not exist). This I feel preserves the benefits of keeping IDF as a factor (as in my John~ Patitucci~ balancing example) but also eliminating the side effects we see where a rare mis-spelling beats exact matches.
          Hide
          Robert Muir added a comment -

          I'm probably more interested in a partial match on the rarer surname than a partial match on the common forename. Obliterating IDF completely as a factor would lose this feature (available in FuzzyLikeThisQuery)

          oh I see, this issue is really different from LUCENE-124 (they arent technically duplicates). I can move my patch there. You can still create a 'smarter' method here, it won't get in the way as now FuzzyQuery does not have a hardcoded rewrite method.

          Show
          Robert Muir added a comment - I'm probably more interested in a partial match on the rarer surname than a partial match on the common forename. Obliterating IDF completely as a factor would lose this feature (available in FuzzyLikeThisQuery) oh I see, this issue is really different from LUCENE-124 (they arent technically duplicates). I can move my patch there. You can still create a 'smarter' method here, it won't get in the way as now FuzzyQuery does not have a hardcoded rewrite method.
          Hide
          Robert Muir added a comment -

          here is a rough patch

          Show
          Robert Muir added a comment - here is a rough patch
          Hide
          Robert Muir added a comment -

          The problem with ignoring IDF completely is that it doesn't help balance partial matches where there is >1 fuzzy element in the query e.g.in a query for John~ Patitucci~ I'm probably more interested in a partial match on the rarer surname than a partial match on the common forename. Obliterating IDF completely as a factor would lose this feature (available in FuzzyLikeThisQuery)

          Mark, it wouldn't lose any features. we simply provide another option, just like we do for other MultiTermQuery rewrites for other queries, so users can choose what they want to use. its just an additional choice.

          Show
          Robert Muir added a comment - The problem with ignoring IDF completely is that it doesn't help balance partial matches where there is >1 fuzzy element in the query e.g.in a query for John~ Patitucci~ I'm probably more interested in a partial match on the rarer surname than a partial match on the common forename. Obliterating IDF completely as a factor would lose this feature (available in FuzzyLikeThisQuery) Mark, it wouldn't lose any features. we simply provide another option, just like we do for other MultiTermQuery rewrites for other queries, so users can choose what they want to use. its just an additional choice.
          Hide
          Mark Harwood added a comment -

          The problem with ignoring IDF completely is that it doesn't help balance partial matches where there is >1 fuzzy element in the query e.g.in a query for John~ Patitucci~ I'm probably more interested in a partial match on the rarer surname than a partial match on the common forename. Obliterating IDF completely as a factor would lose this feature (available in FuzzyLikeThisQuery)

          Show
          Mark Harwood added a comment - The problem with ignoring IDF completely is that it doesn't help balance partial matches where there is >1 fuzzy element in the query e.g.in a query for John~ Patitucci~ I'm probably more interested in a partial match on the rarer surname than a partial match on the common forename. Obliterating IDF completely as a factor would lose this feature (available in FuzzyLikeThisQuery)
          Hide
          Robert Muir added a comment -

          in my opinion this is now fixable in a very nice way.

          we can make an alternative rewrite method for fuzzy that does just like TopTermsRewrite, except it creates a BooleanQuery of ConstantScore queries instead. this way the score will be equal to the boost.

          then users could choose which one they want to use.

          Show
          Robert Muir added a comment - in my opinion this is now fixable in a very nice way. we can make an alternative rewrite method for fuzzy that does just like TopTermsRewrite, except it creates a BooleanQuery of ConstantScore queries instead. this way the score will be equal to the boost. then users could choose which one they want to use.
          Hide
          Mark Harwood added a comment -

          This patch goes back a while.
          Contrib's FuzzyLikeThisQuery contains my current "best practice" for fuzzy matching but the logic is mixed in with code that also does "LikeThis" optimisations ie working out which input terms are the best to search on rather than using all input terms. This could usefully be lifted out and used elsewhere.

          The fuzzy scoring logic takes the IDF of the input term and uses that as the IDF for scoring all expanded variants. If the input term does not exist then all variants are rewarded with their averaged IDF. Coord is disabled.

          Using some form of IDF is typically desirable to balance a fuzzy query with other (potentially non fuzzy) clauses in the overall user query. Within a fuzzy query (or wildcard or other auto-expanding queries) however I see no reason to differentiate between the auto-expanded terms with different IDF values. In my view these auto-expand queries should generally use the same IDF for all variants and only reward them differently based on edit distance or what other distance metric is meaningful to that form of expansion (e.g. age range query on age 40 +/- 10 years could reward based on closeness to input term 40).

          Cheers
          Mark

          Show
          Mark Harwood added a comment - This patch goes back a while. Contrib's FuzzyLikeThisQuery contains my current "best practice" for fuzzy matching but the logic is mixed in with code that also does "LikeThis" optimisations ie working out which input terms are the best to search on rather than using all input terms. This could usefully be lifted out and used elsewhere. The fuzzy scoring logic takes the IDF of the input term and uses that as the IDF for scoring all expanded variants. If the input term does not exist then all variants are rewarded with their averaged IDF. Coord is disabled. Using some form of IDF is typically desirable to balance a fuzzy query with other (potentially non fuzzy) clauses in the overall user query. Within a fuzzy query (or wildcard or other auto-expanding queries) however I see no reason to differentiate between the auto-expanded terms with different IDF values. In my view these auto-expand queries should generally use the same IDF for all variants and only reward them differently based on edit distance or what other distance metric is meaningful to that form of expansion (e.g. age range query on age 40 +/- 10 years could reward based on closeness to input term 40). Cheers Mark
          Hide
          Alexey Lef added a comment -

          I've been using DisjunctionMaxQuery for term expansion. It seems to be a much more natural fit for this kind of problem. BooleanQuery never worked for me even with disableCoord.

          Show
          Alexey Lef added a comment - I've been using DisjunctionMaxQuery for term expansion. It seems to be a much more natural fit for this kind of problem. BooleanQuery never worked for me even with disableCoord.
          Hide
          Mark Miller added a comment -

          Any consensus on this? We have the disableCoord and now the option to use a constant score mode for all the mult-term queries. Is that enough? What do you think Mark H? Does this patch still make sense?

          Show
          Mark Miller added a comment - Any consensus on this? We have the disableCoord and now the option to use a constant score mode for all the mult-term queries. Is that enough? What do you think Mark H? Does this patch still make sense?
          Hide
          Marco Dissel added a comment -

          >> but assuming "clean data" with no mis-spellings, scoring "rare" terms higher seems like the ideal behavior

          Exact matched of a term should have a higher ranking then fuzzy matched terms.. at least that's the expected behaviour in my situation although i think it seems the most common behaviour.. We could also make this optionally, so that we don't affect the current scoring algorithm..

          ps. i'm using the Lucene.NET port, so by changing/extending this behaviour in Java we (or George) can implement this at the .NET version as well.

          Show
          Marco Dissel added a comment - >> but assuming "clean data" with no mis-spellings, scoring "rare" terms higher seems like the ideal behavior Exact matched of a term should have a higher ranking then fuzzy matched terms.. at least that's the expected behaviour in my situation although i think it seems the most common behaviour.. We could also make this optionally, so that we don't affect the current scoring algorithm.. ps. i'm using the Lucene.NET port, so by changing/extending this behaviour in Java we (or George) can implement this at the .NET version as well.
          Hide
          Hoss Man added a comment -

          I'm not very familiar with this issue, but a quick review of the patch and the existing comments lead me to believe that commiting it as is wouldn't be the cleanest way to deal with this problem:

          1) As Mark mentioned, a "disableCoord" option has been added to BooleanQuery which makes the ExpandedTermsQuery class in this patch unneccessary. furthermore, the ExpandedTermsQuery class is (in my opinion) broken because it ignores any user specific Similarity completely (instead of just ignoring the coord factor)

          2) This patch does not include any test cases.

          Furthermore, this patch (assuming it does what it says it does) would fundementally alter the scoring of a FuzzyQuery – the new scoring may seem like the right thing to do for you and for Mark, but without a larger concensus I'm a little worried that perhaps there are just as many FuzzyQuery users out there who are happy with the current behavior and would consider this change a bug. So any "fix" for this should either be configurable, or have an overwelming amount of support that the change is the "right" thing to do.

          (The main issue seems to be that terms with low idf have a larger impact, and that results in rare mis-spellings getting higher scores – but assuming "clean data" with no mis-spellings, scoring "rare" terms higher seems like the ideal behavior, correct?)

          Show
          Hoss Man added a comment - I'm not very familiar with this issue, but a quick review of the patch and the existing comments lead me to believe that commiting it as is wouldn't be the cleanest way to deal with this problem: 1) As Mark mentioned, a "disableCoord" option has been added to BooleanQuery which makes the ExpandedTermsQuery class in this patch unneccessary. furthermore, the ExpandedTermsQuery class is (in my opinion) broken because it ignores any user specific Similarity completely (instead of just ignoring the coord factor) 2) This patch does not include any test cases. Furthermore, this patch (assuming it does what it says it does) would fundementally alter the scoring of a FuzzyQuery – the new scoring may seem like the right thing to do for you and for Mark, but without a larger concensus I'm a little worried that perhaps there are just as many FuzzyQuery users out there who are happy with the current behavior and would consider this change a bug. So any "fix" for this should either be configurable, or have an overwelming amount of support that the change is the "right" thing to do. (The main issue seems to be that terms with low idf have a larger impact, and that results in rare mis-spellings getting higher scores – but assuming "clean data" with no mis-spellings, scoring "rare" terms higher seems like the ideal behavior, correct?)
          Hide
          Marco Dissel added a comment -

          Why isn't the patch implemented? My users are finding the current result of fuzzy searched not correct (as stated in the sample provided by Mark)

          Show
          Marco Dissel added a comment - Why isn't the patch implemented? My users are finding the current result of fuzzy searched not correct (as stated in the sample provided by Mark)
          Hide
          Mark Harwood added a comment -

          This has been partially addressed.

          Issue 1, the coord factor has been rectified with the introduction of new constructor BooleanQuery(boolean disableCoord).

          Issue 2 (idf ranking) has not been addressed.

          Show
          Mark Harwood added a comment - This has been partially addressed. Issue 1, the coord factor has been rectified with the introduction of new constructor BooleanQuery(boolean disableCoord). Issue 2 (idf ranking) has not been addressed.
          Hide
          Mark Harwood added a comment -

          Created an attachment (id=13887)
          Fix for issues identified in this bug

          Fix for the fuzzy scoring issues. Introduced new subclasses for BooleanQuery
          and TermQuery - ExpandedTermsQuery and ExpandedTermQuery respectively.
          "Fuzzy" queries (range, multiterm and fuzzy) now use these subclasses to make
          use of their new Similarity implementations.

          Show
          Mark Harwood added a comment - Created an attachment (id=13887) Fix for issues identified in this bug Fix for the fuzzy scoring issues. Introduced new subclasses for BooleanQuery and TermQuery - ExpandedTermsQuery and ExpandedTermQuery respectively. "Fuzzy" queries (range, multiterm and fuzzy) now use these subclasses to make use of their new Similarity implementations.

            People

            • Assignee:
              Unassigned
              Reporter:
              Mark Harwood
            • Votes:
              2 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development