Solr / SOLR-629

Fuzzy search with DisMax request handler

    Details

    • Type: Improvement
    • Status: Reopened
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.3
    • Fix Version/s: None
    • Component/s: query parsers
    • Labels:
      None

      Description

      The DisMax search handler doesn't support fuzzy queries, which would be quite useful for our usage of Solr. From what I've seen on the list, it's something several people would like to have.

      Following this discussion http://markmail.org/message/tx6kqr7ga6ponefa#query:solr%20dismax%20fuzzy+page:1+mid:c4pciq6rlr4dwtgm+state:results , I added the ability to specify fuzzy query fields in the qf parameter. I kept the patch as conservative as possible.

      The syntax supported is: fieldOne^2.3 fieldTwo~0.3 fieldThree~0.2^-0.4 fieldFour as discussed in the above thread.
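
To make the syntax above concrete, here is a minimal sketch of how such qf entries could be split into a field name, an optional minimum similarity (after `~`) and an optional boost (after `^`). It is not the code from the attached patch: the class name `QfSyntaxSketch`, the `FieldSpec` holder and the regular expression are assumptions made for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the proposed qf syntax; not taken from the patch.
public class QfSyntaxSketch {

    /** Settings for one field; null means the value was not given in the qf entry. */
    public static final class FieldSpec {
        public final Float minSimilarity;
        public final Float boost;

        FieldSpec(Float minSimilarity, Float boost) {
            this.minSimilarity = minSimilarity;
            this.boost = boost;
        }
    }

    // field name, then optional ~similarity, then optional ^boost (boost may be negative)
    private static final Pattern ENTRY = Pattern.compile(
            "([^~^\\s]+)(?:~([0-9]*\\.?[0-9]+))?(?:\\^(-?[0-9]*\\.?[0-9]+))?");

    /** Parses a whitespace-separated qf value into per-field settings, keeping field order. */
    public static Map<String, FieldSpec> parse(String qf) {
        Map<String, FieldSpec> out = new LinkedHashMap<>();
        for (String entry : qf.trim().split("\\s+")) {
            if (entry.isEmpty()) {
                continue; // an empty qf value yields one empty token
            }
            Matcher m = ENTRY.matcher(entry);
            if (!m.matches()) {
                throw new IllegalArgumentException("Unparseable qf entry: " + entry);
            }
            Float similarity = m.group(2) == null ? null : Float.valueOf(m.group(2));
            Float boost = m.group(3) == null ? null : Float.valueOf(m.group(3));
            out.put(m.group(1), new FieldSpec(similarity, boost));
        }
        return out;
    }

    public static void main(String[] args) {
        parse("fieldOne^2.3 fieldTwo~0.3 fieldThree~0.2^-0.4 fieldFour")
                .forEach((name, spec) -> System.out.println(
                        name + " minSimilarity=" + spec.minSimilarity + " boost=" + spec.boost));
    }
}
```

In this sketch a field without `~` has no fuzziness (null similarity) and a field without `^` has no explicit boost; how the real parser treats those defaults is not specified here.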

      The recursive query aliasing should work even with fuzzy query fields using a very simple rule: the aliased fields inherit the minSimilarity of their parent, combined with their own if they have one.

      Only the qf parameter supports this syntax atm. I suppose we should make it usable in pf too. Any opinion?

      Comments are very welcome; I'll spend the time needed to put this patch in good shape.

      Thanks.

      1. dismax_fuzzy_query_field.v0.1.diff
        11 kB
        Guillaume Smet
      2. dismax_fuzzy_query_field.v0.1.diff
        11 kB
        Guillaume Smet

          Activity

          Erick Erickson added a comment -

          2013 Old JIRA cleanup

          Mikelis Zalais added a comment -

          Walter, could you post the patch?

          Walter Underwood added a comment -

          I'm at Chegg now. MarkLogic uses MarkLogic.

          David Smiley added a comment -

          By "we're using locally" do you mean that MarkLogic is using Solr instead of their(your) own product?

          Walter Underwood added a comment -

          I've updated this to work with Solr 3.3.0, the version we're using locally. Is there interest in getting this back into trunk?

          Chris Williams added a comment -

          Sorry, it was a typo. I was using 0.6 for the fuzziness, not 0.06.

          (I have about a week and a half of experience with Solr right now, so bear with me.)
          Assuming you're right about it being the default behavior, is there any alternative way to get it to work? Any fuzzy search with my example above that has a stop word in it doesn't return any results. What kind of field type do you run fuzzy search on? Do you basically just run it on a field that has no filters on it?

          thanks,
          Chris

          Guillaume Smet added a comment -

          > FYI: the patch didn't seem to apply cleanly on 1.3, but worked fine on 1.4

          The old version of the patch, which is still attached, should work with 1.3. At least, I use it on a pre-1.3 version.

          The new one is rebased on 1.4 but is otherwise the exact same patch.

          > I get this as the parsed query:
          > "parsedquery_toString"=>"+(((title_words:the~0.6)~0.01 (title_words:game~0.6)~0.01)~2) ()"
          > (I don't want it running anything on the word 'the' because it's a stop word)

          AFAIK, it's the standard behaviour for fuzziness (and for wildcard queries). The stop word isn't removed because the~0.06 != the; it might be another word.

          Could any Solr guy confirm?

          Note that 0.06 is really too low IMHO. I usually use 0.8 or 0.7 for fuzziness.


          Guillaume

          Chris Williams added a comment -

          Hi,
          FYI: the patch didn't seem to apply cleanly on 1.3, but worked fine on 1.4

          Anyways, I'm having some trouble with this patch. It doesn't seem to respect any of my query filters.

          For example, I have a dismax query
          where q=the game
          where qf = 'title_words~.06'

          where my 'title_words' field is:
          <fieldType name="textExactWSTokenized" class="solr.TextField" positionIncrementGap="100">
            <analyzer>
              <tokenizer class="solr.WhitespaceTokenizerFactory"/>
              <filter class="solr.LowerCaseFilterFactory"/>
              <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
              <filter class="solr.ISOLatin1AccentFilterFactory"/>
              <filter class="solr.StandardFilterFactory"/>
              <filter class="solr.TrimFilterFactory" />
              <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            </analyzer>
          </fieldType>

          I get this as the parsed query:
          "parsedquery_toString"=>"+(((title_words:the~0.6)~0.01 (title_words:game~0.6)~0.01)~2) ()"
          (I don't want it running anything on the word 'the' because it's a stop word)

          Yet if I change qf to just 'title_words' and remove the fuzziness, same query text, I get this:
          "parsedquery_toString"=>"+(((title_words:game)~0.01)~1) ()"
          (which is what I want)

          Guillaume Smet added a comment -

          Hi Otis,

          The proposed syntax is taken from the unit test which is based on the existing one ( see testParseFieldBoosts() in http://svn.apache.org/viewvc/lucene/solr/trunk/src/test/org/apache/solr/util/SolrPluginUtilsTest.java?revision=701485&view=markup ). The existing one contains a negative boost. So does the new one. I didn't change the way Solr parses the values.
          Perhaps we need to be more strict about it?

          There is still an unanswered question from my initial proposal:
          "Only the qf parameter supports this syntax atm. I suppose we should make it usable in pf too. Any opinion?"

          That said, it's probably better to validate the general approach of the patch before thinking about generalizing it.


          Guillaume

          Otis Gospodnetic added a comment -

          Mikelis: have you tried it? Does it work well and as described? Please do and leave your feedback here (or fixes in the form of another patch).

          I haven't looked at the patch, but I like the example syntax.

          Question about "fieldThree~0.2^-0.4" – is that a negative boost? huh?

          Mikelis Zalais added a comment -

          Hi, is there any progress with this?

          Guillaume Smet added a comment -

          Here is the same patch updated to trunk to resolve a few conflicts.

          It would be nice to have some feedback as it could be a nice enhancement for DisMax in Solr 1.4. I can rework it if needed.

          We have been running several instances of Solr with this patch for more than 8 months now, as we really needed fuzzy search with DisMax.

          Thanks.


            People

            • Assignee:
              Unassigned
              Reporter:
              Guillaume Smet
            • Votes:
              11
              Watchers:
              9
