Lucene - Core
LUCENE-3833

Add an operator to query parser for term quorum (ie: BooleanQuery.setMinimumNumberShouldMatch)

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/queryparser
    • Labels: None
    • Lucene Fields: New

      Description

      A project I'm working on requires term quorum searching with stemming turned off. The users are accustomed to Sphinx search, and thus expect a query like [ A AND (B C D)/2 ] to return only documents that contain A and at least two of B, C, or D.

      So this document would match:
      a b c

      But this one wouldn't:
      a b

      This can be a useful form of fuzzy searching, and I think we support it via the MM parameter, but we lack a user-facing operator for this. It would be great to add it.
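To make the intended semantics concrete, here is a minimal Python sketch of the matching rule, independent of Lucene. The function name `matches_quorum` and its parameters are hypothetical, and documents are modeled as plain whitespace-separated tokens with no analysis or stemming.

```python
def matches_quorum(doc_text, required, optional, min_should_match):
    """Return True if the document contains every required term and
    at least `min_should_match` of the optional terms.

    Hypothetical helper for illustration; terms are expected in lowercase.
    """
    tokens = set(doc_text.lower().split())
    if not all(term in tokens for term in required):
        return False
    hits = sum(1 for term in optional if term in tokens)
    return hits >= min_should_match


# The query [ A AND (B C D)/2 ] from the description:
print(matches_quorum("a b c", ["a"], ["b", "c", "d"], 2))  # True
print(matches_quorum("a b", ["a"], ["b", "c", "d"], 2))    # False
```

This mirrors what a required clause plus a group of optional clauses with a minimum-should-match count of 2 is meant to express.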

          Activity

          Naomi Dushay added a comment -

          Is this related to https://issues.apache.org/jira/browse/SOLR-3589 ? Would a fix here resolve that problem as well? SOLR-3589 is absolutely killing searches against multilingual CJK indexes at sites such as HathiTrust and Stanford Libraries.

          Mike added a comment -

          Three thoughts:
          1. Do we need to set a review flag, or are we waiting for something else to get this in?
          2. Ignoring when not a boolean makes sense to me.
          3. I'd also advocate for ignoring the value when it is not an integer. Better to fail silently when a query doesn't make sense than to throw an error. (At least that's my philosophy; I don't know about Solr's.)

          Juan Grande added a comment -

          Hi,

          I'm attaching a patch that implements this feature for the classic query parser in the trunk. I'm still working on a solution for the flexible.standard implementation.

          The syntax is the same as for sloppy phrases. Some things need to be decided:

          • What should happen when this is applied to something that isn't a boolean query? For example: ([* TO *])~3. In this case, the patch simply ignores the mm.
          • Because in the grammar definition I'm using the same production as for sloppy phrases, decimal values are allowed by the syntax. What should we do when the user enters a non-integer number? Throw a ParseException maybe? Currently, the patch also ignores the mm value in this case.
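The two "ignore" behaviours described in the bullets above can be sketched as follows. This is an illustration only, not the code in the patch: the classes `TermQuery` and `BooleanQuery` and the function `apply_min_should_match` are hypothetical stand-ins, and treating any value that does not parse as an integer (including "2.0") as invalid is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class TermQuery:
    # Hypothetical stand-in for any non-boolean query.
    term: str


@dataclass
class BooleanQuery:
    # Hypothetical stand-in for a boolean query with a quorum setting.
    clauses: list = field(default_factory=list)
    min_should_match: int = 0


def apply_min_should_match(query, raw_value):
    """Set the quorum on `query` when it makes sense; otherwise ignore it."""
    if not isinstance(query, BooleanQuery):
        return query  # e.g. ([* TO *])~3: not a boolean query, mm ignored
    try:
        value = int(raw_value)
    except ValueError:
        return query  # e.g. "2.5": not an integer, mm ignored
    query.min_should_match = value
    return query
```

Throwing a parse error instead would replace the two early `return query` branches with raised exceptions; which policy is preferable is exactly the open question above.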

          I don't really know much about JavaCC, I just learnt the basics to do the patch, so feel free to correct any possible mistakes.

          In this patch I'm removing a constructor that was manually added to ParseException, so it doesn't fail when the sources are regenerated.

          – Juan

          Mike added a comment -

          Thanks Hoss. I'd advocate for the slash syntax since that's what I believe Lexis and West use, but the tilde makes sense too for the reasons you mention.

          Hoss Man added a comment -

          Moved issue from Solr to Lucene since it should really be dealt with in the underlying query parser(s).

          I would suggest that the "~" syntax makes more sense than "/" for this (i.e., A AND (B C D)~2), since:

          • "/" was recently added to the query parser as a metacharacter for "quoting" regex queries and the extremely different meanings might confuse people
          • "~" already serves nearly the same purpose for phrases (slop) and fuzzy queries (amount of fuzziness) ... it seems a natural way to express "how many" of the clauses you want to match.
          Mike added a comment -

          I don't know. We have the MM parameter, but not an operator. I don't know where Lucene's query parser ends and where edismax begins. Happy to change this to a Lucene issue if that makes sense though.

          Anybody know definitively where this should go?

          Jan Høydahl added a comment -

          I feel such core operators should be implemented on the Lucene level first. Then let it bubble up into Solr and eDisMax. Yes?

          Mike added a comment -

          I'd suggest we follow the syntax of Sphinx[1], and require that this be used immediately after (), with a slash and then a count. Pretty sure this doesn't conflict with anything we've already got.

          So queries would look essentially like this:
          (a b c d)/2 e

          [1]: http://sphinxsearch.com/docs/current.html#extended-syntax
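As a rough illustration of the proposed surface syntax, the following Python sketch recognizes a parenthesized group followed immediately by a slash and a count. It is not a real query parser: it handles only non-nested groups, the names `QUORUM_GROUP` and `find_quorum_groups` are hypothetical, and accepting "~" as an alternative operator character is an assumption based on the other syntax raised in this thread.

```python
import re

# A parenthesized group without nesting, immediately followed by
# "/" (or, as an assumed alternative, "~") and an integer count.
QUORUM_GROUP = re.compile(r"\(([^()]*)\)[/~](\d+)")


def find_quorum_groups(query):
    """Return a list of (terms, count) for each quorum-suffixed group."""
    return [(m.group(1).split(), int(m.group(2)))
            for m in QUORUM_GROUP.finditer(query)]


print(find_quorum_groups("(a b c d)/2 e"))    # [(['a', 'b', 'c', 'd'], 2)]
print(find_quorum_groups("A AND (B C D)~2"))  # [(['B', 'C', 'D'], 2)]
```

A group with no suffix, such as (a b), is simply not reported, so existing queries would be unaffected under this sketch.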

          Jan Høydahl added a comment -

          Linking this to edismax mother task

          Jan Høydahl added a comment -

          What would be the formal syntax of this quorum operator? Would it only be allowed immediately after a set of ()'s? I would think it would be possible to introduce this into eDisMax.

          Mike added a comment -

          Oops. Please ignore the bit about stemming above. Poor copy/paste on my part.


            People

            • Assignee: Unassigned
            • Reporter: Mike
            • Votes: 1
            • Watchers: 3

                Time Tracking

                Estimated: 2h
                Remaining: 2h
                Logged: Not Specified
