  1. Lucene - Core
  2. LUCENE-3833

Add an operator to query parser for term quorum (ie: BooleanQuery.setMinimumNumberShouldMatch)

Details

    • New Feature
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • core/queryparser
    • None
    • New

    Description

      A project I'm working on requires term quorum searching with stemming turned off. The users are accustomed to Sphinx search, and thus expect a query like [ A AND (B C D)/2 ] to return only documents that contain A and at least two of B, C or D.

      So this document would match:
      a b c

      But this one wouldn't:
      a b

      This can be a useful form of fuzzy searching, and I think we support it via the MM parameter, but we lack a user-facing operator for this. It would be great to add it.
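
      To make the requested semantics concrete, here is a minimal Python sketch of quorum matching over a bag of terms: a document matches [ A AND (B C D)/2 ] when it contains A and at least two of B, C and D. It is only a model of the behavior, not Lucene code; the function name `quorum_match` and its parameters are hypothetical.

```python
def quorum_match(doc_terms, required, optional, min_should_match):
    """Return True if the document contains every required term and
    at least `min_should_match` of the optional terms.

    Hypothetical helper modelling [ A AND (B C D)/2 ]; not a Lucene API.
    """
    terms = set(doc_terms)
    # Every required term must be present.
    if not set(required) <= terms:
        return False
    # Count how many distinct optional terms the document contains.
    hits = sum(1 for t in set(optional) if t in terms)
    return hits >= min_should_match


# The two example documents, checked against [ a AND (b c d)/2 ]
print(quorum_match("a b c".split(), ["a"], ["b", "c", "d"], 2))  # True
print(quorum_match("a b".split(), ["a"], ["b", "c", "d"], 2))    # False
```

      In Lucene terms, the optional terms would presumably be SHOULD clauses and the count would be the value passed to the minimum-should-match setting named in the issue title.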

      Attachments

        1. LUCENE-3833.patch
          13 kB
          Juan Grande

          Activity

            mlissner Mike Lissner added a comment -

            Oops. Please ignore the bit about stemming above. Poor copy/paste on my behalf.

            janhoy Jan Høydahl added a comment -

            What would be the formal syntax of this quorum operator? Would it only be allowed immediately after a set of ()'s? I would think it would be possible to introduce this into eDisMax.

            janhoy Jan Høydahl added a comment -

            Linking this to edismax mother task

            mlissner Mike Lissner added a comment -

            I'd suggest we follow the syntax of Sphinx[1], and require that this be used immediately after (), with a slash and then a count. Pretty sure this doesn't conflict with anything we've already got.

            So queries would look essentially like this:
            (a b c d)/2 e

            [1]: http://sphinxsearch.com/docs/current.html#extended-syntax
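
            As a rough illustration of the syntax proposed here, the following Python sketch pulls a parenthesized group and its count out of a query string with a regular expression. It handles only flat, non-nested groups and is not how a JavaCC grammar would do it; `QUORUM_GROUP` and `find_quorum_groups` are hypothetical names.

```python
import re

# A flat parenthesized group followed by "/" and a count, e.g. "(a b c d)/2".
# Nested parentheses are deliberately not supported in this sketch.
QUORUM_GROUP = re.compile(r"\(([^()]*)\)/(\d+)")


def find_quorum_groups(query):
    """Return a list of (terms, count) pairs, one per "(...)/N" group."""
    return [(m.group(1).split(), int(m.group(2)))
            for m in QUORUM_GROUP.finditer(query)]


print(find_quorum_groups("(a b c d)/2 e"))  # [(['a', 'b', 'c', 'd'], 2)]
```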

            janhoy Jan Høydahl added a comment -

            I feel such core operators should be implemented on the Lucene level first. Then let it bubble up into Solr and eDisMax. Yes?

            mlissner Mike Lissner added a comment -

            I don't know. We have the MM parameter, but not an operator. I don't know where Lucene's query parser ends and where edismax begins. Happy to change this to a Lucene issue if that makes sense though.

            Anybody know definitively where this should go?

            hossman Chris M. Hostetter added a comment -

            Moved issue from Solr to Lucene since it should really be dealt with in the underlying query parser(s).

            I would suggest that the "~" syntax makes more sense than "/" for this (i.e. [ A AND (B C D)~2 ]) since...

            • "/" was recently added to the query parser as a metacharacter for "quoting" regex queries, and the extremely different meanings might confuse people
            • "~" already serves nearly the same purpose for phrases (slop) and fuzzy queries (amount of fuzziness) ... it seems a natural way to express "how many" of the clauses you want to match.
            mlissner Mike Lissner added a comment -

            Thanks Hoss. I'd advocate for the slash syntax since that's what I believe Lexis and West use, but the tilde makes sense too for the reasons you mention.

            juangrande Juan Grande added a comment -

            Hi,

            I'm attaching a patch that implements this feature for the classic query parser in the trunk. I'm still working on a solution for the flexible.standard implementation.

            The syntax is the same as for sloppy phrases. Some things need to be decided:

            • What should happen when this is applied to something that isn't a boolean query? For example: ([* TO *])~3. In this case, the patch simply ignores the mm.
            • Because in the grammar definition I'm using the same production as for sloppy phrases, decimal values are allowed by the syntax. What should we do when the user enters a non-integer number? Throw a ParseException maybe? Currently, the patch also ignores the mm value in this case.

            I don't really know much about JavaCC, I just learnt the basics to do the patch, so feel free to correct any possible mistakes.

            In this patch I'm removing a constructor that was manually added to ParseException, so it doesn't fail when the sources are regenerated.

            – Juan
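
            The lenient behaviour described for the patch (accept the same suffix as sloppy phrases, and silently ignore a non-integer value) can be modelled in a few lines of Python. This is a sketch of the described behaviour only, not the patch itself; `parse_min_should_match` is a hypothetical name, and returning `None` stands in for "ignore the mm".

```python
import re


def parse_min_should_match(suffix):
    """Interpret the text after "~" as a minimum-should-match count.

    Returns an int for whole-number input and None (meaning "ignore")
    for anything else, e.g. a decimal value such as "2.5".
    Hypothetical helper; it does not reproduce the attached patch.
    """
    if re.fullmatch(r"[0-9]+", suffix):
        return int(suffix)
    return None


print(parse_min_should_match("3"))    # 3
print(parse_min_should_match("2.5"))  # None
```

            A stricter variant would raise an error instead of returning `None`, which corresponds to the ParseException option raised as an open question above.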

            mlissner Mike Lissner added a comment -

            Three thoughts:
            1. Do we need to set a review flag, or are we waiting for something else to get this in?
            2. Ignoring the mm when the query is not a boolean query makes sense to me.
            3. I'd also advocate for ignoring it when the value is a non-integer. Better to fail silently when queries don't make sense than to throw an error. (At least that's my philosophy - don't know about Solr's.)

            ndushay Naomi Dushay added a comment -

            Is this related to https://issues.apache.org/jira/browse/SOLR-3589 ? Would a fix here fix that problem as well? SOLR-3589 is absolutely killing multilingual CJK index searching at places such as HathiTrust and Stanford Libraries.

            tomoko Tomoko Uchida added a comment -

            This issue was moved to GitHub issue: #4906.


            People

              Unassigned Unassigned
              mlissner Mike Lissner
              Votes: 1
              Watchers: 3

                Time Tracking

                  Estimated: 2h
                  Remaining: 2h
                  Logged: Not Specified