[LUCENE-3833] Add an operator to query parser for term quorum (ie: BooleanQuery.setMinimumNumberShouldMatch) - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: core/queryparser
Labels: None
Lucene Fields: New

Description

A project I'm working on requires term quorum searching with stemming turned off. The users are accustomed to Sphinx search, and thus expect a query like [ A AND (B C D)/2 ] to return only documents that contain A and at least two of B, C, or D.

So this document would match:
a b c

But this one wouldn't:
a b

This can be a useful form of fuzzy searching, and I think we support it via the MM parameter, but we lack a user-facing operator for this. It would be great to add it.
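
To make the intended semantics concrete, here is a minimal sketch in plain Python. It is not Lucene or Solr code; the function name `matches` and its parameters are made up for illustration. It treats a document as a bag of terms, requires every term in `required`, and requires at least `min_should_match` of the terms in `optional`.

```python
def matches(doc_terms, required, optional, min_should_match):
    """Return True if the document satisfies the quorum query.

    Hypothetical helper for illustration only: `required` holds the
    must-match terms (A), `optional` holds the quorum group (B C D),
    and `min_should_match` is the number after the operator.
    """
    terms = set(doc_terms)
    # Every required term must be present.
    if not all(t in terms for t in required):
        return False
    # Count distinct optional terms that appear in the document.
    hits = sum(1 for t in set(optional) if t in terms)
    return hits >= min_should_match
```

With the query above, `matches("a b c".split(), ["a"], ["b", "c", "d"], 2)` accepts the first document, and the same call with `"a b".split()` rejects the second.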

Attachments

LUCENE-3833.patch (13 kB), attached by Juan Grande, 27/Mar/12 21:20

Issue Links

is part of: SOLR-2368 Improve extended dismax (edismax) parser (Open)
relates to: SOLR-3028 Support for additional query operators (feature parity request) (Open)

Activity

Mike Lissner added a comment - 06/Feb/12 02:38

Oops. Please ignore the bit about stemming above. Poor copy/paste on my behalf.

Jan Høydahl added a comment - 06/Feb/12 18:28

What would be the formal syntax of this quorum operator? Would it only be allowed immediately after a set of ()'s? I would think it would be possible to introduce this into eDisMax.

Jan Høydahl added a comment - 06/Feb/12 18:29

Linking this to the edismax parent task.

Mike Lissner added a comment - 07/Feb/12 04:35

I'd suggest we follow the syntax of Sphinx[1], and require that this be used immediately after (), with a slash and then a count. Pretty sure this doesn't conflict with anything we've already got.

So queries would look essentially like this:
(a b c d)/2 e

[1]: http://sphinxsearch.com/docs/current.html#extended-syntax

Jan Høydahl added a comment - 07/Feb/12 09:56

I feel such core operators should be implemented on the Lucene level first. Then let it bubble up into Solr and eDisMax. Yes?

Mike Lissner added a comment - 08/Feb/12 06:30

I don't know. We have the MM parameter, but not an operator. I don't know where Lucene's query parser ends and where edismax begins. Happy to change this to a Lucene issue if that makes sense though.

Anybody know definitively where this should go?

Chris M. Hostetter added a comment - 29/Feb/12 02:33

Moved issue from Solr to Lucene since it should really be dealt with in the underlying query parser(s).

I would suggest that the "~" syntax makes more sense than "/" for this (i.e. A AND (B C D)~2), since:

- "/" was recently added to the query parser as a metacharacter for "quoting" regex queries, and the extremely different meanings might confuse people.
- "~" already serves nearly the same purpose for phrases (slop) and fuzzy queries (amount of fuzziness), so it seems a natural way to express "how many" of the clauses you want to match.
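
As a rough illustration of the "~" form, the sketch below pulls `(terms)~N` groups out of a query string with a regular expression. This is only an assumption-laden toy in Python, not how the Lucene query parser (a JavaCC grammar) works; the names `QUORUM_GROUP` and `find_quorum_groups` are hypothetical, and nested parentheses are deliberately not handled.

```python
import re

# Hypothetical pattern: a flat parenthesized group followed by ~N,
# where N is a whole number (a value such as 2.5 is not matched).
QUORUM_GROUP = re.compile(r"\(([^()]*)\)~(\d+)(?![\d.])")


def find_quorum_groups(query):
    """Return a list of (terms, minimum) pairs, one per '(...)~N' group."""
    return [
        (m.group(1).split(), int(m.group(2)))
        for m in QUORUM_GROUP.finditer(query)
    ]
```

For `A AND (B C D)~2` this yields one group with the terms B, C, D and a minimum of 2; a query with no such group yields an empty list.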

Mike Lissner added a comment - 29/Feb/12 04:46

Thanks Hoss. I'd advocate for the slash syntax since that's what I believe Lexis and West use, but the tilde makes sense too for the reasons you mention.

Juan Grande added a comment - 27/Mar/12 21:20

Hi,

I'm attaching a patch that implements this feature for the classic query parser in the trunk. I'm still working on a solution for the flexible.standard implementation.

The syntax is the same as for sloppy phrases. Some things need to be decided:

- What should happen when this is applied to something that isn't a boolean query? For example: ([* TO *])~3. In this case, the patch simply ignores the mm.
- Because in the grammar definition I'm using the same production as for sloppy phrases, decimal values are allowed by the syntax. What should we do when the user enters a non-integer number? Throw a ParseException maybe? Currently, the patch also ignores the mm value in this case.
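
The two choices for a non-integer value can be sketched as follows. This is plain Python for illustration, not code from the patch; `resolve_minimum` and its `strict` flag are hypothetical names. The lenient path mirrors the behaviour described above (the value is ignored), and the strict path stands in for throwing a ParseException.

```python
def resolve_minimum(raw_value, is_boolean_group, strict=False):
    """Return the integer minimum to apply, or None when it is ignored.

    Hypothetical helper: `raw_value` is the text after '~' and
    `is_boolean_group` says whether the operand is a boolean query.
    """
    # Applied to something that is not a boolean query: ignore it.
    if not is_boolean_group:
        return None
    try:
        return int(raw_value)
    except ValueError:
        if strict:
            # Stand-in for a ParseException in the real parser.
            raise ValueError(f"quorum must be an integer, got {raw_value!r}")
        # Lenient mode: silently ignore a non-integer value.
        return None
```

Here `resolve_minimum("3", True)` gives 3, while `resolve_minimum("2.5", True)` and `resolve_minimum("3", False)` both give `None`; passing `strict=True` with `"2.5"` raises instead.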

I don't really know much about JavaCC, I just learnt the basics to do the patch, so feel free to correct any possible mistakes.

In this patch I'm removing a constructor that was manually added to ParseException, so it doesn't fail when the sources are regenerated.

– Juan

Mike Lissner added a comment - 02/Apr/12 21:06

Three thoughts:
1. Do we need to set a review flag, or are we waiting for something else to get this in?
2. Ignoring the value when the target is not a boolean query makes sense to me.
3. I'd also advocate for ignoring it when it is a non-integer. Better to fail silently when queries don't make sense than to throw an error. (At least that's my philosophy - don't know about Solr's.)

Naomi Dushay added a comment - 27/Oct/12 21:12

Is this related to https://issues.apache.org/jira/browse/SOLR-3589 ? Would a fix here fix that problem as well? SOLR-3589 is absolutely killing multilingual CJK index searching at sites such as HathiTrust and Stanford Libraries.

Tomoko Uchida added a comment - 28/Aug/22 13:09

This issue was moved to GitHub issue: #4906.

People

Assignee: Unassigned
Reporter: Mike Lissner
Votes: 1
Watchers: 3

Dates

Created: 06/Feb/12 02:36
Updated: 28/Aug/22 13:09

Time Tracking

Estimated, Remaining, and Logged: Not Specified