SOLR-1980: Implement boundary match support

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels: None

      Description

      Sometimes you need to specify that a query should match only at the start or end of a field, or be an exact match.

      We should have a query syntax for boundary match, preferably at the lowest possible level, such as the "lucene" query parser.

        Activity

        Hoss Man added a comment -

        removing fixVersion=4.0 since there is no evidence that anyone is currently working on this issue. (this can certainly be revisited if volunteers step forward)

        Robert Muir added a comment -

        rmuir20120906-bulk-40-change

        Hoss Man added a comment -

        Bulk fixing the version info for 4.0-ALPHA and 4.0; all affected issues have "hoss20120711-bulk-40-change" in a comment.

        Jan Høydahl added a comment -

        Shortening the description field. I removed these paragraphs:

        Proposed way of implementation is through a new BoundaryMatchTokenFilter which behaves like this:
        On the index side it inserts special unique tokens at the beginning and end of the field. These could be some weird Unicode sequence.
        On the query side, it looks for the first character matching "^" or the last character matching "$" and replaces them with the special tokens.
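        A minimal sketch of how that removed proposal might look as a Lucene TokenFilter (the class name comes from the proposal; the marker strings and all other details here are hypothetical):

          import java.io.IOException;
          import org.apache.lucene.analysis.TokenFilter;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

          // Sketch only: emit a synthetic start token before the first real token
          // and an end token after the last, so "^"/"$" in a query can be
          // rewritten to match them.
          public final class BoundaryMatchTokenFilter extends TokenFilter {
            private static final String START_MARKER = "\u0001start";  // hypothetical marker
            private static final String END_MARKER = "\u0001end";      // hypothetical marker

            private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
            private boolean started = false;
            private boolean ended = false;

            public BoundaryMatchTokenFilter(TokenStream in) {
              super(in);
            }

            @Override
            public boolean incrementToken() throws IOException {
              if (!started) {
                started = true;
                clearAttributes();
                termAtt.setEmpty().append(START_MARKER);  // start-of-field token
                return true;
              }
              if (input.incrementToken()) {
                return true;  // pass real tokens through unchanged
              }
              if (!ended) {
                ended = true;
                clearAttributes();
                termAtt.setEmpty().append(END_MARKER);  // end-of-field token
                return true;
              }
              return false;
            }

            @Override
            public void reset() throws IOException {
              super.reset();
              started = false;
              ended = false;
            }
          }

        Note that the marker tokens carry no real character offsets in this sketch; as the CharFilter experiment further down the thread shows, offsets are exactly where this kind of trick becomes problematic for highlighting.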

        Jan Høydahl added a comment -

        Tagging this for 4.0, hoping to revive some work on it...

        Btw, any comments on my last syntax suggestion, utilizing term positions @N:M?

        Jan Høydahl added a comment -

        I'm sure I can get it working the way I started, using a CharFilter; however, perhaps it's possible to implement it with a more generic and Lucene-like query syntax utilizing position info from the index:

         title:"quick fox"@N:M
        

        This would mean that the phrase must be anchored between the Nth and Mth token positions in the field. Negative values for N/M would mean relative to the end. Thus "^quick fox$" could be written

         title:"quick fox"@0:-0
        

        Or if you require the phrase to be within the first 10 words OR the last 10 words:

         title:("quick fox"@0:10 OR "quick fox"@-10:-0)
        

        Requiring a term to be exactly @ position 3 would be:

         title:fox@3:3
        

        If this syntax is feasible, we could use the same syntax in eDisMax's pf param in order to tell it to add a position constraint when forming the pf part of the query:

         pf=title@0:-0
        

        This would only generate a phrase match on title if the phrase is an exact match of the whole field.

        Potential issues with multi-valued fields? Is the field delimiter clearly marked or is it only an increment gap?

        Would it be easy to parse such a syntax and generate a Lucene query with the position constraints?
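        For the start-anchored half of this, Lucene's existing span queries already come close. A hedged sketch of what title:"quick fox"@0:10 might compile to (field name and analysis assumed; end-relative positions like -0 have no direct span equivalent, since spans don't know where the field ends):

          import org.apache.lucene.index.Term;
          import org.apache.lucene.search.spans.SpanNearQuery;
          import org.apache.lucene.search.spans.SpanPositionRangeQuery;
          import org.apache.lucene.search.spans.SpanQuery;
          import org.apache.lucene.search.spans.SpanTermQuery;

          public class PositionAnchorSketch {
            // title:"quick fox"@0:10, i.e. the phrase must lie within the first ten positions.
            public static SpanQuery firstTenPositions() {
              SpanQuery phrase = new SpanNearQuery(
                  new SpanQuery[] {
                    new SpanTermQuery(new Term("title", "quick")),
                    new SpanTermQuery(new Term("title", "fox"))
                  },
                  0,      // slop 0: terms must be adjacent
                  true);  // must appear in order
              // Keep only spans whose positions fall inside [0, 10).
              return new SpanPositionRangeQuery(phrase, 0, 10);
            }
          }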

        Robert Muir added a comment -

        Well, it's fine if you are doing matching on something really short; you could index with KeywordTokenizer and use this for some use cases.

        Dawid Weiss added a comment -

        Right... multiple tokens will be an issue here, didn't think of that.

        Robert Muir added a comment -

        You just don't need the anchors for this one (it's implied).

        The syntax is here: http://www.brics.dk/automaton/doc/dk/brics/automaton/RegExp.html

        I don't know if this really solves your problems, as you are talking about multiple tokens.

        Just remember, users have trouble understanding how wildcards interact with stemming and such, so I don't see regexp queries spanning across multiple tokens (analyzed) anytime soon...
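        To illustrate the implied anchoring, a sketch under assumptions: "title_exact" is a hypothetical KeywordTokenizer-analyzed copy of the field, so the whole value is a single term:

          import org.apache.lucene.index.Term;
          import org.apache.lucene.search.RegexpQuery;

          public class RegexpBoundarySketch {
            // A brics regexp must match the entire term, so no ^/$ anchors are
            // needed. With the whole field value indexed as one term, this
            // matches values that start with "quick fox".
            public static RegexpQuery startsWithQuickFox() {
              return new RegexpQuery(new Term("title_exact", "quick fox.*"));
            }
          }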

        Dawid Weiss added a comment -

        Yep, it should be – qp.parse("/^quick fox$/"). Peek at TestQueryParser#testRegexps

        Jan Høydahl added a comment -

        Is this backed by the Lucene query parser? How would you query q="^quick fox$" with the regex query?

        Dawid Weiss added a comment -

        Isn't it what regexp query (automaton-based) currently does (and does it efficiently)?

        Jan Høydahl added a comment -

        Really, this is a type of feature that should be implemented on the Lucene level with proper query language support. Any suggestion on how this could be done, perhaps using the positions and #terms metadata from the index instead of inserting special tokens at the beginning and end?

        Jan Høydahl added a comment -

        I have tried to implement this as a CharFilter and it works pretty well.

        The problem I face is that inserting extra bytes at the beginning and end of the charstream does not play well with highlighting. I get an error:

        org.apache.solr.common.SolrException: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token card exceeds length of provided text sized 43
        at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:473)
        at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:378)
        at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:121)

        Jan Høydahl added a comment -

        Phrase slop would work as before if the ^ and $ are encoded as simple special tokens in the index.

        For multi-valued fields, each sub-value needs to be tagged.

        I think the "^a b c$" syntax is pretty easy to understand. But does it clash with any other feature or special char? Perhaps some existing regex stuff that I don't know about?

        Otis Gospodnetic added a comment -

        What about Span queries - no use here? http://search-lucene.com/jd/lucene/org/apache/lucene/search/spans/SpanQuery.html

        Lance Norskog added a comment -

        Another use case is with phrases, especially sloppy phrases.
        "^hello kitty" would find "hello kitty" at the beginning of the text.
        "^hello"~5 would find "hello" among the first 5 words, but the closer to the beginning, the better. This is especially interesting for consumer searches- people tend to type the first word of a movie title first.


          People

          • Assignee: Unassigned
          • Reporter: Jan Høydahl
          • Votes: 1
          • Watchers: 2
