Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 4.8
    • Component/s: core/queryparser
    • Labels: None
    • Lucene Fields: New

      Description

      The queryparser splits its input on whitespace and sends each whitespace-separated term to its own independent token stream.

      This breaks the following at query-time, because they can't see across whitespace boundaries:

      • n-gram analysis
      • shingles
      • synonyms (especially multi-word for whitespace-separated languages)
      • languages where a 'word' can contain whitespace (e.g. Vietnamese)

      It's also rather unexpected: users assume their charfilters/tokenizers/tokenfilters will do the same thing at index and query time, but
      in many cases they can't. Preferably, the queryparser would instead split only around real 'operators'.
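
      For illustration, a minimal sketch of the behaviour (assuming a recent Lucene release where QueryParser is constructed as QueryParser(field, analyzer); 4.x builds also take a Version argument, and the field name "body" is made up for the example):

        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
        import org.apache.lucene.queryparser.classic.QueryParser;
        import org.apache.lucene.search.Query;

        public class WhitespaceSplitDemo {
          public static void main(String[] args) throws Exception {
            Analyzer analyzer = new WhitespaceAnalyzer();
            QueryParser parser = new QueryParser("body", analyzer);

            // The parser splits on whitespace *before* analysis, so the
            // analyzer runs once per chunk; a synonym, shingle, or n-gram
            // filter in the chain can never see "dress" and "shoes" together.
            Query q = parser.parse("dress shoes");
            System.out.println(q); // body:dress body:shoes -- two independent clauses
          }
        }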


          Activity

          shenzhuxi added a comment -

          subscribed

          Hoss Man added a comment -

          Since (unescaped, unquoted) whitespace is the syntax that QueryParser uses to indicate the transition between clauses in a BooleanQuery, changing this (either in QueryParser or in some new query parser) would require coming up with some new syntax (or, in the case of a special-case query parser like the FieldQParser in Solr, eliminating the possibility of expressing multi-clause queries).
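
          To make the syntactic role of whitespace concrete, continuing the sketch above (hypothetical field "body"):

            // Unquoted whitespace separates BooleanQuery clauses:
            Query clauses = parser.parse("dress shoes");     // body:dress body:shoes
            // Quoting is the existing syntax for keeping the terms together:
            Query phrase = parser.parse("\"dress shoes\"");  // body:"dress shoes"

          Any replacement for whitespace-splitting would need to give these two inputs an equally unambiguous reading.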

          John Berryman added a comment - edited

          subscribed - Current client has an index full of clothing - a search for "dress shoes" will return results containing women's dresses and running shoes. That's not really acceptable.

          John Berryman added a comment - edited

          There is somewhat of a workaround for this for defType=lucene: just escape every whitespace character with a backslash. So instead of new dress shoes, search for new\ dress\ shoes. Of course, you lose the ability to use normal Lucene syntax.

          I was hoping that this workaround would also work for defType=dismax, but with or without the escaped whitespace, queries get interpreted the same, incorrect way. For instance, assume I have the following line in my synonyms.txt: dress shoes => dress_shoes. Further assume that I have a field experiment that gets analysed with synonyms. A search for new dress shoes (with or without escaped spaces) will be interpreted as

          +((experiment:new)~0.01 (experiment:dress)~0.01 (experiment:shoes)~0.01) (experiment:"new dress_shoes"~3)~0.01

          The first clause is mandatory and contains independently analysed tokens, so this will only match documents that contain "dress", "new", or "shoes", but never "dress shoes", because analysis takes place as expected at index time.
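
          A sketch of the escaping workaround described above (the escapeWhitespace helper is hypothetical, not a Lucene API; the field name and analyzer are placeholders):

            import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
            import org.apache.lucene.queryparser.classic.QueryParser;
            import org.apache.lucene.search.Query;

            public class EscapeWhitespaceDemo {
              // Hypothetical helper: put a backslash before every whitespace
              // character so the classic parser lexes the whole input as one
              // term and hands it to the analyzer in a single chunk.
              static String escapeWhitespace(String s) {
                return s.replaceAll("(\\s)", "\\\\$1");
              }

              public static void main(String[] args) throws Exception {
                QueryParser parser =
                    new QueryParser("experiment", new WhitespaceAnalyzer());
                // "new\ dress\ shoes" reaches the analyzer as "new dress shoes";
                // with a synonym-aware analyzer in place of WhitespaceAnalyzer,
                // a rule like "dress shoes => dress_shoes" could then fire.
                Query q = parser.parse(escapeWhitespace("new dress shoes"));
                System.out.println(q);
              }
            }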

          Jack Krupansky added a comment -

          My thought on the original issue is that most query parsers should accumulate adjacent terms without intervening operators as a "term list" (quoted phrases would be a second level of term list) and that there needs to be a "list" interface for query term analysis.

          Rather than simply present a raw text stream for the sequence/list of terms, each term would be fed into the token stream with an attribute that indicates which source term it belongs to.

          The synonym processor would see a clean flow of terms and do its processing, but would also need to associate an id with each term of a multi-term synonym phrase so that multiple multi-word synonym choices for the same input term(s) don't get mixed up (i.e., multiple tokens at the same position with no indication of which original synonym phrase they came from).

          By having those IDs for each multi-term synonym phrase, the caller of the list analyzer could then reconstruct the tree of "OR" expressions for the various multi-term synonym phrases.
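
          One way such a per-token source-term id could look, sketched as a custom attribute (entirely hypothetical; SourceTermAttribute is not an existing Lucene class):

            import org.apache.lucene.util.Attribute;
            import org.apache.lucene.util.AttributeImpl;
            import org.apache.lucene.util.AttributeReflector;

            // Hypothetical: tags each token with the id of the source term (or
            // synonym phrase) it came from, so a caller can tell which original
            // phrase produced which tokens.
            interface SourceTermAttribute extends Attribute {
              void setSourceTermId(int id);
              int getSourceTermId();
            }

            final class SourceTermAttributeImpl extends AttributeImpl
                implements SourceTermAttribute {
              private int id = -1; // -1 = no source term assigned

              @Override public void setSourceTermId(int id) { this.id = id; }
              @Override public int getSourceTermId() { return id; }
              @Override public void clear() { id = -1; }
              @Override public void copyTo(AttributeImpl target) {
                ((SourceTermAttribute) target).setSourceTermId(id);
              }
              @Override public void reflectWith(AttributeReflector reflector) {
                reflector.reflect(SourceTermAttribute.class, "sourceTermId", id);
              }
            }

          A token filter would then call addAttribute(SourceTermAttribute.class); Lucene's default attribute factory finds FooAttributeImpl by naming convention, so in a real codebase each type would live in its own public file.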

          John Berryman added a comment -

          (How's it going Jack) Interesting idea, though I really need to crack into the QueryParser and play around a little bit before I have a strong opinion myself.

          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0


            People

            • Assignee: Unassigned
            • Reporter: Robert Muir
            • Votes: 14
            • Watchers: 25
