Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/queryparser
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The queryparser parses input on whitespace, and sends each whitespace separated term to its own independent token stream.

      This breaks the following at query-time, because they can't see across whitespace boundaries:

      • n-gram analysis
      • shingles
      • synonyms (especially multi-word for whitespace-separated languages)
      • languages where a 'word' can contain whitespace (e.g. vietnamese)

      Its also rather unexpected, as users think their charfilters/tokenizers/tokenfilters will do the same thing at index and querytime, but
      in many cases they can't. Instead, preferably the queryparser would parse around only real 'operators'.

      1. LUCENE-2605.patch
        86 kB
        Steve Rowe
      2. LUCENE-2605.patch
        57 kB
        Steve Rowe
      3. LUCENE-2605.patch
        44 kB
        Steve Rowe
      4. LUCENE-2605.patch
        48 kB
        Steve Rowe
      5. LUCENE-2605.patch
        21 kB
        Steve Rowe
      6. LUCENE-2605.patch
        12 kB
        Steve Rowe

        Issue Links

          Activity

          Hide
          shenzhuxi added a comment -

          subscribed

          Show
          shenzhuxi added a comment - subscribed
          Hide
          Hoss Man added a comment -

          since (unescaped, unquoted) whitespace characters is the syntax that QueryParser uses to indicate the transition between clauses in a BooleanQuery, changing this (either in QueryParser or in some new query parser) would require coming up with some new syntax. (or in the case of a special case query parser like the FieldQParser in Solr, eliminating the possibility of expressing multi-clause queries)

          Show
          Hoss Man added a comment - since (unescaped, unquoted) whitespace characters is the syntax that QueryParser uses to indicate the transition between clauses in a BooleanQuery, changing this (either in QueryParser or in some new query parser) would require coming up with some new syntax. (or in the case of a special case query parser like the FieldQParser in Solr, eliminating the possibility of expressing multi-clause queries)
          Hide
          John Berryman added a comment - - edited

          subscribed - Current client has index full of clothing - a search for "dress shoes" will return results containing womens' dresses and running shoes. That's not really acceptable.

          Show
          John Berryman added a comment - - edited subscribed - Current client has index full of clothing - a search for "dress shoes" will return results containing womens' dresses and running shoes. That's not really acceptable.
          Hide
          John Berryman added a comment - - edited

          There is somewhat of a workaround for this for defType=lucene. Just escape every whitespace with a slash. So instead of new dress shoes search for new\ dress\ shoes. Of course you lose the ability to use normal lucene syntax.

          I was hoping that this workaround would also work for defType=dismax, but with or without the escaped whitespace, queries get interpreted the same, incorrect way. For instance, assume I have the following line in my synonyms.txt: dress shoes => dress_shoes. Further assume that I have a field experiment that gets analysed with synonyms. A search for new dress shoes (with or without escaped spaces) will be interpreted as

          +((experiment:new)~0.01 (experiment:dress)~0.01 (experiment:shoes)~0.01) (experiment:"new dress_shoes"~3)~0.01

          The first clause is manditory and contains independently analysed tokens, so this will only match documents that contain "dress", "new", or "shoes", but never "dress shoes" because analysis takes place as expected at index time.

          Show
          John Berryman added a comment - - edited There is somewhat of a workaround for this for defType=lucene. Just escape every whitespace with a slash. So instead of new dress shoes search for new\ dress\ shoes . Of course you lose the ability to use normal lucene syntax. I was hoping that this workaround would also work for defType=dismax, but with or without the escaped whitespace, queries get interpreted the same, incorrect way. For instance, assume I have the following line in my synonyms.txt: dress shoes => dress_shoes . Further assume that I have a field experiment that gets analysed with synonyms. A search for new dress shoes (with or without escaped spaces) will be interpreted as +((experiment:new)~0.01 (experiment:dress)~0.01 (experiment:shoes)~0.01) (experiment:"new dress_shoes"~3)~0.01 The first clause is manditory and contains independently analysed tokens, so this will only match documents that contain "dress", "new", or "shoes", but never "dress shoes" because analysis takes place as expected at index time.
          Hide
          Jack Krupansky added a comment -

          My thought on the original issue is that most query parsers should accumulate adjacent terms without intervening operators as a "term list" (quoted phrases would be a second level of term list) and that there needs to be a "list" interface for query term analysis.

          Rather than simply present a raw text stream for the sequence/list of terms, each term would be fed into the token stream with an attribute that indicates which source term it belongs to.

          The synonym processor would see a clean flow of terms and do its processing, but would also need to associate an id with each term of a multi-term synonym phrase so that multiple multi-word synonym choices for the same input term(s) don't get mixed up (i.e., multiple tokens at the same position with no indication of which original synonym phrase they came from).

          By having those ID's for each multi-term synonym phrase, the caller of the list analyzer could then recontruct the tree of "OR" expressions for the various multi-term synonym phrases.

          Show
          Jack Krupansky added a comment - My thought on the original issue is that most query parsers should accumulate adjacent terms without intervening operators as a "term list" (quoted phrases would be a second level of term list) and that there needs to be a "list" interface for query term analysis. Rather than simply present a raw text stream for the sequence/list of terms, each term would be fed into the token stream with an attribute that indicates which source term it belongs to. The synonym processor would see a clean flow of terms and do its processing, but would also need to associate an id with each term of a multi-term synonym phrase so that multiple multi-word synonym choices for the same input term(s) don't get mixed up (i.e., multiple tokens at the same position with no indication of which original synonym phrase they came from). By having those ID's for each multi-term synonym phrase, the caller of the list analyzer could then recontruct the tree of "OR" expressions for the various multi-term synonym phrases.
          Hide
          John Berryman added a comment -

          (How's it going Jack) Interesting idea, though I really need to crack into the QueryParser and play around a little bit before I have a strong opinion myself.

          Show
          John Berryman added a comment - (How's it going Jack) Interesting idea, though I really need to crack into the QueryParser and play around a little bit before I have a strong opinion myself.
          Hide
          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Show
          Steve Rowe added a comment - Bulk move 4.4 issues to 4.5 and 5.0
          Hide
          Uwe Schindler added a comment -

          Move issue to Lucene 4.9.

          Show
          Uwe Schindler added a comment - Move issue to Lucene 4.9.
          Hide
          Fuad Efendi added a comment -

          Is that resolved? Anyone working on it? Thanks

          Show
          Fuad Efendi added a comment - Is that resolved? Anyone working on it? Thanks
          Hide
          Steve Rowe added a comment -

          Initial patch against the classic Lucene QueryParser grammar only. Runs of whitespace-separated text with no operators are now sent as a single input to the analyzer, rather than first splitting on whitespace.

          There are 10 failing tests in the queryparser module (didn't try anywhere else yet), mostly caused by the way getFieldQuery() produces a nested query when there are multiple terms. I'll look into how to address that.

          Show
          Steve Rowe added a comment - Initial patch against the classic Lucene QueryParser grammar only. Runs of whitespace-separated text with no operators are now sent as a single input to the analyzer, rather than first splitting on whitespace. There are 10 failing tests in the queryparser module (didn't try anywhere else yet), mostly caused by the way getFieldQuery() produces a nested query when there are multiple terms. I'll look into how to address that.
          Hide
          Steve Rowe added a comment -

          Patch fixing up two problems:

          1. Multiple whitespace-separated terms' TermQuery-s within a BooleanClause are now flattened directly into the output Query list, rather than inserting the BooleanClause.
          2. MultiFieldQuery's getFieldQuery() is modified to recombine multiple terms from each field's query, to produce a series of disjunctions of term against each field.

          All queryparser module tests now pass, with the exception of the flexible query parser's TestStandardQP run with QueryParserTestBase.testQPA(). Since this patch doesn't modify anything about the flexible query parser, this is not surprising.

          Show
          Steve Rowe added a comment - Patch fixing up two problems: Multiple whitespace-separated terms' TermQuery-s within a BooleanClause are now flattened directly into the output Query list, rather than inserting the BooleanClause. MultiFieldQuery's getFieldQuery() is modified to recombine multiple terms from each field's query, to produce a series of disjunctions of term against each field. All queryparser module tests now pass, with the exception of the flexible query parser's TestStandardQP run with QueryParserTestBase.testQPA(). Since this patch doesn't modify anything about the flexible query parser, this is not surprising.
          Hide
          Steve Rowe added a comment -

          Patch, I think it's ready.

          I've pulled MockSynonymFilter/Analyzer out into their own files in lucene-test-framework, and added tests for it, and added a fixed multi-term source synonym with one single-term target.

          I added tests to TestQueryParser using the modified MockSynonymAnalyzer ensuring operators block multi-term analysis when they should and don't when they shouldn't.

          I'll go make issues now for converting Solr's clone of this QueryParser, and the standard flexible query parser, to add the same capabilities.

          Show
          Steve Rowe added a comment - Patch, I think it's ready. I've pulled MockSynonymFilter/Analyzer out into their own files in lucene-test-framework, and added tests for it, and added a fixed multi-term source synonym with one single-term target. I added tests to TestQueryParser using the modified MockSynonymAnalyzer ensuring operators block multi-term analysis when they should and don't when they shouldn't. I'll go make issues now for converting Solr's clone of this QueryParser, and the standard flexible query parser, to add the same capabilities.
          Hide
          Robert Muir added a comment -

          +1. I did not think it would ever get fixed!

          Show
          Robert Muir added a comment - +1. I did not think it would ever get fixed!
          Hide
          Uwe Schindler added a comment -

          Incredible!

          Strong +1

          Show
          Uwe Schindler added a comment - Incredible! Strong +1
          Hide
          Fuad Efendi added a comment -

          This one was really painful problem (unexpected "tokenization" by query parser!)
          Thank you for fixing that!

          Show
          Fuad Efendi added a comment - This one was really painful problem (unexpected "tokenization" by query parser!) Thank you for fixing that!
          Hide
          Otis Gospodnetic added a comment - - edited

          Steve Rowe you are about to become everyone's hero and a household name!
          Is this going to be in the upcoming 6.1?

          Show
          Otis Gospodnetic added a comment - - edited Steve Rowe you are about to become everyone's hero and a household name! Is this going to be in the upcoming 6.1?
          Hide
          Steve Rowe added a comment -

          No, it will not be in 6.1, but I expect it will make 6.2.

          Show
          Steve Rowe added a comment - No, it will not be in 6.1, but I expect it will make 6.2.
          Hide
          Steve Rowe added a comment -

          Patch adding an option to preserve the old behavior: {set/get}SplitOnWhitespace(), defaulting to true (the current behavior).

          Though nobody said so here, on the Solr issue (SOLR-9185), a couple people mentioned that the old behavior should be preserved, and not be the default until a major release.

          That's what this patch does.

          Show
          Steve Rowe added a comment - Patch adding an option to preserve the old behavior: { set/get}SplitOnWhitespace() , defaulting to true (the current behavior). Though nobody said so here, on the Solr issue ( SOLR-9185 ), a couple people mentioned that the old behavior should be preserved, and not be the default until a major release. That's what this patch does.
          Hide
          Steve Rowe added a comment -

          Patch adds lucene-test-framework files missing from last version of the patch. Also adds a CHANGES entry.

          I plan on committing in a couple days if there are no objections.

          Show
          Steve Rowe added a comment - Patch adds lucene-test-framework files missing from last version of the patch. Also adds a CHANGES entry. I plan on committing in a couple days if there are no objections.
          Hide
          Steve Rowe added a comment -

          Okay, really final patch. On SOLR-9185 I was having trouble integrating the Solr standard QP's comment support with the whitespace tokenization I introduced here, so I tried switching the Solr parser back to ignoring both whitespace and comments, and it worked. The patch brings this grammar simplification back here too - in addition to many fewer whitespace mentions in the rules, fewer (and less complicated) lookaheads are required.

          I've included the generated files in the patch.

          No tests changed from the last patch.

          All Lucene tests pass, and precommit passes.

          Show
          Steve Rowe added a comment - Okay, really final patch. On SOLR-9185 I was having trouble integrating the Solr standard QP's comment support with the whitespace tokenization I introduced here, so I tried switching the Solr parser back to ignoring both whitespace and comments, and it worked. The patch brings this grammar simplification back here too - in addition to many fewer whitespace mentions in the rules, fewer (and less complicated) lookaheads are required. I've included the generated files in the patch. No tests changed from the last patch. All Lucene tests pass, and precommit passes.

            People

            • Assignee:
              Steve Rowe
              Reporter:
              Robert Muir
            • Votes:
              28 Vote for this issue
              Watchers:
              49 Start watching this issue

              Dates

              • Created:
                Updated:

                Development