Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.2
    • Component/s: core/queryparser
    • Labels:
      None

      Description

      The queryparser parses input on whitespace, and sends each whitespace separated term to its own independent token stream.

      This breaks the following at query-time, because they can't see across whitespace boundaries:

      • n-gram analysis
      • shingles
      • synonyms (especially multi-word for whitespace-separated languages)
      • languages where a 'word' can contain whitespace (e.g. vietnamese)

      Its also rather unexpected, as users think their charfilters/tokenizers/tokenfilters will do the same thing at index and querytime, but
      in many cases they can't. Instead, preferably the queryparser would parse around only real 'operators'.

      1. LUCENE-2605.patch
        86 kB
        Steve Rowe
      2. LUCENE-2605.patch
        57 kB
        Steve Rowe
      3. LUCENE-2605.patch
        44 kB
        Steve Rowe
      4. LUCENE-2605.patch
        48 kB
        Steve Rowe
      5. LUCENE-2605.patch
        21 kB
        Steve Rowe
      6. LUCENE-2605.patch
        12 kB
        Steve Rowe
      7. LUCENE-2605-dont-split-by-default.patch
        5 kB
        Steve Rowe

        Issue Links

          Activity

          Hide
          shenzhuxi added a comment -

          subscribed

          Show
          shenzhuxi added a comment - subscribed
          Hide
          Hoss Man added a comment -

          since (unescaped, unquoted) whitespace characters is the syntax that QueryParser uses to indicate the transition between clauses in a BooleanQuery, changing this (either in QueryParser or in some new query parser) would require coming up with some new syntax. (or in the case of a special case query parser like the FieldQParser in Solr, eliminating the possibility of expressing multi-clause queries)

          Show
          Hoss Man added a comment - since (unescaped, unquoted) whitespace characters is the syntax that QueryParser uses to indicate the transition between clauses in a BooleanQuery, changing this (either in QueryParser or in some new query parser) would require coming up with some new syntax. (or in the case of a special case query parser like the FieldQParser in Solr, eliminating the possibility of expressing multi-clause queries)
          Hide
          John Berryman added a comment - - edited

          subscribed - Current client has index full of clothing - a search for "dress shoes" will return results containing womens' dresses and running shoes. That's not really acceptable.

          Show
          John Berryman added a comment - - edited subscribed - Current client has index full of clothing - a search for "dress shoes" will return results containing womens' dresses and running shoes. That's not really acceptable.
          Hide
          John Berryman added a comment - - edited

          There is somewhat of a workaround for this for defType=lucene. Just escape every whitespace with a slash. So instead of new dress shoes search for new\ dress\ shoes. Of course you lose the ability to use normal lucene syntax.

          I was hoping that this workaround would also work for defType=dismax, but with or without the escaped whitespace, queries get interpreted the same, incorrect way. For instance, assume I have the following line in my synonyms.txt: dress shoes => dress_shoes. Further assume that I have a field experiment that gets analysed with synonyms. A search for new dress shoes (with or without escaped spaces) will be interpreted as

          +((experiment:new)~0.01 (experiment:dress)~0.01 (experiment:shoes)~0.01) (experiment:"new dress_shoes"~3)~0.01

          The first clause is manditory and contains independently analysed tokens, so this will only match documents that contain "dress", "new", or "shoes", but never "dress shoes" because analysis takes place as expected at index time.

          Show
          John Berryman added a comment - - edited There is somewhat of a workaround for this for defType=lucene. Just escape every whitespace with a slash. So instead of new dress shoes search for new\ dress\ shoes . Of course you lose the ability to use normal lucene syntax. I was hoping that this workaround would also work for defType=dismax, but with or without the escaped whitespace, queries get interpreted the same, incorrect way. For instance, assume I have the following line in my synonyms.txt: dress shoes => dress_shoes . Further assume that I have a field experiment that gets analysed with synonyms. A search for new dress shoes (with or without escaped spaces) will be interpreted as +((experiment:new)~0.01 (experiment:dress)~0.01 (experiment:shoes)~0.01) (experiment:"new dress_shoes"~3)~0.01 The first clause is manditory and contains independently analysed tokens, so this will only match documents that contain "dress", "new", or "shoes", but never "dress shoes" because analysis takes place as expected at index time.
          Hide
          Jack Krupansky added a comment -

          My thought on the original issue is that most query parsers should accumulate adjacent terms without intervening operators as a "term list" (quoted phrases would be a second level of term list) and that there needs to be a "list" interface for query term analysis.

          Rather than simply present a raw text stream for the sequence/list of terms, each term would be fed into the token stream with an attribute that indicates which source term it belongs to.

          The synonym processor would see a clean flow of terms and do its processing, but would also need to associate an id with each term of a multi-term synonym phrase so that multiple multi-word synonym choices for the same input term(s) don't get mixed up (i.e., multiple tokens at the same position with no indication of which original synonym phrase they came from).

          By having those ID's for each multi-term synonym phrase, the caller of the list analyzer could then recontruct the tree of "OR" expressions for the various multi-term synonym phrases.

          Show
          Jack Krupansky added a comment - My thought on the original issue is that most query parsers should accumulate adjacent terms without intervening operators as a "term list" (quoted phrases would be a second level of term list) and that there needs to be a "list" interface for query term analysis. Rather than simply present a raw text stream for the sequence/list of terms, each term would be fed into the token stream with an attribute that indicates which source term it belongs to. The synonym processor would see a clean flow of terms and do its processing, but would also need to associate an id with each term of a multi-term synonym phrase so that multiple multi-word synonym choices for the same input term(s) don't get mixed up (i.e., multiple tokens at the same position with no indication of which original synonym phrase they came from). By having those ID's for each multi-term synonym phrase, the caller of the list analyzer could then recontruct the tree of "OR" expressions for the various multi-term synonym phrases.
          Hide
          John Berryman added a comment -

          (How's it going Jack) Interesting idea, though I really need to crack into the QueryParser and play around a little bit before I have a strong opinion myself.

          Show
          John Berryman added a comment - (How's it going Jack) Interesting idea, though I really need to crack into the QueryParser and play around a little bit before I have a strong opinion myself.
          Hide
          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Show
          Steve Rowe added a comment - Bulk move 4.4 issues to 4.5 and 5.0
          Hide
          Uwe Schindler added a comment -

          Move issue to Lucene 4.9.

          Show
          Uwe Schindler added a comment - Move issue to Lucene 4.9.
          Hide
          Fuad Efendi added a comment -

          Is that resolved? Anyone working on it? Thanks

          Show
          Fuad Efendi added a comment - Is that resolved? Anyone working on it? Thanks
          Hide
          Steve Rowe added a comment -

          Initial patch against the classic Lucene QueryParser grammar only. Runs of whitespace-separated text with no operators are now sent as a single input to the analyzer, rather than first splitting on whitespace.

          There are 10 failing tests in the queryparser module (didn't try anywhere else yet), mostly caused by the way getFieldQuery() produces a nested query when there are multiple terms. I'll look into how to address that.

          Show
          Steve Rowe added a comment - Initial patch against the classic Lucene QueryParser grammar only. Runs of whitespace-separated text with no operators are now sent as a single input to the analyzer, rather than first splitting on whitespace. There are 10 failing tests in the queryparser module (didn't try anywhere else yet), mostly caused by the way getFieldQuery() produces a nested query when there are multiple terms. I'll look into how to address that.
          Hide
          Steve Rowe added a comment -

          Patch fixing up two problems:

          1. Multiple whitespace-separated terms' TermQuery-s within a BooleanClause are now flattened directly into the output Query list, rather than inserting the BooleanClause.
          2. MultiFieldQuery's getFieldQuery() is modified to recombine multiple terms from each field's query, to produce a series of disjunctions of term against each field.

          All queryparser module tests now pass, with the exception of the flexible query parser's TestStandardQP run with QueryParserTestBase.testQPA(). Since this patch doesn't modify anything about the flexible query parser, this is not surprising.

          Show
          Steve Rowe added a comment - Patch fixing up two problems: Multiple whitespace-separated terms' TermQuery-s within a BooleanClause are now flattened directly into the output Query list, rather than inserting the BooleanClause. MultiFieldQuery's getFieldQuery() is modified to recombine multiple terms from each field's query, to produce a series of disjunctions of term against each field. All queryparser module tests now pass, with the exception of the flexible query parser's TestStandardQP run with QueryParserTestBase.testQPA(). Since this patch doesn't modify anything about the flexible query parser, this is not surprising.
          Hide
          Steve Rowe added a comment -

          Patch, I think it's ready.

          I've pulled MockSynonymFilter/Analyzer out into their own files in lucene-test-framework, and added tests for it, and added a fixed multi-term source synonym with one single-term target.

          I added tests to TestQueryParser using the modified MockSynonymAnalyzer ensuring operators block multi-term analysis when they should and don't when they shouldn't.

          I'll go make issues now for converting Solr's clone of this QueryParser, and the standard flexible query parser, to add the same capabilities.

          Show
          Steve Rowe added a comment - Patch, I think it's ready. I've pulled MockSynonymFilter/Analyzer out into their own files in lucene-test-framework, and added tests for it, and added a fixed multi-term source synonym with one single-term target. I added tests to TestQueryParser using the modified MockSynonymAnalyzer ensuring operators block multi-term analysis when they should and don't when they shouldn't. I'll go make issues now for converting Solr's clone of this QueryParser, and the standard flexible query parser, to add the same capabilities.
          Hide
          Robert Muir added a comment -

          +1. I did not think it would ever get fixed!

          Show
          Robert Muir added a comment - +1. I did not think it would ever get fixed!
          Hide
          Uwe Schindler added a comment -

          Incredible!

          Strong +1

          Show
          Uwe Schindler added a comment - Incredible! Strong +1
          Hide
          Fuad Efendi added a comment -

          This one was really painful problem (unexpected "tokenization" by query parser!)
          Thank you for fixing that!

          Show
          Fuad Efendi added a comment - This one was really painful problem (unexpected "tokenization" by query parser!) Thank you for fixing that!
          Hide
          Otis Gospodnetic added a comment - - edited

          Steve Rowe you are about to become everyone's hero and a household name!
          Is this going to be in the upcoming 6.1?

          Show
          Otis Gospodnetic added a comment - - edited Steve Rowe you are about to become everyone's hero and a household name! Is this going to be in the upcoming 6.1?
          Hide
          Steve Rowe added a comment -

          No, it will not be in 6.1, but I expect it will make 6.2.

          Show
          Steve Rowe added a comment - No, it will not be in 6.1, but I expect it will make 6.2.
          Hide
          Steve Rowe added a comment -

          Patch adding an option to preserve the old behavior: {set/get}SplitOnWhitespace(), defaulting to true (the current behavior).

          Though nobody said so here, on the Solr issue (SOLR-9185), a couple people mentioned that the old behavior should be preserved, and not be the default until a major release.

          That's what this patch does.

          Show
          Steve Rowe added a comment - Patch adding an option to preserve the old behavior: { set/get}SplitOnWhitespace() , defaulting to true (the current behavior). Though nobody said so here, on the Solr issue ( SOLR-9185 ), a couple people mentioned that the old behavior should be preserved, and not be the default until a major release. That's what this patch does.
          Hide
          Steve Rowe added a comment -

          Patch adds lucene-test-framework files missing from last version of the patch. Also adds a CHANGES entry.

          I plan on committing in a couple days if there are no objections.

          Show
          Steve Rowe added a comment - Patch adds lucene-test-framework files missing from last version of the patch. Also adds a CHANGES entry. I plan on committing in a couple days if there are no objections.
          Hide
          Steve Rowe added a comment -

          Okay, really final patch. On SOLR-9185 I was having trouble integrating the Solr standard QP's comment support with the whitespace tokenization I introduced here, so I tried switching the Solr parser back to ignoring both whitespace and comments, and it worked. The patch brings this grammar simplification back here too - in addition to many fewer whitespace mentions in the rules, fewer (and less complicated) lookaheads are required.

          I've included the generated files in the patch.

          No tests changed from the last patch.

          All Lucene tests pass, and precommit passes.

          Show
          Steve Rowe added a comment - Okay, really final patch. On SOLR-9185 I was having trouble integrating the Solr standard QP's comment support with the whitespace tokenization I introduced here, so I tried switching the Solr parser back to ignoring both whitespace and comments, and it worked. The patch brings this grammar simplification back here too - in addition to many fewer whitespace mentions in the rules, fewer (and less complicated) lookaheads are required. I've included the generated files in the patch. No tests changed from the last patch. All Lucene tests pass, and precommit passes.
          Hide
          ASF subversion and git services added a comment -

          Commit 7d092fac4eabab42006e8e2b5c8a149cb266350c in lucene-solr's branch refs/heads/branch_6x from Steve Rowe
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7d092fa ]

          LUCENE-2605: Add classic QueryParser option setSplitOnWhitespace() to control whether to split on whitespace prior to text analysis. Default behavior remains unchanged: split-on-whitespace=true.

          Show
          ASF subversion and git services added a comment - Commit 7d092fac4eabab42006e8e2b5c8a149cb266350c in lucene-solr's branch refs/heads/branch_6x from Steve Rowe [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7d092fa ] LUCENE-2605 : Add classic QueryParser option setSplitOnWhitespace() to control whether to split on whitespace prior to text analysis. Default behavior remains unchanged: split-on-whitespace=true.
          Hide
          ASF subversion and git services added a comment -

          Commit 17d113dac1e6081a48144679234b00a556210160 in lucene-solr's branch refs/heads/master from Steve Rowe
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=17d113d ]

          LUCENE-2605: Add classic QueryParser option setSplitOnWhitespace() to control whether to split on whitespace prior to text analysis. Default behavior remains unchanged: split-on-whitespace=true.

          Show
          ASF subversion and git services added a comment - Commit 17d113dac1e6081a48144679234b00a556210160 in lucene-solr's branch refs/heads/master from Steve Rowe [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=17d113d ] LUCENE-2605 : Add classic QueryParser option setSplitOnWhitespace() to control whether to split on whitespace prior to text analysis. Default behavior remains unchanged: split-on-whitespace=true.
          Hide
          Steve Rowe added a comment -

          Patch for master only that switches the default split-on-whitespace behavior from do to don't.

          Committing shortly.

          Show
          Steve Rowe added a comment - Patch for master only that switches the default split-on-whitespace behavior from do to don't . Committing shortly.
          Hide
          ASF subversion and git services added a comment -

          Commit 24e5c1c20581f115a2948a72a4cef82a232bb28d in lucene-solr's branch refs/heads/master from Steve Rowe
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=24e5c1c ]

          LUCENE-2605: Classic QueryParser no longer splits on whitespace by default.

          Show
          ASF subversion and git services added a comment - Commit 24e5c1c20581f115a2948a72a4cef82a232bb28d in lucene-solr's branch refs/heads/master from Steve Rowe [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=24e5c1c ] LUCENE-2605 : Classic QueryParser no longer splits on whitespace by default.
          Hide
          Varun Thacker added a comment -

          Removing the unused imports from QueryParserTokenManager.java makes precommit pass again.

          Show
          Varun Thacker added a comment - Removing the unused imports from QueryParserTokenManager.java makes precommit pass again.
          Hide
          ASF subversion and git services added a comment -

          Commit 032e31aea7ce5ea4d396302a1cbeab8a6c525f7d in lucene-solr's branch refs/heads/master from Varun Thacker
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=032e31a ]

          LUCENE-2605: Remove unused imports

          Show
          ASF subversion and git services added a comment - Commit 032e31aea7ce5ea4d396302a1cbeab8a6c525f7d in lucene-solr's branch refs/heads/master from Varun Thacker [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=032e31a ] LUCENE-2605 : Remove unused imports
          Hide
          Steve Rowe added a comment -

          Removing the unused imports from QueryParserTokenManager.java makes precommit pass again.

          Thanks Varun Thacker! I forgot to run precommit after I regenerated on master to change the default split-on-whitespace option.

          Show
          Steve Rowe added a comment - Removing the unused imports from QueryParserTokenManager.java makes precommit pass again. Thanks Varun Thacker ! I forgot to run precommit after I regenerated on master to change the default split-on-whitespace option.
          Hide
          David Smiley added a comment -

          Just adding a little note that SimpleQParser parses on whitespace and analyzes tokens separately. Some day it'd be nice to see that one updated.

          Show
          David Smiley added a comment - Just adding a little note that SimpleQParser parses on whitespace and analyzes tokens separately. Some day it'd be nice to see that one updated.
          Hide
          Adrien Grand added a comment -

          David Smiley I think you can already disable splitting on whitespace with the SimpleQueryParser by disabling the whitespace operator?

          Show
          Adrien Grand added a comment - David Smiley I think you can already disable splitting on whitespace with the SimpleQueryParser by disabling the whitespace operator?
          Hide
          David Smiley added a comment -

          Adrien Grand Indeed you are right! I didn't think to look in the operator list, and didn't notice a test for it. I just tried it out using Solr.

          Show
          David Smiley added a comment - Adrien Grand Indeed you are right! I didn't think to look in the operator list, and didn't notice a test for it. I just tried it out using Solr.
          Hide
          Michael McCandless added a comment -

          Bulk close resolved issues after 6.2.0 release.

          Show
          Michael McCandless added a comment - Bulk close resolved issues after 6.2.0 release.

            People

            • Assignee:
              Steve Rowe
              Reporter:
              Robert Muir
            • Votes:
              28 Vote for this issue
              Watchers:
              52 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development