Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-9185

Solr's edismax and "Lucene"/standard query parsers should optionally not split on whitespace before sending terms to analysis

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.5, 7.0
    • Component/s: None
    • Labels:
      None

      Description

      Copied from LUCENE-2605:

      The queryparser parses input on whitespace, and sends each whitespace separated term to its own independent token stream.

      This breaks the following at query-time, because they can't see across whitespace boundaries:

      n-gram analysis
      shingles
      synonyms (especially multi-word for whitespace-separated languages)
      languages where a 'word' can contain whitespace (e.g. vietnamese)

      Its also rather unexpected, as users think their charfilters/tokenizers/tokenfilters will do the same thing at index and querytime, but
      in many cases they can't. Instead, preferably the queryparser would parse around only real 'operators'.

      1. SOLR-9185.patch
        16 kB
        Steve Rowe
      2. SOLR-9185.patch
        21 kB
        Steve Rowe
      3. SOLR-9185.patch
        24 kB
        Steve Rowe
      4. SOLR-9185.patch
        154 kB
        Steve Rowe

        Issue Links

          Activity

          Hide
          steve_rowe Steve Rowe added a comment -

          For edismax, this is a departure: it's supposed to never throw exceptions.

          I'm not sure this is completely accurate. The original dismax parser almost never throws exceptions, mostly because it doesn't handle standard syntax for specifying fields, operators, etc. Because edismax does allow most of that syntax, I think exceptions are expected.

          edismax will catch parsing exceptions, and attempt alternate interpretation by escaping operators in the original query, then re-parsing.

          Show
          steve_rowe Steve Rowe added a comment - For edismax, this is a departure: it's supposed to never throw exceptions. I'm not sure this is completely accurate. The original dismax parser almost never throws exceptions, mostly because it doesn't handle standard syntax for specifying fields, operators, etc. Because edismax does allow most of that syntax, I think exceptions are expected. edismax will catch parsing exceptions, and attempt alternate interpretation by escaping operators in the original query, then re-parsing.
          Hide
          elyograg Shawn Heisey added a comment -

          For edismax, this is a departure: it's supposed to never throw exceptions.

          I'm not sure this is completely accurate. The original dismax parser almost never throws exceptions, mostly because it doesn't handle standard syntax for specifying fields, operators, etc. Because edismax does allow most of that syntax, I think exceptions are expected.

          Show
          elyograg Shawn Heisey added a comment - For edismax, this is a departure: it's supposed to never throw exceptions. I'm not sure this is completely accurate. The original dismax parser almost never throws exceptions, mostly because it doesn't handle standard syntax for specifying fields, operators, etc. Because edismax does allow most of that syntax, I think exceptions are expected.
          Hide
          steve_rowe Steve Rowe added a comment -

          After this has been committed, I'll make a new issue to switch the default behavior on 7.0/master to sow=false.

          Done: SOLR-10310

          Show
          steve_rowe Steve Rowe added a comment - After this has been committed, I'll make a new issue to switch the default behavior on 7.0/master to sow=false . Done: SOLR-10310
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit d1b2fb33ef3bc0ced65feb98c31cffe4f209da7f in lucene-solr's branch refs/heads/master from Steve Rowe
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d1b2fb3 ]

          SOLR-9185: Solr's edismax and Lucene/standard query parsers should optionally not split on whitespace before sending terms to analysis

          Show
          jira-bot ASF subversion and git services added a comment - Commit d1b2fb33ef3bc0ced65feb98c31cffe4f209da7f in lucene-solr's branch refs/heads/master from Steve Rowe [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d1b2fb3 ] SOLR-9185 : Solr's edismax and Lucene/standard query parsers should optionally not split on whitespace before sending terms to analysis
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 3a3acfda04c3b2490adc8cb55001c82e21961cb5 in lucene-solr's branch refs/heads/branch_6x from Steve Rowe
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3a3acfd ]

          SOLR-9185: Solr's edismax and Lucene/standard query parsers should optionally not split on whitespace before sending terms to analysis

          Conflicts:
          solr/core/src/java/org/apache/solr/search/ExtendedDismaxQParser.java

          Show
          jira-bot ASF subversion and git services added a comment - Commit 3a3acfda04c3b2490adc8cb55001c82e21961cb5 in lucene-solr's branch refs/heads/branch_6x from Steve Rowe [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3a3acfd ] SOLR-9185 : Solr's edismax and Lucene/standard query parsers should optionally not split on whitespace before sending terms to analysis Conflicts: solr/core/src/java/org/apache/solr/search/ExtendedDismaxQParser.java
          Hide
          steve_rowe Steve Rowe added a comment - - edited

          Patch addressing the remaining issues. Precommit and all Solr tests pass. I plan on committing this shortly so that it will make the 6.5 release.

          Both edismax and the standard query parser are covered. I did not add this feature to the dismax parser (or to any other Solr query parsers); if people want this feature added elsewhere, we can do that under another issue.

          Some implementation notes:

          • As noted in previous comments on this issue, the feature is activated by specifying request param sow=false. By default, sow=true; there is no behavior change at all if the sow param is not specified.
          • I ran TestSolrQueryParser.testParsingPerformance() under three conditions: a) unpatched; b) patched using the default behavior (same as sow=true); and c) patched with sow=false to activate the don't-split-on-whitespace code. The best-of-ten results run in a bash loop on my Linux box show all three within about 0.5% of each other's QPS (likely noise): between 91K and 92K QPS. Average-of-ten puts the two patched conditions at roughly 2% slower (88K vs. 90K QPS). I think this is acceptable.
          • When per-field query structures differ, e.g. when one field's analyzer removes stopwords and another's doesn't, edismax's DisjunctionMaxQuery structure when sow=false differs from that produced when sow=true. Briefly, sow=true produces a boolean query containing one dismax query per query term, while sow=false produces a dismax query containing one boolean query per field. Min-should-match processing does (what I think is) the right thing here. See TestExtendedDismaxParser.testSplitOnWhitespace_Different_Field_Analysis() for some examples of this. Note: when sow=false and all queried fields' query structure is the same, edismax does what it has always done: produce a boolean query containing one dismax query per term.
          • There is a new test suite TestMultiWordSynonyms that shows multi-term source synonyms matching at query-time.
          • In order to deal with the set query changes introduced by SOLR-9786, I extended SolrQueryParserBase.RawQuery to hold an array of terms, to enable their later consumption as either a concatenated string (for tokenized fields) or individually (for non-tokenized fields).
          • As noted on LUCENE-7533 for Lucene's classic query parser (and equally applicable to the Solr standard and edismax query parsers), split-on-whitespace=false and autoGeneratePhraseQueries=true don't play well together at present. I've introduced a new exception QueryParserConfigurationException that will be thrown if any queried field is configured with autoGeneratePhraseQueries=true when the sow=false request param is specified. For edismax, this is a departure: it's supposed to never throw exceptions. I think that's okay for now though, since this is an optional/experimental feature. Maybe when sow=false becomes the default (later, under another issue - see below), edismax should just log a warning and produce a query that excludes fields with this problem?

          After this has been committed, I'll make a new issue to switch the default behavior on 7.0/master to sow=false.

          Show
          steve_rowe Steve Rowe added a comment - - edited Patch addressing the remaining issues. Precommit and all Solr tests pass. I plan on committing this shortly so that it will make the 6.5 release. Both edismax and the standard query parser are covered. I did not add this feature to the dismax parser (or to any other Solr query parsers); if people want this feature added elsewhere, we can do that under another issue. Some implementation notes: As noted in previous comments on this issue, the feature is activated by specifying request param sow=false . By default, sow=true ; there is no behavior change at all if the sow param is not specified. I ran TestSolrQueryParser.testParsingPerformance() under three conditions: a) unpatched; b) patched using the default behavior (same as sow=true ); and c) patched with sow=false to activate the don't-split-on-whitespace code. The best-of-ten results run in a bash loop on my Linux box show all three within about 0.5% of each other's QPS (likely noise): between 91K and 92K QPS. Average-of-ten puts the two patched conditions at roughly 2% slower (88K vs. 90K QPS). I think this is acceptable. When per-field query structures differ, e.g. when one field's analyzer removes stopwords and another's doesn't, edismax's DisjunctionMaxQuery structure when sow=false differs from that produced when sow=true . Briefly, sow=true produces a boolean query containing one dismax query per query term, while sow=false produces a dismax query containing one boolean query per field. Min-should-match processing does (what I think is) the right thing here. See TestExtendedDismaxParser.testSplitOnWhitespace_Different_Field_Analysis() for some examples of this. Note : when sow=false and all queried fields' query structure is the same, edismax does what it has always done: produce a boolean query containing one dismax query per term. There is a new test suite TestMultiWordSynonyms that shows multi-term source synonyms matching at query-time. In order to deal with the set query changes introduced by SOLR-9786 , I extended SolrQueryParserBase.RawQuery to hold an array of terms, to enable their later consumption as either a concatenated string (for tokenized fields) or individually (for non-tokenized fields). As noted on LUCENE-7533 for Lucene's classic query parser (and equally applicable to the Solr standard and edismax query parsers), split-on-whitespace=false and autoGeneratePhraseQueries=true don't play well together at present. I've introduced a new exception QueryParserConfigurationException that will be thrown if any queried field is configured with autoGeneratePhraseQueries=true when the sow=false request param is specified. For edismax, this is a departure: it's supposed to never throw exceptions. I think that's okay for now though, since this is an optional/experimental feature. Maybe when sow=false becomes the default (later, under another issue - see below), edismax should just log a warning and produce a query that excludes fields with this problem? After this has been committed, I'll make a new issue to switch the default behavior on 7.0/master to sow=false .
          Hide
          erickerickson Erick Erickson added a comment -

          Happened across SOLR-4381 and SOLR-5379 while searching for this JIRA and thought we should check how/if they're related

          Show
          erickerickson Erick Erickson added a comment - Happened across SOLR-4381 and SOLR-5379 while searching for this JIRA and thought we should check how/if they're related
          Hide
          hbeachey Harrison Beachey added a comment -

          I'm excited that this has had some attention this year, any word on continued progress?

          Show
          hbeachey Harrison Beachey added a comment - I'm excited that this has had some attention this year, any word on continued progress?
          Hide
          steve_rowe Steve Rowe added a comment - - edited

          New patch, switches back to ignoring whitespace (along with comments).

          Added new LuceneQParser param sow (S​plit O​n W​hitespace) to control whether to split on whitespace. Defaults to SolrQueryParser.DEFAULT_SPLIT_ON_WHITESPACE (true).

          All Solr core tests pass (with existing split-on-whitespace behavior preserved as the default), and I've added a couple basic multi-word synonym tests. Needs more tests to ensure multiword analysis is properly interrupted in the presence of operators.

          Show
          steve_rowe Steve Rowe added a comment - - edited New patch, switches back to ignoring whitespace (along with comments). Added new LuceneQParser param sow ( S ​plit O ​n W ​hitespace) to control whether to split on whitespace. Defaults to SolrQueryParser.DEFAULT_SPLIT_ON_WHITESPACE (true). All Solr core tests pass (with existing split-on-whitespace behavior preserved as the default), and I've added a couple basic multi-word synonym tests. Needs more tests to ensure multiword analysis is properly interrupted in the presence of operators.
          Hide
          steve_rowe Steve Rowe added a comment -

          This parser's comment support clashes with the approach I took to handling whitespace (tokenization vs. ignoring): when a run of whitespace is interrupted by a comment, multiple WHITESPACE_SEQ tokens are generated, and the rules expect all whitespace runs to be collapsed into a single WHITESPACE_SEQ token. Thinking about a way to address this.

          Show
          steve_rowe Steve Rowe added a comment - This parser's comment support clashes with the approach I took to handling whitespace (tokenization vs. ignoring): when a run of whitespace is interrupted by a comment, multiple WHITESPACE_SEQ tokens are generated, and the rules expect all whitespace runs to be collapsed into a single WHITESPACE_SEQ token. Thinking about a way to address this.
          Hide
          steve_rowe Steve Rowe added a comment -

          Another WIP patch. Progress: parser generates (with ant javacc), compiles (after first applying the patch from LUCENE-2605 and regenerating), and most tests pass with the default split-on-whitespace option (i.e.: true - preserve old behavior). Failing tests (haven't investigated yet):

          • TestSolrQueryParser.testComments()
          • TestPostingsSolrHighlighter: testDefaultSummary() and testEmptySnippet()
          Show
          steve_rowe Steve Rowe added a comment - Another WIP patch. Progress: parser generates (with ant javacc ), compiles (after first applying the patch from LUCENE-2605 and regenerating), and most tests pass with the default split-on-whitespace option (i.e.: true - preserve old behavior). Failing tests (haven't investigated yet): TestSolrQueryParser.testComments() TestPostingsSolrHighlighter : testDefaultSummary() and testEmptySnippet()
          Hide
          steve_rowe Steve Rowe added a comment -

          Initial WIP patch, mostly just grammar changes brought over from LUCENE-2605; no tests yet, and I haven't tried to generate/compile.

          In addition to the grammar changes, I've removed the Version matchVersion param from SolrQueryParserBase.init() - it was being ignored, and the equivalent param was removed from the Lucene classic QueryParser in LUCENE-5859.

          Note that I've added an option to preserve the old behavior ({get,set}SplitOnWhitespace()) and made it default to true - it wasn't necessary to create two separate versions of the parser to enable this.

          Show
          steve_rowe Steve Rowe added a comment - Initial WIP patch, mostly just grammar changes brought over from LUCENE-2605 ; no tests yet, and I haven't tried to generate/compile. In addition to the grammar changes, I've removed the Version matchVersion param from SolrQueryParserBase.init() - it was being ignored, and the equivalent param was removed from the Lucene classic QueryParser in LUCENE-5859 . Note that I've added an option to preserve the old behavior ({ get,set}SplitOnWhitespace() ) and made it default to true - it wasn't necessary to create two separate versions of the parser to enable this.
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          The current behavior might be counter to a new user's expectations, but if we just change how it works by default, a lot of existing users might suddenly be surprised by very different behavior from Solr

          +1, it's been this way since the beginning of Lucene. It's not a bug (esp since in the past multiple terms were treated as a phrase query, and still are depending on configuration - think word delimiter filter).

          I suppose we could add luceneMatchVersion-sensitive code

          That's a very blunt hammer
          It would be nice if we could selectively keep the old behavior, but I'm not sure how difficult that would be w/o duplicating the whole parser.

          Show
          yseeley@gmail.com Yonik Seeley added a comment - The current behavior might be counter to a new user's expectations, but if we just change how it works by default, a lot of existing users might suddenly be surprised by very different behavior from Solr +1, it's been this way since the beginning of Lucene. It's not a bug (esp since in the past multiple terms were treated as a phrase query, and still are depending on configuration - think word delimiter filter). I suppose we could add luceneMatchVersion-sensitive code That's a very blunt hammer It would be nice if we could selectively keep the old behavior, but I'm not sure how difficult that would be w/o duplicating the whole parser.
          Hide
          mjsminkey Mary Jo Sminkey added a comment -

          Agree that in a case like this, an incremental approach is often best, where the new behavior is available via a setting, etc. while warning people that existing method is decremented and will be replaced in the next (or a future) major update.

          Show
          mjsminkey Mary Jo Sminkey added a comment - Agree that in a case like this, an incremental approach is often best, where the new behavior is available via a setting, etc. while warning people that existing method is decremented and will be replaced in the next (or a future) major update.
          Hide
          elyograg Shawn Heisey added a comment -

          The current behavior might be counter to a new user's expectations, but if we just change how it works by default, a lot of existing users might suddenly be surprised by very different behavior from Solr after a minor version upgrade where no config changes were made. In many cases the new behavior might be preferred, but I don't think we can assume that.

          Show
          elyograg Shawn Heisey added a comment - The current behavior might be counter to a new user's expectations, but if we just change how it works by default, a lot of existing users might suddenly be surprised by very different behavior from Solr after a minor version upgrade where no config changes were made. In many cases the new behavior might be preferred, but I don't think we can assume that.
          Hide
          steve_rowe Steve Rowe added a comment -

          I think we need an option that turns the whitespace split off.

          I disagree. I think the current behavior is counter to users' expectations, so we should just get rid of it.

          I suppose we could add luceneMatchVersion-sensitive code and include both versions, but yuck, I'd much rather not do that.

          I think the default behavior in 6.x should remain unchanged. We can change the default in master.

          I disagree. I think we should change the default behavior ASAP.

          The implementation might take a while to become bulletproof. I suspect that the query parser code relies heavily on the current behavior and that things will break in unexpected ways when changing that behavior.

          Here I agree. (e)dismax and other parsers that are based on the Solr clone of the Lucene QP will need work before this change can be released.

          Show
          steve_rowe Steve Rowe added a comment - I think we need an option that turns the whitespace split off. I disagree. I think the current behavior is counter to users' expectations, so we should just get rid of it. I suppose we could add luceneMatchVersion-sensitive code and include both versions, but yuck, I'd much rather not do that. I think the default behavior in 6.x should remain unchanged. We can change the default in master. I disagree. I think we should change the default behavior ASAP. The implementation might take a while to become bulletproof. I suspect that the query parser code relies heavily on the current behavior and that things will break in unexpected ways when changing that behavior. Here I agree. (e)dismax and other parsers that are based on the Solr clone of the Lucene QP will need work before this change can be released.
          Hide
          elyograg Shawn Heisey added a comment -

          I have often wondered whether things would work better if that behavior were changed.

          I think we need an option that turns the whitespace split off. An alternate implementation path is a completely new query parser in addition to the existing parser. Perhaps it could be named "solr".

          Whatever implementation is chosen, I think the default behavior in 6.x should remain unchanged. We can change the default in master.

          The implementation might take a while to become bulletproof. I suspect that the query parser code relies heavily on the current behavior and that things will break in unexpected ways when changing that behavior.

          Show
          elyograg Shawn Heisey added a comment - I have often wondered whether things would work better if that behavior were changed. I think we need an option that turns the whitespace split off. An alternate implementation path is a completely new query parser in addition to the existing parser. Perhaps it could be named "solr". Whatever implementation is chosen, I think the default behavior in 6.x should remain unchanged. We can change the default in master. The implementation might take a while to become bulletproof. I suspect that the query parser code relies heavily on the current behavior and that things will break in unexpected ways when changing that behavior.
          Hide
          mjsminkey Mary Jo Sminkey added a comment -

          This has been an issue for a LONG time and available solutions not always usable let alone ideal. This would be my #1 one thing to be fixed in Solr.

          Show
          mjsminkey Mary Jo Sminkey added a comment - This has been an issue for a LONG time and available solutions not always usable let alone ideal. This would be my #1 one thing to be fixed in Solr.

            People

            • Assignee:
              steve_rowe Steve Rowe
              Reporter:
              steve_rowe Steve Rowe
            • Votes:
              8 Vote for this issue
              Watchers:
              22 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development