Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-7533

Classic query parser: autoGeneratePhraseQueries=true doesn't work when splitOnWhitespace=false

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.2, 6.3, 6.2.1
    • Fix Version/s: master (7.0), 6.4
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      LUCENE-2605 introduced the classic query parser option to not split on whitespace prior to performing analysis.

      From the javadocs for QueryParser.setAutoGeneratePhraseQueries():

      phrase queries will be automatically generated when the analyzer returns more than one term from whitespace delimited text.

      When splitOnWhitespace=false, the output from analysis can now come from multiple whitespace-separated tokens, which breaks code assumptions when autoGeneratePhraseQueries=true: for this combination of options, it's not appropriate to auto-quote multiple non-overlapping tokens produced by analysis. E.g. simple whitespace tokenization over the query "some words" will produce the token sequence ("some", "words"), and even when autoGeneratePhraseQueries=true, we should not be creating a phrase query here.

      1. LUCENE-7533.patch
        21 kB
        Steve Rowe
      2. LUCENE-7533-disallow-option-combo.patch
        10 kB
        Steve Rowe

        Issue Links

          Activity

          Hide
          steve_rowe Steve Rowe added a comment -

          Patch that addresses some of this issue, with some failing tests and nocommits.

          The existing autoGeneratePhraseQueries=true approach generates queries exactly as if the query had contained quotation marks, but as I mentioned above, this is inappropriate when splitOnWhitespace=false and the query text contains spaces.

          The approach in the patch is to add a new QueryBuilder method to handle the autoGeneratePhraseQueries=true case. The query text is split on whitespace and these tokens' offsets are compared to those produced by the configured analyzer. When multiple non-overlapping tokens have offsets within the bounds of a single whitespace-separated token, a phrase query is created. If the original token is present as a token overlapping with the first split token, then a disjunction query is created with the original token and the phrase query of the split tokens.

          I've added a couple of tests that show posincr/poslength/offset output from SynonymFilter and WordDelimiterFilter (likely the two most frequently used analysis components that can create split tokens), and both create corrupt token graphs of various kinds (e.g. LUCENE-6582, LUCENE-5051), so solving this problem in a complete way just isn't possible right now.

          So I'm not happy with the approach in the patch. It only covers a subset of possible token graphs (e.g. more than one overlapping multi-term synonym doesn't work). And it's a lot of new code solving a problem that AFAIK no user has reported (does anybody even use autoGeneratePhraseQueries=true with classic QP?),

          I'd be much happier if we could somehow get TermAutomatonQuery hooked into the query parsers, and then rewrite to simpler queries if possible: LUCENE-6824. First thing though is unbreaking SynonymFilter and friends to produce non-broken token graphs though. Attempts to do this for SynonymFilter have stalled though: LUCENE-6664. (I have a germ of an idea that might break the logjam - I'll post over there.)

          For this issue, maybe instead of my patch, for now, we just disallow autoGeneratePhraseQueries=true when splitOnWhitespace=false.

          Thoughts?

          Show
          steve_rowe Steve Rowe added a comment - Patch that addresses some of this issue, with some failing tests and nocommits. The existing autoGeneratePhraseQueries=true approach generates queries exactly as if the query had contained quotation marks, but as I mentioned above, this is inappropriate when splitOnWhitespace=false and the query text contains spaces. The approach in the patch is to add a new QueryBuilder method to handle the autoGeneratePhraseQueries=true case. The query text is split on whitespace and these tokens' offsets are compared to those produced by the configured analyzer. When multiple non-overlapping tokens have offsets within the bounds of a single whitespace-separated token, a phrase query is created. If the original token is present as a token overlapping with the first split token, then a disjunction query is created with the original token and the phrase query of the split tokens. I've added a couple of tests that show posincr/poslength/offset output from SynonymFilter and WordDelimiterFilter (likely the two most frequently used analysis components that can create split tokens), and both create corrupt token graphs of various kinds (e.g. LUCENE-6582 , LUCENE-5051 ), so solving this problem in a complete way just isn't possible right now. So I'm not happy with the approach in the patch. It only covers a subset of possible token graphs (e.g. more than one overlapping multi-term synonym doesn't work). And it's a lot of new code solving a problem that AFAIK no user has reported (does anybody even use autoGeneratePhraseQueries=true with classic QP?), I'd be much happier if we could somehow get TermAutomatonQuery hooked into the query parsers, and then rewrite to simpler queries if possible: LUCENE-6824 . First thing though is unbreaking SynonymFilter and friends to produce non-broken token graphs though. Attempts to do this for SynonymFilter have stalled though: LUCENE-6664 . (I have a germ of an idea that might break the logjam - I'll post over there.) For this issue, maybe instead of my patch, for now, we just disallow autoGeneratePhraseQueries=true when splitOnWhitespace=false. Thoughts?
          Hide
          steve_rowe Steve Rowe added a comment -

          FYI autoGeneratePhraseQueries was never added to the flexible query parser.

          Show
          steve_rowe Steve Rowe added a comment - FYI autoGeneratePhraseQueries was never added to the flexible query parser.
          Hide
          mikemccand Michael McCandless added a comment -

          +1 to move towards having proper graphs come out of analysis, and letting query parsers produce TAQ. I agree there is a lot of work there though

          Thank you for pointing to LUCENE-6824! I think that issue can be committed ... it had fallen past the event horizon of my TODO list. I'll revive it ...

          Show
          mikemccand Michael McCandless added a comment - +1 to move towards having proper graphs come out of analysis, and letting query parsers produce TAQ. I agree there is a lot of work there though Thank you for pointing to LUCENE-6824 ! I think that issue can be committed ... it had fallen past the event horizon of my TODO list. I'll revive it ...
          Hide
          steve_rowe Steve Rowe added a comment -

          Patch that disallows autoGeneratePhraseQueries=true when splitOnWhitespace=false.

          This is ready to go. I'm going to commit shortly.

          Show
          steve_rowe Steve Rowe added a comment - Patch that disallows autoGeneratePhraseQueries=true when splitOnWhitespace=false. This is ready to go. I'm going to commit shortly.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 6d1962a902bffbf6bea4b81b09524104140d9f73 in lucene-solr's branch refs/heads/branch_6x from Steve Rowe
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6d1962a ]

          LUCENE-7533: Classic query parser: disallow autoGeneratePhraseQueries=true when splitOnWhitespace=false (and vice-versa).

          Show
          jira-bot ASF subversion and git services added a comment - Commit 6d1962a902bffbf6bea4b81b09524104140d9f73 in lucene-solr's branch refs/heads/branch_6x from Steve Rowe [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6d1962a ] LUCENE-7533 : Classic query parser: disallow autoGeneratePhraseQueries=true when splitOnWhitespace=false (and vice-versa).
          Hide
          steve_rowe Steve Rowe added a comment - - edited

          I committed the patch to disallow this combination of options. Hopefully once we unbreak graph token streams, this can be revisited.

          Show
          steve_rowe Steve Rowe added a comment - - edited I committed the patch to disallow this combination of options. Hopefully once we unbreak graph token streams, this can be revisited.

            People

            • Assignee:
              steve_rowe Steve Rowe
              Reporter:
              steve_rowe Steve Rowe
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development