  1. Lucene - Core
  2. LUCENE-7315

Flexible "standard" query parser parses on whitespace

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/queryparser
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Copied from LUCENE-2605:

      The queryparser splits input on whitespace and sends each whitespace-separated term to its own independent token stream.
      This breaks the following at query time, because they can't see across whitespace boundaries:

      n-gram analysis
      shingles
      synonyms (especially multi-word for whitespace-separated languages)
      languages where a 'word' can contain whitespace (e.g. Vietnamese)

      It's also rather unexpected, as users think their charfilters/tokenizers/tokenfilters will do the same thing at index time and query time, but in many cases they can't. Instead, the queryparser should preferably parse around only real 'operators'.
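
      To make the mismatch concrete, here is a minimal toy sketch in plain Java. It does not use any Lucene classes: `analyze` is a hypothetical stand-in for an analysis chain (whitespace tokenization plus two-word shingles), and `splitThenAnalyze` imitates the pre-splitting behavior described above. All names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: no Lucene classes are used, and all names are hypothetical. */
final class WhitespaceSplitDemo {

    /** Stand-in for an analysis chain: whitespace tokenizer plus a two-word shingle filter. */
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            out.add(words[i]);                          // the single word
            if (i + 1 < words.length) {
                out.add(words[i] + " " + words[i + 1]); // shingle spanning two adjacent words
            }
        }
        return out;
    }

    /** Imitates the behavior in the description: split first, then analyze each piece alone. */
    static List<String> splitThenAnalyze(String query) {
        List<String> out = new ArrayList<>();
        String trimmed = query.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        for (String chunk : trimmed.split("\\s+")) {
            out.addAll(analyze(chunk)); // each chunk is a single word, so no shingle can form
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "new york city";
        System.out.println("analyzed as a whole:  " + analyze(text));
        System.out.println("split, then analyzed: " + splitThenAnalyze(text));
    }
}
```

      Analyzing the whole string yields the shingles "new york" and "york city", while the pre-split path yields only the three single words, so a query could never match the shingle terms that were indexed.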

        Issue Links

          Activity

          steve_rowe Steve Rowe added a comment -

          WIP patch against master. Generated files are not included (running ant javacc-flexible in lucene/queryparser/ will generate them), and the patch still has nocommits and failing tests.

          In addition to making it possible not to split on whitespace before text analysis, the patch includes the following changes:

          • Changed TermQueryNode's positionIncrement name to position, since that's what it really holds.
          • SynonymQueryNode/Builder now produces a SynonymQuery instead of a boolean query.
          • Refactored AnalyzerQueryNodeProcessor.postProcessNode() into shorter methods and made it simpler and easier to follow.
          • Moved split-on-whitespace tests to the shared QueryParserTestBase.

          Some challenges remain:

          • Unlike the classic QP, the flexible standard QP appears to remove a top-level MUST boolean query, e.g. +(word) -> word. Some of the split-on-whitespace shared tests will need to be specialized for each parser.
          • When not splitting on whitespace, text containing whitespace produces a boolean query, and there's no simple way to collapse that query's children into their ancestor boolean query (if there is one), so some of the shared split-on-whitespace tests are failing.
            • The patch includes a FlattenQueryNodeProcessor meant to address this issue, but it's not working and I haven't figured out why yet.
          • Recent master-only changes will likely make the branch_6x backport non-trivial, e.g. LUCENE-7347.
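
          The collapsing problem mentioned in the challenges above can be sketched on a toy tree. This is not the patch's FlattenQueryNodeProcessor and does not use the flexible parser's node classes: `Node`, `term`, `group`, and `flatten` are hypothetical names, and the sketch assumes a nested group can only be merged into its parent when both use the same operator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy sketch only: hypothetical types, not the flexible query parser's node classes. */
final class FlattenSketch {

    /** A leaf term when term is non-null, otherwise a boolean group with an operator. */
    static final class Node {
        final String term;          // set for leaves
        final String op;            // "OR" or "AND", set for groups
        final List<Node> children;  // set for groups

        private Node(String term, String op, List<Node> children) {
            this.term = term;
            this.op = op;
            this.children = children;
        }

        boolean isLeaf() {
            return term != null;
        }

        @Override
        public String toString() {
            if (isLeaf()) {
                return term;
            }
            List<String> parts = new ArrayList<>();
            for (Node child : children) {
                parts.add(child.toString());
            }
            return "(" + String.join(" " + op + " ", parts) + ")";
        }
    }

    static Node term(String text) {
        return new Node(text, null, null);
    }

    static Node group(String op, Node... kids) {
        return new Node(null, op, Arrays.asList(kids));
    }

    /** Returns a copy in which nested groups with the parent's operator are spliced into the parent. */
    static Node flatten(Node node) {
        if (node.isLeaf()) {
            return node;
        }
        List<Node> flat = new ArrayList<>();
        for (Node child : node.children) {
            Node flattened = flatten(child);           // flatten bottom-up
            if (!flattened.isLeaf() && flattened.op.equals(node.op)) {
                flat.addAll(flattened.children);       // same operator: merge into the ancestor
            } else {
                flat.add(flattened);                   // leaf or different operator: keep as is
            }
        }
        return new Node(null, node.op, flat);
    }

    public static void main(String[] args) {
        Node nested = group("OR", term("title"), group("OR", term("new"), term("york")));
        System.out.println(nested + " -> " + flatten(nested));
    }
}
```

          In this toy model `(title OR (new OR york))` becomes `(title OR new OR york)`, while a nested group with a different operator is left nested. A real processor would also have to account for per-clause modifiers, which the sketch ignores.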
          steve_rowe Steve Rowe added a comment -

          Yes.

          mikemccand Michael McCandless added a comment -

          OK I see: this issue is about making the same fixes in LUCENE-2605, which was for the classic query parser, to the flexible query parser.

          mikemccand Michael McCandless added a comment -

          How does this issue differ from LUCENE-2605?


            People

            • Assignee:
              steve_rowe Steve Rowe
              Reporter:
              steve_rowe Steve Rowe