LUCENE-7698

CommonGramsQueryFilter in the query analyzer chain breaks phrase queries

    Details

    • Lucene Fields:
      New

      Description

      (Please pardon me if the project or component are wrong!)

      CommonGramsQueryFilter breaks phrase queries. The behavior also seems to change with addition or removal of adjacent terms.

      Steps to reproduce:

      1.) Download and extract Solr (in my test case version 6.4.1) somewhere.
      2.) In server/solr/configsets/sample_techproducts_configs/conf/managed-schema, modify the text_general fieldType by adding CommonGramsFilter (index analyzer) and CommonGramsQueryFilter (query analyzer) before the StopFilter:

      <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.CommonGramsFilterFactory" ignoreCase="true" words="stopwords.txt" />
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
          <!-- in this example, we will only use synonyms at query time
          <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
          -->
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
        <analyzer type="query">
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.CommonGramsQueryFilterFactory" ignoreCase="true" words="stopwords.txt"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
      </fieldType>

      3.) Add "with" to server/solr/configsets/sample_techproducts_configs/conf/stopwords.txt and make sure the file has correct line endings (as extracted from the Solr zip, it seems to contain DOS/Windows line endings, which may break things).

      4.) Run the techproducts example with "bin/solr -e techproducts"

      5.) Browse to <http://localhost:8983/solr/techproducts/select?q=%22iPod%20with%20Video%22&debugQuery=true>

      6.) Observe that parsedquery in the debug output is empty

      7.) Browse to <http://localhost:8983/solr/techproducts/select?q=%22Apple%2060%20GB%20iPod%20with%20Video%20Playback%20Black%22&debugQuery=true>

      8.) Observe that parsedquery contains ipod_with as expected but not with_video.
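
      To make the expected result of steps 6 and 8 concrete, here is a minimal stand-alone sketch of which tokens the query-side chain is expected to keep. It is a simplified model, not Lucene's implementation: the class and method names are hypothetical, lowercasing is done up front for brevity, and the selection rule is inferred from the token strings quoted in this issue (ipod_with, with_video, and the test string "testing_the the_factory factory works").

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Simplified model of the expected query-side tokens; NOT Lucene's code. */
public class CommonGramsQuerySketch {

    /**
     * Keeps a bigram wherever a word or its successor is a common word,
     * keeps a plain word where no bigram starts, and drops a trailing word
     * that only closes the last bigram (assumed rule, see lead-in).
     */
    public static List<String> queryTokens(List<String> words, Set<String> commonWords) {
        List<String> out = new ArrayList<>();
        boolean lastWasBigram = false;
        for (int i = 0; i < words.size(); i++) {
            String word = words.get(i).toLowerCase(Locale.ROOT);
            boolean hasNext = i + 1 < words.size();
            String next = hasNext ? words.get(i + 1).toLowerCase(Locale.ROOT) : null;
            boolean startsBigram =
                hasNext && (commonWords.contains(word) || commonWords.contains(next));
            if (startsBigram) {
                out.add(word + "_" + next);   // bigram replaces the unigram
                lastWasBigram = true;
            } else if (hasNext || !lastWasBigram) {
                out.add(word);                // word is not covered by a bigram start
                lastWasBigram = false;
            }
            // else: final word already covered by the preceding bigram, so it is dropped
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> common = Collections.singleton("with");
        // Step 5 query: both bigrams are expected in parsedquery.
        System.out.println(queryTokens(Arrays.asList("iPod", "with", "Video"), common));
        // Step 7 query: ipod_with and with_video should both appear.
        System.out.println(queryTokens(
            Arrays.asList("Apple", "60", "GB", "iPod", "with", "Video", "Playback", "Black"),
            common));
    }
}
```

      Under this model the query from step 5 yields ipod_with and with_video, so an empty parsedquery (step 6) or a missing with_video (step 8) points at how the tokens are consumed downstream rather than at which tokens are selected.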

      1. LUCENE-7698.patch
        6 kB
        Michael McCandless

          Activity

          emaijala Ere Maijala added a comment - - edited

          This seems to be a regression in Solr 6.4.0. At least a quick test shows correct results in 6.3.0.

          emaijala Ere Maijala added a comment -

          Looks to me like LUCENE-7603 broke this.

          mikemccand Michael McCandless added a comment -

          Hmm, no good, sorry about this ... thank you for reporting this Ere Maijala; I'll try to make a Lucene test case showing this.

          mikemccand Michael McCandless added a comment -

          OK I see what's happening: this filter (CommonGramsQueryFilter) deletes the unigram tokens, but keeps posLength=2 on the bigram tokens, which makes a disconnected graph, and then the query parser does the wrong thing.

          I think the right fix is for it to set posLength to 1 when it drops unigram tokens .. I'll work on a patch.
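
          The disconnected graph described above can be illustrated with a small stand-alone model. This is a sketch only: it does not use Lucene's classes, the names TokenGraphSketch, Token, and isConnected are hypothetical, and the position increment of 1 on each surviving bigram is an assumption. Only the posLength values (2 before the proposed fix, 1 after) come from the comment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of a token graph; NOT Lucene's TokenStream API. */
public class TokenGraphSketch {

    /** One token: term text, position increment, and position length. */
    public static final class Token {
        final String term;
        final int posInc;
        final int posLen;

        public Token(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /**
     * Each token is an edge from its start position to start + posLen.
     * The graph counts as connected when every edge starts at a node
     * reachable from node 0 and the final node is reachable too.
     */
    public static boolean isConnected(List<Token> tokens) {
        List<int[]> edges = new ArrayList<>();
        int pos = -1;
        int lastNode = 0;
        for (Token t : tokens) {
            pos += t.posInc;
            edges.add(new int[] {pos, pos + t.posLen});
            lastNode = Math.max(lastNode, pos + t.posLen);
        }
        // Forward reachability from node 0, repeated until nothing changes.
        Set<Integer> reachable = new HashSet<>();
        reachable.add(0);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int[] e : edges) {
                if (reachable.contains(e[0]) && reachable.add(e[1])) {
                    changed = true;
                }
            }
        }
        for (int[] e : edges) {
            if (!reachable.contains(e[0])) {
                return false; // an edge hangs off a node nothing leads to
            }
        }
        return reachable.contains(lastNode);
    }

    public static void main(String[] args) {
        // Unigrams dropped, bigrams keep posLength=2: edges 0->2 and 1->3.
        List<Token> beforeFix = Arrays.asList(
            new Token("ipod_with", 1, 2), new Token("with_video", 1, 2));
        // Proposed fix, posLength=1: edges 0->1 and 1->2.
        List<Token> afterFix = Arrays.asList(
            new Token("ipod_with", 1, 1), new Token("with_video", 1, 1));
        System.out.println("before fix connected: " + isConnected(beforeFix)); // false
        System.out.println("after fix connected:  " + isConnected(afterFix));  // true
    }
}
```

          With posLength=2, the second bigram starts at a node that no remaining token leads to, which is the kind of gap a graph-aware query parser cannot walk across.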

          mikemccand Michael McCandless added a comment -

          OK here's a patch fixing CommonGramsQueryFilter to not create a disconnected graph. Ere Maijala could you please try this and see if it fixes your use case? Thanks.

          I also added an experimental option to QueryBuilder (base class for query parsers) to disable graph handling, as a safety for other tokenizer components that may create disconnected graphs.

          emaijala Ere Maijala added a comment -

          Michael McCandless, thanks for the fix. An initial check indicates that the patch fixes my use case. I ran the tests in branch_6x. The patch didn't quite apply cleanly to branch_6_4 and after applying manually a test didn't compile:

          common.compile-test:
              [mkdir] Created dir: /Users/eremaijala/src/solr/lucene/build/analysis/common/classes/test
              [javac] Compiling 279 source files to /Users/eremaijala/src/solr/lucene/build/analysis/common/classes/test
              [javac] /Users/eremaijala/src/solr/lucene/analysis/common/src/test/org/apache/lucene/analysis/commongrams/TestCommonGramsQueryFilterFactory.java:103: error: cannot find symbol
              [javac]     assertGraphStrings(stream, "testing_the the_factory factory works");
              [javac]     ^
              [javac]   symbol:   method assertGraphStrings(TokenStream,String)
              [javac]   location: class TestCommonGramsQueryFilterFactory
              [javac] Note: Some input files use or override a deprecated API.
              [javac] Note: Recompile with -Xlint:deprecation for details.
              [javac] 1 error
          
          mikemccand Michael McCandless added a comment -

          OK thanks for confirming Ere Maijala; I'll fix that test on backport.

          jira-bot ASF subversion and git services added a comment -

          Commit b9c9cddff7cef08e8b0433a203771e48e662e7b1 in lucene-solr's branch refs/heads/master from Mike McCandless
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b9c9cdd ]

          LUCENE-7698: fix CommonGramsQueryFilter to not produce a disconnected token graph

          jira-bot ASF subversion and git services added a comment -

          Commit d8e493c502d234099c927339426dfe4a01a94219 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d8e493c ]

          LUCENE-7698: fix CommonGramsQueryFilter to not produce a disconnected token graph

          jira-bot ASF subversion and git services added a comment -

          Commit 92ff8682b281a28f40826de4b94548671e580bd8 in lucene-solr's branch refs/heads/branch_6_4 from Mike McCandless
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=92ff868 ]

          LUCENE-7698: fix CommonGramsQueryFilter to not produce a disconnected token graph

          mikemccand Michael McCandless added a comment -

          Thank you Ere Maijala!


            People

            • Assignee:
              Unassigned
              Reporter:
              emaijala Ere Maijala