SOLR-10423

ShingleFilter causes overly restrictive queries to be produced

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.5
    • Fix Version/s: 6.5.1, 6.6, 7.0
    • Component/s: query parsers
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None

      Description

      When sow=false and ShingleFilter is included in the query analyzer, QueryBuilder produces queries that inappropriately require sequential terms. E.g. the query "A B C" produces (+A_B +B_C) A_B_C when the query analyzer includes <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="false" tokenSeparator="_"/>.

      Aman Deep Singh reported this problem on the solr-user list. From http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201703.mbox/%3cCANEGTX9BwBPwQc-cXieAc7QSAS7x2TgZovOMy5ZTiAgco1p11Q@mail.gmail.com%3e:

      I was trying to use the shingle filter, but it was not creating the query
      as desired.

      my schema is

      <fieldType name="cust_shingle" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.ShingleFilterFactory" outputUnigrams="false" maxShingleSize="4"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
      </fieldType>
      <field name="nameShingle" type="cust_shingle" indexed="true" stored="true"/>
      

      my solr query is

      http://localhost:8983/solr/productCollection/select?
       defType=edismax
      &debugQuery=true
      &q=one%20plus%20one%20four
      &qf=nameShingle
      &sow=false
      &wt=xml
      

      and it was creating the parsed query as

      <str name="parsedquery">
      (+(DisjunctionMaxQuery(((+nameShingle:one plus +nameShingle:plus one
      +nameShingle:one four))) DisjunctionMaxQuery(((+nameShingle:one plus
      +nameShingle:plus one four))) DisjunctionMaxQuery(((+nameShingle:one plus one +nameShingle:one four))) DisjunctionMaxQuery((nameShingle:one plus one four)))~1)/no_coord
      </str>
      <str name="parsedquery_toString">
      *+((((+nameShingle:one plus +nameShingle:plus one +nameShingle:one four))
      ((+nameShingle:one plus +nameShingle:plus one four)) ((+nameShingle:one
      plus one +nameShingle:one four)) (nameShingle:one plus one four))~1)*
      </str>
      

      So the token creation is fine, but the query uses the boolean + operator, which causes the problem: if I have a document named "one plus one", then according to the shingles it should match, since its tokens will be ("one plus", "one plus one", "plus one").

      I have tried using q.op and played around with mm as well, but nothing
      gives me the correct response.

      Any idea how I can fetch that document even if it is missing some of the
      query's tokens?

      My expected response is to get the document "one plus one" even when the user query has additional terms, like "one plus one two" and so on.
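
      The mismatch described in the report can be checked by hand. The following is a minimal, self-contained sketch in plain Java, not Lucene code: the class and method names are hypothetical, the shingling is a simplified stand-in for ShingleFilter (assuming a minimum shingle size of 2, which is consistent with the two-word shingles in the report), and the four alternatives are copied from the parsed query above. It shows that every alternative requires at least one shingle that the document "one plus one" lacks, whereas treating the shingles as optional clauses would find an overlap.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ShingleMatchSketch {

  /** Word shingles of sizes minSize..maxSize, lowercased, joined by a space, no unigrams. */
  static Set<String> shingles(String text, int minSize, int maxSize) {
    String[] words = text.trim().toLowerCase().split("\\s+");
    Set<String> out = new LinkedHashSet<>();
    for (int start = 0; start < words.length; start++) {
      for (int size = minSize; size <= maxSize && start + size <= words.length; size++) {
        out.add(String.join(" ", Arrays.copyOfRange(words, start, start + size)));
      }
    }
    return out;
  }

  /** True if at least one alternative has ALL of its required shingles in the document. */
  static boolean matchesAllRequired(Set<String> docShingles, List<List<String>> alternatives) {
    for (List<String> required : alternatives) {
      if (docShingles.containsAll(required)) {
        return true;
      }
    }
    return false;
  }

  /** True if the document shares at least one shingle with the query (optional clauses). */
  static boolean matchesAnyOptional(Set<String> docShingles, Set<String> queryShingles) {
    for (String shingle : queryShingles) {
      if (docShingles.contains(shingle)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    Set<String> doc = shingles("one plus one", 2, 4);

    // The four alternatives from the reported parsed query; each inner list is a set of "+" clauses.
    List<List<String>> alternatives = Arrays.asList(
        Arrays.asList("one plus", "plus one", "one four"),
        Arrays.asList("one plus", "plus one four"),
        Arrays.asList("one plus one", "one four"),
        Arrays.asList("one plus one four"));

    System.out.println(doc);                                   // [one plus, one plus one, plus one]
    System.out.println(matchesAllRequired(doc, alternatives)); // false
    System.out.println(matchesAnyOptional(doc, shingles("one plus one four", 2, 4))); // true
  }
}
```

      This only models set membership of shingles; it says nothing about how Solr scores or applies mm to the resulting clauses.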

        Activity

        steve_rowe Steve Rowe added a comment - - edited

        I think the fix for this problem is to expose QueryBuilder.setEnableGraphQueries() on Solr field types, in the same way that the autoGeneratePhraseQueries option is now.

        Since 6.5 is the first version of Solr that included the sow=false option, it previously wasn't possible to construct queries using ShingleFilter, because Solr's query parser always split on whitespace before performing analysis, one term at a time.

        The following Lucene unit test calls QueryBuilder.setEnableGraphQueries(false) and succeeds for me. (I added it to the queryparser module's TestQueryParser.java, after adding a test dependency on the analysis-common module.) When I change the test to call assertQueryEquals() instead, which leaves graph queries at their default of enabled, the test fails with this assertion error: Query /A B C/ yielded /(+A_B +B_C) A_B_C/, expecting /Synonym(A_B A_B_C) B_C/.

          public void testShinglesSplitOnWhitespace() throws Exception {
            Analyzer a = new Analyzer() {
              @Override protected TokenStreamComponents createComponents(String s) {
                Tokenizer tokenizer = new MockTokenizer(MockTokenizer.WHITESPACE, false);
                ShingleFilter tokenStream = new ShingleFilter(tokenizer, 2, 3);
                tokenStream.setTokenSeparator("_");
                tokenStream.setOutputUnigrams(false);
                return new TokenStreamComponents(tokenizer, tokenStream);
              }
            };
            boolean oldSplitOnWhitespace = splitOnWhitespace;
            splitOnWhitespace = false;
            try {
              assertQueryEqualsNoGraph("A B C", a, "Synonym(A_B A_B_C) B_C");
            } finally {
              // Restore the shared flag even if the assertion fails.
              splitOnWhitespace = oldSplitOnWhitespace;
            }
          }
        
          public void assertQueryEqualsNoGraph(String query, Analyzer a, String result) throws Exception {
            QueryParser parser = getParser(a);
            parser.setEnableGraphQueries(false);
            Query q = parser.parse(query);
            String s = q.toString("field");
            if (!s.equals(result)) {
              fail("Query /" + query + "/ yielded /" + s + "/, expecting /" + result + "/");
            }
          }
        
        steve_rowe Steve Rowe added a comment - - edited

        Patch with suggested fix and tests: specifying <fieldtype ... enableGraphQueries="false">... allows queries over ShingleFilter'd fields to work as intended.

        Running tests and precommit now. I'd like to include this in Solr 6.5.1.
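
        As a sketch of how the option might be applied, here is the reporter's field type from the description with enableGraphQueries="false" added. The attribute name comes from the patch description above; everything else is copied from the reported schema. It assumes a Solr version that contains this fix (6.5.1, 6.6, or 7.0 per the Fix Version/s field) and that the request still passes sow=false.

```xml
<!-- Reporter's field type with graph query production disabled (SOLR-10423). -->
<fieldType name="cust_shingle" class="solr.TextField" positionIncrementGap="100"
           enableGraphQueries="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" outputUnigrams="false" maxShingleSize="4"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="nameShingle" type="cust_shingle" indexed="true" stored="true"/>
```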

        steve_rowe Steve Rowe added a comment -

        All Solr tests and precommit pass.

        jira-bot ASF subversion and git services added a comment -

        Commit 213e50a982e7a6f4ecb0d47178e7509393b74a7a in lucene-solr's branch refs/heads/branch_6_5 from Steve Rowe
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=213e50a ]

        SOLR-10423: Disable graph query production via schema configuration <fieldtype ... enableGraphQueries="false">. This fixes broken queries for ShingleFilter-containing query-time analyzers when request param sow=false.

        Conflicts:
        solr/core/src/java/org/apache/solr/parser/QueryParser.java
        solr/core/src/java/org/apache/solr/parser/QueryParser.jj
        solr/core/src/java/org/apache/solr/search/ExtendedDismaxQParser.java
        solr/core/src/test/org/apache/solr/search/TestSolrQueryParser.java

        jira-bot ASF subversion and git services added a comment -

        Commit 3a6fbd741110b04d590ced10375b076321fb8bf7 in lucene-solr's branch refs/heads/branch_6x from Steve Rowe
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3a6fbd7 ]

        SOLR-10423: Disable graph query production via schema configuration <fieldtype ... enableGraphQueries="false">. This fixes broken queries for ShingleFilter-containing query-time analyzers when request param sow=false.

        jira-bot ASF subversion and git services added a comment -

        Commit dbd22a6ada774eb30aee4b9312eb7913dee6890e in lucene-solr's branch refs/heads/master from Steve Rowe
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=dbd22a6 ]

        SOLR-10423: Disable graph query production via schema configuration <fieldtype ... enableGraphQueries="false">. This fixes broken queries for ShingleFilter-containing query-time analyzers when request param sow=false.


          People

          • Assignee:
            steve_rowe Steve Rowe
            Reporter:
            steve_rowe Steve Rowe