Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-10310

By default, stop splitting on whitespace prior to analysis in edismax and "Lucene"/standard query parsers

    Details

    • Type: Task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.0
    • Component/s: None
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None

      Description

      SOLR-9185 introduced an option on the edismax and standard query parsers to not perform pre-analysis whitespace splitting: the sow=false request param.

      On master/7.0, we should make sow=false the default.

      1. SOLR-10310.patch
        21 kB
        Steve Rowe
      2. SOLR-10310-fix-CopyFieldTest-failure.patch
        2 kB
        Steve Rowe

        Issue Links

          Activity

          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 5ebc24a7c53b4f7fe98c0c86d53401ec9d7ecd75 in lucene-solr's branch refs/heads/branch_7_0 from Cassandra Targett
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5ebc24a ]

          SOLR-10310: Ref Guide updates, missed saving wording change in 1 file

          Show
          jira-bot ASF subversion and git services added a comment - Commit 5ebc24a7c53b4f7fe98c0c86d53401ec9d7ecd75 in lucene-solr's branch refs/heads/branch_7_0 from Cassandra Targett [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5ebc24a ] SOLR-10310 : Ref Guide updates, missed saving wording change in 1 file
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 4f9907349db50df81d2c09479d5ba1e307356e8c in lucene-solr's branch refs/heads/branch_7_0 from Cassandra Targett
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4f99073 ]

          SOLR-10310: Ref Guide updates for sow param default

          Show
          jira-bot ASF subversion and git services added a comment - Commit 4f9907349db50df81d2c09479d5ba1e307356e8c in lucene-solr's branch refs/heads/branch_7_0 from Cassandra Targett [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4f99073 ] SOLR-10310 : Ref Guide updates for sow param default
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 0f1b559850417bc17b7f34bf66c8f8488d96568e in lucene-solr's branch refs/heads/branch_7x from Cassandra Targett
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0f1b559 ]

          SOLR-10310: Ref Guide updates, missed saving wording change in 1 file

          Show
          jira-bot ASF subversion and git services added a comment - Commit 0f1b559850417bc17b7f34bf66c8f8488d96568e in lucene-solr's branch refs/heads/branch_7x from Cassandra Targett [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0f1b559 ] SOLR-10310 : Ref Guide updates, missed saving wording change in 1 file
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 1cabb21d6929d954a1c588f6259777e29a934092 in lucene-solr's branch refs/heads/branch_7x from Cassandra Targett
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1cabb21 ]

          SOLR-10310: Ref Guide updates for sow param default

          Show
          jira-bot ASF subversion and git services added a comment - Commit 1cabb21d6929d954a1c588f6259777e29a934092 in lucene-solr's branch refs/heads/branch_7x from Cassandra Targett [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1cabb21 ] SOLR-10310 : Ref Guide updates for sow param default
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 9662f2fafbdff629464fb3c248965941414b98d6 in lucene-solr's branch refs/heads/master from Cassandra Targett
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=9662f2f ]

          SOLR-10310: Ref Guide updates, missed saving wording change in 1 file

          Show
          jira-bot ASF subversion and git services added a comment - Commit 9662f2fafbdff629464fb3c248965941414b98d6 in lucene-solr's branch refs/heads/master from Cassandra Targett [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=9662f2f ] SOLR-10310 : Ref Guide updates, missed saving wording change in 1 file
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 5f35a08220cbb58b8243743a410ef5c293e3e9ba in lucene-solr's branch refs/heads/master from Cassandra Targett
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5f35a08 ]

          SOLR-10310: Ref Guide updates for sow param default

          Show
          jira-bot ASF subversion and git services added a comment - Commit 5f35a08220cbb58b8243743a410ef5c293e3e9ba in lucene-solr's branch refs/heads/master from Cassandra Targett [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5f35a08 ] SOLR-10310 : Ref Guide updates for sow param default
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 03b484346e65c999b97dff7d316241e2d3cf56d1 in lucene-solr's branch refs/heads/master from Steve Rowe
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=03b4843 ]

          SOLR-10310: Fix CopyFieldTest failure

          Show
          jira-bot ASF subversion and git services added a comment - Commit 03b484346e65c999b97dff7d316241e2d3cf56d1 in lucene-solr's branch refs/heads/master from Steve Rowe [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=03b4843 ] SOLR-10310 : Fix CopyFieldTest failure
          Hide
          steve_rowe Steve Rowe added a comment -

          Patch fixing the failure. Committing shortly.

          Show
          steve_rowe Steve Rowe added a comment - Patch fixing the failure. Committing shortly.
          Hide
          steve_rowe Steve Rowe added a comment -

          My Jenkins found a reproducing seed for a CopyFieldTest failure, and git bisect says that the commit on this issue is to blame - note that it reproduces only if I remove the -Dtests.method=testCatchAllCopyField from the repro line:

          Checking out Revision dd171ff8fe31df578b7e6fab1eb5bfc1ade3f5fc (refs/remotes/origin/master)
          [...]
             [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=CopyFieldTest -Dtests.method=testCatchAllCopyField -Dtests.seed=27931CB10CE6100C -Dtests.slow=true -Dtests.locale=nl-BE -Dtests.timezone=Asia/Manila -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
             [junit4] ERROR   0.05s J9  | CopyFieldTest.testCatchAllCopyField <<<
             [junit4]    > Throwable #1: java.lang.RuntimeException: Exception during query
             [junit4]    > 	at __randomizedtesting.SeedInfo.seed([27931CB10CE6100C:71EDDE8B88DB61D0]:0)
             [junit4]    > 	at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:896)
             [junit4]    > 	at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:863)
             [junit4]    > 	at org.apache.solr.schema.CopyFieldTest.testCatchAllCopyField(CopyFieldTest.java:258)
             [junit4]    > 	at java.lang.Thread.run(Thread.java:745)
             [junit4]    > Caused by: java.lang.RuntimeException: REQUEST FAILED: xpath=//*[@numFound='1']
             [junit4]    > 	xml response was: <?xml version="1.0" encoding="UTF-8"?>
             [junit4]    > <response>
             [junit4]    > <lst name="responseHeader"><int name="status">0</int><int name="QTime">0</int></lst><result name="response" numFound="2" start="0"><doc><int name="id">5</int><arr name="catchall_t"><str>5</str><str>10-1839ACX-93</str><str>AAM46</str><str>1565669397053308928</str></arr><arr name="sku1"><str>10-1839ACX-93</str></arr><arr name="1_s"><str>10-1839ACX-93</str></arr><arr name="1_dest_sub_s"><str>10-1839ACX-93</str></arr><arr name="dest_sub_no_ast_s"><str>10-1839ACX-93</str></arr><arr name="testing123_s"><str>AAM46</str></arr><long name="_version_">1565669397053308928</long><arr name="multiDefault"><str>muLti-Default</str></arr><int name="intDefault">42</int><date name="timestamp">2017-04-25T16:44:51.953Z</date></doc><doc><int name="id">10</int><arr name="catchall_t"><str>10</str><str>test copy field</str><str>this is a simple test of the copy field functionality</str><str>1565669397012414464</str></arr><arr name="title"><str>test copy field</str></arr><arr name="text_en"><str>this is a simple test of the copy field functionality</str></arr><arr name="highlight"><str>this is a simple test of </str><str>this is a simple test of </str></arr><long name="_version_">1565669397012414464</long><arr name="multiDefault"><str>muLti-Default</str></arr><int name="intDefault">42</int><date name="timestamp">2017-04-25T16:44:51.902Z</date></doc></result>
             [junit4]    > </response>
             [junit4]    > 	request was:q=catchall_t:10-1839ACX-93&wt=xml
             [junit4]    > 	at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:889)
             [junit4]    > 	... 41 more
          [...]
             [junit4]   2> NOTE: test params are: codec=FastCompressingStoredFields(storedFieldsFormat=CompressingStoredFieldsFormat(compressionMode=FAST, chunkSize=13049, maxDocsPerChunk=9, blockSize=2), termVectorsFormat=CompressingTermVectorsFormat(compressionMode=FAST, chunkSize=13049, blockSize=2)), sim=RandomSimilarity(queryNorm=false): {}, locale=nl-BE, timezone=Asia/Manila
          

          For some reason, when sow=false, a query that used to match only the doc indexed in the failing method now also matches a doc indexed in another method, which is never removed, so if by chance the other method runs before the failing method, then this failure happens.

          I've got a patch that makes the other method use the same doc id for its indexed doc as the doc id used by the other methods, so that there's only ever one doc in the index at any given time.

          Show
          steve_rowe Steve Rowe added a comment - My Jenkins found a reproducing seed for a CopyFieldTest failure, and git bisect says that the commit on this issue is to blame - note that it reproduces only if I remove the -Dtests.method=testCatchAllCopyField from the repro line: Checking out Revision dd171ff8fe31df578b7e6fab1eb5bfc1ade3f5fc (refs/remotes/origin/master) [...] [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=CopyFieldTest -Dtests.method=testCatchAllCopyField -Dtests.seed=27931CB10CE6100C -Dtests.slow=true -Dtests.locale=nl-BE -Dtests.timezone=Asia/Manila -Dtests.asserts=true -Dtests.file.encoding=US-ASCII [junit4] ERROR 0.05s J9 | CopyFieldTest.testCatchAllCopyField <<< [junit4] > Throwable #1: java.lang.RuntimeException: Exception during query [junit4] > at __randomizedtesting.SeedInfo.seed([27931CB10CE6100C:71EDDE8B88DB61D0]:0) [junit4] > at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:896) [junit4] > at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:863) [junit4] > at org.apache.solr.schema.CopyFieldTest.testCatchAllCopyField(CopyFieldTest.java:258) [junit4] > at java.lang.Thread.run(Thread.java:745) [junit4] > Caused by: java.lang.RuntimeException: REQUEST FAILED: xpath=//*[@numFound='1'] [junit4] > xml response was: <?xml version="1.0" encoding="UTF-8"?> [junit4] > <response> [junit4] > <lst name="responseHeader"><int name="status">0</int><int name="QTime">0</int></lst><result name="response" numFound="2" start="0"><doc><int name="id">5</int><arr name="catchall_t"><str>5</str><str>10-1839ACX-93</str><str>AAM46</str><str>1565669397053308928</str></arr><arr name="sku1"><str>10-1839ACX-93</str></arr><arr name="1_s"><str>10-1839ACX-93</str></arr><arr name="1_dest_sub_s"><str>10-1839ACX-93</str></arr><arr name="dest_sub_no_ast_s"><str>10-1839ACX-93</str></arr><arr name="testing123_s"><str>AAM46</str></arr><long name="_version_">1565669397053308928</long><arr name="multiDefault"><str>muLti-Default</str></arr><int name="intDefault">42</int><date name="timestamp">2017-04-25T16:44:51.953Z</date></doc><doc><int name="id">10</int><arr name="catchall_t"><str>10</str><str>test copy field</str><str>this is a simple test of the copy field functionality</str><str>1565669397012414464</str></arr><arr name="title"><str>test copy field</str></arr><arr name="text_en"><str>this is a simple test of the copy field functionality</str></arr><arr name="highlight"><str>this is a simple test of </str><str>this is a simple test of </str></arr><long name="_version_">1565669397012414464</long><arr name="multiDefault"><str>muLti-Default</str></arr><int name="intDefault">42</int><date name="timestamp">2017-04-25T16:44:51.902Z</date></doc></result> [junit4] > </response> [junit4] > request was:q=catchall_t:10-1839ACX-93&wt=xml [junit4] > at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:889) [junit4] > ... 41 more [...] [junit4] 2> NOTE: test params are: codec=FastCompressingStoredFields(storedFieldsFormat=CompressingStoredFieldsFormat(compressionMode=FAST, chunkSize=13049, maxDocsPerChunk=9, blockSize=2), termVectorsFormat=CompressingTermVectorsFormat(compressionMode=FAST, chunkSize=13049, blockSize=2)), sim=RandomSimilarity(queryNorm=false): {}, locale=nl-BE, timezone=Asia/Manila For some reason, when sow=false , a query that used to match only the doc indexed in the failing method now also matches a doc indexed in another method, which is never removed, so if by chance the other method runs before the failing method, then this failure happens. I've got a patch that makes the other method use the same doc id for its indexed doc as the doc id used by the other methods, so that there's only ever one doc in the index at any given time.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit dd171ff8fe31df578b7e6fab1eb5bfc1ade3f5fc in lucene-solr's branch refs/heads/master from Steve Rowe
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=dd171ff ]

          SOLR-10310: By default, stop splitting on whitespace prior to analysis in edismax and standard/"lucene" query parsers

          Show
          jira-bot ASF subversion and git services added a comment - Commit dd171ff8fe31df578b7e6fab1eb5bfc1ade3f5fc in lucene-solr's branch refs/heads/master from Steve Rowe [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=dd171ff ] SOLR-10310 : By default, stop splitting on whitespace prior to analysis in edismax and standard/"lucene" query parsers
          Hide
          steve_rowe Steve Rowe added a comment -

          Committing shortly.

          Show
          steve_rowe Steve Rowe added a comment - Committing shortly.
          Hide
          steve_rowe Steve Rowe added a comment -

          Patch switching default sow to false.

          All Solr tests pass, and precommit passes.

          I think it's ready to go, but I'll wait a few days before committing in case there are objections.

          Two behavior changes result from this switch, as illustrated by tests:

          1. When sow=false, autoGeneratePhraseQueries="true", and words are split (e.g. by WordDelimiterGraphFilter) but no overlapping terms are produced, phrase queries are not produced - see LUCENE-7799 for a possible eventual solution to this problem:

          TestSolrQueryParser.testPhrase()
          // "text" field's type has WordDelimiterGraphFilter (WDGFF) and autoGeneratePhraseQueries=true
          // should generate a phrase of "now cow" and match only one doc
          assertQ(req("q", "text:now-cow", "indent", "true", "sow","true")
              , "//*[@numFound='1']"
          );
          // When sow=false, autoGeneratePhraseQueries=true only works when a graph is produced
          // (i.e. overlapping terms, e.g. if WDGFF's preserveOriginal=1 or concatenateWords=1).
          // The WDGFF config on the "text" field doesn't produce a graph, so the generated query
          // is not a phrase query.  As a result, docs can match that don't match phrase query "now cow"
          assertQ(req("q", "text:now-cow", "indent", "true", "sow","false")
              , "//*[@numFound='2']"
          );
          assertQ(req("q", "text:now-cow", "indent", "true") // default sow=false
              , "//*[@numFound='2']"
          );
          

          2. sow=false changes the queries edismax produces over multiple fields when any of the fields’ query-time analysis differs from the other fields’, e.g. if one field’s analyzer removes stopwords when another field’s doesn’t. In this case, rather than a dismax-query-per-whitespace-separated-term (edismax’s behavior when sow=true), a dismax-query-per-field is produced. This can change results in general, but quite significantly when combined with the mm (min-should-match) request parameter: since min-should-match applies per field instead of per term, missing terms in one field’s analysis won’t disqualify docs from matching.

          TestExtendedDismaxParser.testFocusQueryParser()
          assertQ(req("defType","edismax", "mm","100%", "q","Terminator: 100", "qf","movies_t foo_i", "sow","true"),
                  nor);
          // When sow=false, the per-field query structures differ (no "Terminator" query on integer field foo_i),
          // so a dismax-per-field is constructed.  As a result, mm=100% is applied per-field instead of per-term;
          // since there is only one term (100) required in the foo_i field's dismax, the query can match docs that
          // only have the 100 term in the foo_i field, and don't necessarily have "Terminator" in any field.
          assertQ(req("defType","edismax", "mm","100%", "q","Terminator: 100", "qf","movies_t foo_i", "sow","false"),
                  oner);
          assertQ(req("defType","edismax", "mm","100%", "q","Terminator: 100", "qf","movies_t foo_i"), // default sow=false
              oner);
          
          Show
          steve_rowe Steve Rowe added a comment - Patch switching default sow to false . All Solr tests pass, and precommit passes. I think it's ready to go, but I'll wait a few days before committing in case there are objections. Two behavior changes result from this switch, as illustrated by tests: 1. When sow=false , autoGeneratePhraseQueries="true" , and words are split (e.g. by WordDelimiterGraphFilter) but no overlapping terms are produced, phrase queries are not produced - see LUCENE-7799 for a possible eventual solution to this problem: TestSolrQueryParser.testPhrase() // "text" field's type has WordDelimiterGraphFilter (WDGFF) and autoGeneratePhraseQueries= true // should generate a phrase of "now cow" and match only one doc assertQ(req( "q" , "text:now-cow" , "indent" , " true " , "sow" , " true " ) , " //*[@numFound='1']" ); // When sow= false , autoGeneratePhraseQueries= true only works when a graph is produced // (i.e. overlapping terms, e.g. if WDGFF's preserveOriginal=1 or concatenateWords=1). // The WDGFF config on the "text" field doesn't produce a graph, so the generated query // is not a phrase query. As a result, docs can match that don't match phrase query "now cow" assertQ(req( "q" , "text:now-cow" , "indent" , " true " , "sow" , " false " ) , " //*[@numFound='2']" ); assertQ(req( "q" , "text:now-cow" , "indent" , " true " ) // default sow= false , " //*[@numFound='2']" ); 2. sow=false changes the queries edismax produces over multiple fields when any of the fields’ query-time analysis differs from the other fields’, e.g. if one field’s analyzer removes stopwords when another field’s doesn’t. In this case, rather than a dismax-query-per-whitespace-separated-term (edismax’s behavior when sow=true ), a dismax-query-per-field is produced. This can change results in general, but quite significantly when combined with the mm (min-should-match) request parameter: since min-should-match applies per field instead of per term, missing terms in one field’s analysis won’t disqualify docs from matching. TestExtendedDismaxParser.testFocusQueryParser() assertQ(req( "defType" , "edismax" , "mm" , "100%" , "q" , "Terminator: 100" , "qf" , "movies_t foo_i" , "sow" , " true " ), nor); // When sow= false , the per-field query structures differ (no "Terminator" query on integer field foo_i), // so a dismax-per-field is constructed. As a result, mm=100% is applied per-field instead of per-term; // since there is only one term (100) required in the foo_i field's dismax, the query can match docs that // only have the 100 term in the foo_i field, and don't necessarily have "Terminator" in any field. assertQ(req( "defType" , "edismax" , "mm" , "100%" , "q" , "Terminator: 100" , "qf" , "movies_t foo_i" , "sow" , " false " ), oner); assertQ(req( "defType" , "edismax" , "mm" , "100%" , "q" , "Terminator: 100" , "qf" , "movies_t foo_i" ), // default sow= false oner);
          Hide
          steve_rowe Steve Rowe added a comment - - edited

          I'd like to understand your point of view here better; I'm sure you have some insight. "sow" is a query parameter.... in my view, why then wouldn't an auto-phrase-whatever also be?

          Because autoGeneratePhraseQueries is not now a query param, and the goal is to preserve its current behavior when sow=false.

          David Smiley pointed out on #solr-dev IRC:

          [I]n Solr, query parameters can be per field

          f.myfield.autoGeneratePhraseQueries

          Which directly (and correctly) contradicts my assertion ("another step back") that moving the configuration location from schema to request param would disallow per-field configuration. And I agree, autoGeneratePhraseQueries could be converted into a (optionally per-field) request param, but the current implementation is not like that, and SOLR-10357 is about fixing the current incompatibility of autoGeneratePhraseQueries="true" with sow=false, so I'd rather keep the focus where the biggest problem is, and defer the request param vs. schema discussion for later.

          Show
          steve_rowe Steve Rowe added a comment - - edited I'd like to understand your point of view here better; I'm sure you have some insight. "sow" is a query parameter.... in my view, why then wouldn't an auto-phrase-whatever also be? Because autoGeneratePhraseQueries is not now a query param, and the goal is to preserve its current behavior when sow=false . David Smiley pointed out on #solr-dev IRC: [I] n Solr, query parameters can be per field f.myfield.autoGeneratePhraseQueries Which directly (and correctly) contradicts my assertion ("another step back") that moving the configuration location from schema to request param would disallow per-field configuration. And I agree, autoGeneratePhraseQueries could be converted into a (optionally per-field) request param, but the current implementation is not like that, and SOLR-10357 is about fixing the current incompatibility of autoGeneratePhraseQueries="true" with sow=false , so I'd rather keep the focus where the biggest problem is, and defer the request param vs. schema discussion for later.
          Hide
          dsmiley David Smiley added a comment -

          Putting it (or its graph-based replacement, i.e. #3) somewhere else would be another step back.

          I'd like to understand your point of view here better; I'm sure you have some insight. "sow" is a query parameter.... in my view, why then wouldn't an auto-phrase-whatever also be?

          Show
          dsmiley David Smiley added a comment - Putting it (or its graph-based replacement, i.e. #3) somewhere else would be another step back. I'd like to understand your point of view here better; I'm sure you have some insight. "sow" is a query parameter.... in my view, why then wouldn't an auto-phrase-whatever also be?
          Hide
          steve_rowe Steve Rowe added a comment -

          I created SOLR-10357 to deal with the auto-phrase-gen situation using my #2 approach.

          Show
          steve_rowe Steve Rowe added a comment - I created SOLR-10357 to deal with the auto-phrase-gen situation using my #2 approach.
          Hide
          steve_rowe Steve Rowe added a comment -

          #2 and #3 are pretty similar and not even mutually exclusive . In #3 you're just saying give this option a more appropriate name I guess.

          #3 is essentially a back-compat break that would force users to change field type config away from autoGeneratePhraseQueries="true" to autoGenerateMultiTermSynonymsPhraseQuery="true" in order to invoke this option for graph-based queries (assuming sow=false and graph-aware token filters).

          #3 makes sense but I wonder if the FieldType is the right place for these query building options; shouldn't it be request params?

          autoGeneratePhraseQueries is already a FieldType option. Putting it (or its graph-based replacement, i.e. #3) somewhere else would be another step back.

          Show
          steve_rowe Steve Rowe added a comment - #2 and #3 are pretty similar and not even mutually exclusive . In #3 you're just saying give this option a more appropriate name I guess. #3 is essentially a back-compat break that would force users to change field type config away from autoGeneratePhraseQueries="true" to autoGenerateMultiTermSynonymsPhraseQuery="true" in order to invoke this option for graph-based queries (assuming sow=false and graph-aware token filters). #3 makes sense but I wonder if the FieldType is the right place for these query building options; shouldn't it be request params? autoGeneratePhraseQueries is already a FieldType option. Putting it (or its graph-based replacement, i.e. #3) somewhere else would be another step back.
          Hide
          dsmiley David Smiley added a comment -

          I'm glad to hear this seems to be just about configuration – that the underlying desire for both sow=false and autoGenerateMultiTermSynonymsPhraseQuery is there and should work. This should definitely get exposed!

          #2 and #3 are pretty similar and not even mutually exclusive . In #3 you're just saying give this option a more appropriate name I guess. #3 makes sense but I wonder if the FieldType is the right place for these query building options; shouldn't it be request params?

          Show
          dsmiley David Smiley added a comment - I'm glad to hear this seems to be just about configuration – that the underlying desire for both sow=false and autoGenerateMultiTermSynonymsPhraseQuery is there and should work. This should definitely get exposed! #2 and #3 are pretty similar and not even mutually exclusive . In #3 you're just saying give this option a more appropriate name I guess. #3 makes sense but I wonder if the FieldType is the right place for these query building options; shouldn't it be request params?
          Hide
          steve_rowe Steve Rowe added a comment - - edited

          QueryBuilder.autoGenerateMultiTermSynonymsPhraseQuery, introduced in LUCENE-7638, is the graph version of autoGeneratePhraseQueries.

          I guess there are three possible paths here:

          1. As hinted in SOLR-10348: avoid dealing with sow=false/auto-gen-phrase=true incompatibility by leaving the exception in place and dropping autoGeneratePhraseQueries="true" from example schemas.
          2. Re-interpret Solr's autoGeneratePhraseQueries per-fieldtype option as setting QueryBuilder.autoGenerateMultiTermSynonymsPhraseQuery when sow=false.
          3. Introduce a new (per-fieldtype?) option autoGenerateMultiTermSynonymsPhraseQuery.

          I agree, David Smiley, that my #1 above is a step back. (Though Robert Muir has claimed otherwise: https://issues.apache.org/jira/browse/LUCENE-2458?focusedCommentId=12866557&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12866557)

          I like #2 best. (Should be done on its own issue though.)

          Thoughts?

          Show
          steve_rowe Steve Rowe added a comment - - edited QueryBuilder.autoGenerateMultiTermSynonymsPhraseQuery , introduced in LUCENE-7638 , is the graph version of autoGeneratePhraseQueries . I guess there are three possible paths here: As hinted in SOLR-10348 : avoid dealing with sow=false/auto-gen-phrase=true incompatibility by leaving the exception in place and dropping autoGeneratePhraseQueries="true" from example schemas. Re-interpret Solr's autoGeneratePhraseQueries per-fieldtype option as setting QueryBuilder.autoGenerateMultiTermSynonymsPhraseQuery when sow=false . Introduce a new (per-fieldtype?) option autoGenerateMultiTermSynonymsPhraseQuery . I agree, David Smiley , that my #1 above is a step back. (Though Robert Muir has claimed otherwise: https://issues.apache.org/jira/browse/LUCENE-2458?focusedCommentId=12866557&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12866557 ) I like #2 best. (Should be done on its own issue though.) Thoughts?
          Hide
          dsmiley David Smiley added a comment -

          If this will also lose the semantics of autoGeneratePhraseQueries (as hinted in SOLR-10348 it will), I think this is a step back – at least for English/western languages, most especially (where autoGeneratePhraseQueries is recommended). The technical difficulties that led to the incompatibility of "sow=false" and autoGeneratePhraseQueries were based on Lucene being unable to treat token streams as graphs. Of course as you know, this was recently solved.

          Show
          dsmiley David Smiley added a comment - If this will also lose the semantics of autoGeneratePhraseQueries (as hinted in SOLR-10348 it will), I think this is a step back – at least for English/western languages, most especially (where autoGeneratePhraseQueries is recommended). The technical difficulties that led to the incompatibility of "sow=false" and autoGeneratePhraseQueries were based on Lucene being unable to treat token streams as graphs. Of course as you know, this was recently solved.

            People

            • Assignee:
              steve_rowe Steve Rowe
              Reporter:
              steve_rowe Steve Rowe
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development