Solr / SOLR-11662

Make overlapping query term scoring configurable per field type

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.2, master (8.0)
    • Component/s: None
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None
    • Flags:
      Patch

      Description

      This patch makes the query-time behavior configurable when query terms overlap positions. Right now the only option is SynonymQuery, which is a fantastic default and an improvement on past versions. However, there are use cases where terms overlap positions but don't carry exact synonymy relationships. Often synonym filters (or other analyzers) are actually used to model hypernym/hyponym relationships. In those cases the individual term scores matter, with terms of higher specificity (hyponyms) scoring higher than terms of lower specificity (hypernyms).

      This patch adds the fieldType setting scoreOverlaps, as in:

        <fieldType name="text_general"  scoreOverlaps="pick_best"  class="solr.TextField" positionIncrementGap="100" multiValued="true">
      
      

      Valid values for scoreOverlaps are:

      as_one_term
      The default, suited to most synonym use cases. Uses SynonymQuery.
      Treats all terms as if they're exactly equivalent, with the document frequency blended from the underlying terms.

      pick_best
      For a given document, scores using the best-scoring synonym (i.e. a dismax over the generated terms).
      Useful when synonyms are not exactly equivalent and are instead used to model hypernym/hyponym relationships, so that each term's own score reflects its specificity.
      For example, with this query-time expansion:

      tabby => tabby, cat, animal

      searching the text field for "tabby" generates the dismax query (text:tabby | text:cat | text:animal)

      as_distinct_terms
      (The pre-6.0 behavior.)
      A compromise between pick_best and as_one_term.
      Appropriate when synonyms reflect a hypernym/hyponym relationship but scores should stack: the more occurrences of tabby, cat, or animal a document has, the better, with a bias towards the term with the highest specificity.
      Terms are turned into a boolean OR query, with document frequencies not blended.
      For example, with this query-time expansion:

      tabby => tabby, cat, animal

      searching the text field for "tabby" generates the boolean query (text:tabby text:cat text:animal)
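
To make the difference between the three values concrete, here is a minimal toy sketch in plain Java. It is not Lucene code: the `termScore` formula is a made-up tf*idf stand-in, the class and method names are hypothetical, and the way as_one_term blends statistics (summed term frequency, largest document frequency) is an assumption used only for illustration. What it does reflect from the description above is that pick_best takes the maximum of the per-term scores while as_distinct_terms adds them up.

```java
/** Toy model of the three overlap-scoring styles. Illustrative only; this is not Lucene code. */
class OverlapScoringSketch {

    enum Style { AS_ONE_TERM, PICK_BEST, AS_DISTINCT_TERMS }

    /** Made-up tf*idf stand-in: rarer terms (lower docFreq) weigh more. */
    static double termScore(int termFreq, int docFreq, int numDocs) {
        if (termFreq == 0) {
            return 0.0;
        }
        return termFreq * Math.log(1.0 + (double) numDocs / docFreq);
    }

    /** Combines per-term scores for one document according to the chosen style. */
    static double combine(Style style, int[] termFreqs, int[] docFreqs, int numDocs) {
        switch (style) {
            case PICK_BEST: {
                // dismax with no tie-breaker: only the best-scoring term counts
                double best = 0.0;
                for (int i = 0; i < termFreqs.length; i++) {
                    best = Math.max(best, termScore(termFreqs[i], docFreqs[i], numDocs));
                }
                return best;
            }
            case AS_DISTINCT_TERMS: {
                // boolean OR: the scores of all matching terms stack
                double sum = 0.0;
                for (int i = 0; i < termFreqs.length; i++) {
                    sum += termScore(termFreqs[i], docFreqs[i], numDocs);
                }
                return sum;
            }
            case AS_ONE_TERM: {
                // ASSUMPTION: treat the terms as one, summing tf and blending df as the max
                int totalTf = 0;
                int blendedDf = 1;
                for (int i = 0; i < termFreqs.length; i++) {
                    totalTf += termFreqs[i];
                    blendedDf = Math.max(blendedDf, docFreqs[i]);
                }
                return termScore(totalTf, blendedDf, numDocs);
            }
            default:
                throw new AssertionError("unknown style: " + style);
        }
    }

    public static void main(String[] args) {
        // tabby, cat, animal: each occurs once in the document, with rising document frequency
        int[] tf = {1, 1, 1};
        int[] df = {10, 100, 500};
        for (Style s : Style.values()) {
            System.out.println(s + " -> " + combine(s, tf, df, 1000));
        }
    }
}
```

In this toy setup the rarest term (tabby) dominates under pick_best, while as_distinct_terms rewards a document for matching all three terms.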

          Activity

          githubbot ASF GitHub Bot added a comment -

          GitHub user softwaredoug opened a pull request:

          https://github.com/apache/lucene-solr/pull/275

          SOLR-11662: Configurable query when terms overlap

          Modifies QueryBuilder and Solr Field Type to allow configurable overlap scoring aside from SynonymQuery

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/o19s/lucene-solr configurable-synonym-query-behavior

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/lucene-solr/pull/275.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #275


          commit d83b1300a8b469fa19e5fd9ae8264f6fa448bb18
          Author: Doug Turnbull <softwaredoug@gmail.com>
          Date: 2017-11-21T19:02:09Z

          Makes QueryBuilder synonym matching configurable

          commit f279435b46f81232181a658be5e856bdbca9924f
          Author: Doug Turnbull <softwaredoug@gmail.com>
          Date: 2017-11-21T20:03:54Z

          plumb through the field type setting

          commit 1e9e41c4cccff10effd4a29da30c378ee21dac3d
          Author: Doug Turnbull <softwaredoug@gmail.com>
          Date: 2017-11-21T20:17:56Z

          Fix enum style

          commit 8eb875fcccf533d3799b15d266c724c868e13d34
          Author: Doug Turnbull <softwaredoug@gmail.com>
          Date: 2017-11-21T21:47:04Z

          Renaming to scoreOverlaps


          softwaredoug Doug Turnbull added a comment -

          Associated pull request: https://github.com/apache/lucene-solr/pull/275/files
          Patch: https://patch-diff.githubusercontent.com/raw/apache/lucene-solr/pull/275.patch

          jpountz Adrien Grand added a comment -

          Can we have this SynonymQuery vs. dismax vs. BooleanQuery logic in a sub-class of QueryBuilder rather than QueryBuilder itself? The reason is that I don't think most users would need to customize this behaviour, and moving it to a separate class would help keep QueryBuilder simple.

          softwaredoug Doug Turnbull added a comment -

          Thanks Adrien! Yes, it could be moved to SolrQueryParser. This would narrow the scope to just Solr, however. I would like to see this capability in Elasticsearch as well. Though that could be handled differently.

          jpountz Adrien Grand added a comment -

          I think it's fine that Solr and Elasticsearch end up duplicating the functionality if it keeps Lucene simpler.

          softwaredoug Doug Turnbull added a comment -

          Great! And that would actually let me submit an ES patch in parallel... I'll update my PR/patch

          softwaredoug Doug Turnbull added a comment - - edited

          PR updated with the code moved to the Solr level; the patch can be viewed here: https://github.com/apache/lucene-solr/pull/275.patch

          githubbot ASF GitHub Bot added a comment -

          Github user dsmiley commented on a diff in the pull request:

          https://github.com/apache/lucene-solr/pull/275#discussion_r153109215

          — Diff: solr/core/src/java/org/apache/solr/parser/SolrQueryParserBase.java —
          @@ -539,6 +564,25 @@ protected Query newRegexpQuery(Term regexp)

          { return query; }

          + @Override
          + protected Query newSynonymQuery(Term terms[]) {
          + if (scoreOverlaps == ScoreOverlaps.PICK_BEST) {
          — End diff –

          Some nitpicks here. I think a switch/case statement would better reflect that this code reacts to all possibilities of scoreOverlaps. Secondly, the `new ArrayList<Query>()` could be `new ArrayList<>(terms.length)`.

          githubbot ASF GitHub Bot added a comment -

          Github user dsmiley commented on a diff in the pull request:

          https://github.com/apache/lucene-solr/pull/275#discussion_r153109956

          — Diff: solr/core/src/test/org/apache/solr/search/TestExtendedDismaxParser.java —
          @@ -1794,6 +1798,37 @@ public void testOperatorsAndMultiWordSynonyms() throws Exception

          { ); }

          + public void testOverlapTermScoringQueries() throws Exception {
          — End diff –

          This new functionality should apply to the lucene QueryParser... wouldn't a test be better targeted there instead of edismax?

          githubbot ASF GitHub Bot added a comment -

          Github user dsmiley commented on a diff in the pull request:

          https://github.com/apache/lucene-solr/pull/275#discussion_r153109462

          — Diff: solr/core/src/java/org/apache/solr/parser/SolrQueryParserBase.java —
          @@ -330,6 +339,19 @@ public void setAllowSubQueryParsing(boolean allowSubQueryParsing)

          { this.allowSubQueryParsing = allowSubQueryParsing; }

          + /**
          + * Set how overlapping query terms should be scored, as if they're the same term,
          — End diff –

          I think some reference to "synonyms" here would be helpful to people understanding, even if this applies to cases that aren't necessarily synonyms in the strict sense. For example after "overlapping query terms" add a parenthetical: "(e.g. synonyms)"

          Heck, maybe we should call this `SynonymQueryStyle`? After all, we're overriding `newSynonymQuery` to do the work, thus Lucene has picked the name for us.

          githubbot ASF GitHub Bot added a comment -

          Github user dsmiley commented on a diff in the pull request:

          https://github.com/apache/lucene-solr/pull/275#discussion_r153110109

          — Diff: solr/core/src/java/org/apache/solr/schema/FieldType.java —
          @@ -905,6 +905,7 @@ protected void checkSupportsDocValues() {
          protected static final String ENABLE_GRAPH_QUERIES = "enableGraphQueries";
          private static final String ARGS = "args";
          private static final String POSITION_INCREMENT_GAP = "positionIncrementGap";
          + protected static final String SCORE_OVERLAPS = "scoreOverlaps";
          — End diff –

          Perhaps this ought to be a new parameter instead so that it's easier to toggle? I suspect you've thought of this already and I'm curious about your rationale.

          githubbot ASF GitHub Bot added a comment -

          Github user dsmiley commented on a diff in the pull request:

          https://github.com/apache/lucene-solr/pull/275#discussion_r153109304

          — Diff: solr/core/src/java/org/apache/solr/parser/SolrQueryParserBase.java —
          @@ -78,6 +79,14 @@
          static final int MOD_NOT = 10;
          static final int MOD_REQ = 11;

          + protected ScoreOverlaps scoreOverlaps = ScoreOverlaps.AS_SAME_TERM;
          +
          + public static enum ScoreOverlaps {
          — End diff –

          The docs on these should be actual javadocs with `{@link classname}` for the implementation classes.

          githubbot ASF GitHub Bot added a comment -

          Github user dsmiley commented on a diff in the pull request:

          https://github.com/apache/lucene-solr/pull/275#discussion_r153109827

          — Diff: solr/core/src/test/org/apache/solr/search/TestExtendedDismaxParser.java —
          @@ -1794,6 +1798,37 @@ public void testOperatorsAndMultiWordSynonyms() throws Exception

          { ); }

          + public void testOverlapTermScoringQueries() throws Exception {
          + ModifiableSolrParams edismaxParams = new ModifiableSolrParams();
          + edismaxParams.add("qf", "t_pick_best_foo");
          +
          + QParser qParser = QParser.getParser("tabby", "edismax", req(edismaxParams));
          + Query q = qParser.getQuery();
          + assertEquals("+((t_pick_best_foo:tabbi | t_pick_best_foo:cat | t_pick_best_foo:felin | t_pick_best_foo:anim))", q.toString());
          +
          + edismaxParams = new ModifiableSolrParams();
          — End diff –

          Solr tests have a `params()` method which is much more concise and doesn't pollute the variable namespace unnecessarily

          githubbot ASF GitHub Bot added a comment -

          Github user softwaredoug commented on a diff in the pull request:

          https://github.com/apache/lucene-solr/pull/275#discussion_r153369232

          — Diff: solr/core/src/java/org/apache/solr/schema/FieldType.java —
          @@ -905,6 +905,7 @@ protected void checkSupportsDocValues() {
          protected static final String ENABLE_GRAPH_QUERIES = "enableGraphQueries";
          private static final String ARGS = "args";
          private static final String POSITION_INCREMENT_GAP = "positionIncrementGap";
          + protected static final String SCORE_OVERLAPS = "scoreOverlaps";
          — End diff –

          I have been thinking a lot about this!

          Solr currently exposes per-field query configuration as a fieldType param, not at query time (see [autoGeneratePhraseQueries and enableGraphQueries](https://lucene.apache.org/solr/guide/6_6/field-type-definitions-and-properties.html#general-properties)). Solr doesn't yet have a way to pass per-field configuration at query time (my email about multiple analyzers proposes one system for doing this).

          To do the latter, ideally you'd have an API that lets you see multiple views/configs of the same field, such as the following, which would search two query-time versions of the actor field:

          `q=action movies&qf=actor_syn actor_nosyn^10 title text&defType=edismax&qf.actor_nosyn.field=actor&qf.actor_nosyn.analyzer=without_synonyms&qf.actor_syn.field=actor&qf.actor_syn.analyzer=with_synonyms&qf.actor_syn&scoreOverlaps=pick_best`

          I think this sort of syntax could be extremely powerful, and it would also make it possible to configure multiple query-time analyzers. But that's a bridge too far for this PR...

          githubbot ASF GitHub Bot added a comment -

          Github user softwaredoug commented on a diff in the pull request:

          https://github.com/apache/lucene-solr/pull/275#discussion_r153372491

          — Diff: solr/core/src/test/org/apache/solr/search/TestExtendedDismaxParser.java —
          @@ -1794,6 +1798,37 @@ public void testOperatorsAndMultiWordSynonyms() throws Exception

          { ); }

          + public void testOverlapTermScoringQueries() throws Exception {
          — End diff –

          It could go in either place; I put it here following the work that added autoGeneratePhraseQueries.

          githubbot ASF GitHub Bot added a comment -

          Github user dsmiley commented on a diff in the pull request:

          https://github.com/apache/lucene-solr/pull/275#discussion_r153387514

          — Diff: solr/core/src/java/org/apache/solr/schema/FieldType.java —
          @@ -905,6 +905,7 @@ protected void checkSupportsDocValues() {
          protected static final String ENABLE_GRAPH_QUERIES = "enableGraphQueries";
          private static final String ARGS = "args";
          private static final String POSITION_INCREMENT_GAP = "positionIncrementGap";
          + protected static final String SCORE_OVERLAPS = "scoreOverlaps";
          — End diff –

          I need to correct you on one point: Solr has had a syntax for per-field query parameters for a long time. The syntax is `f.fieldName.parameterName`, e.g. `f.title.hl.snippets`. SolrJ's SolrParams has convenience methods for this on the implementation side. Perhaps you overlooked this because most users only use it in the context of faceting parameters, even though it's certainly not unique to faceting (as in the example above for highlighting). I'm not aware of any query parser that uses it yet, but they certainly could.

          Anyway, I suppose even if we agree we'd like some query-time customizability of this (and other settings), it would still be nice to establish a default fallback on the FieldType.
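
The lookup order described in this comment (a per-field `f.fieldName.parameterName` override, with a default on the FieldType as the fallback) can be sketched in plain Java as follows. This is only an illustration using a `Map` in place of Solr's request parameters; the class name, the `resolve` helper, and the use of `scoreOverlaps` as a request parameter are all hypothetical, since nothing in this thread says any query parser reads such a parameter.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy resolution of a per-field setting; a plain Map stands in for request parameters. */
class PerFieldParamSketch {

    /**
     * Looks up "f.<field>.<param>" first, then the plain "<param>",
     * and finally falls back to the default configured on the field type.
     */
    static String resolve(Map<String, String> requestParams, String field,
                          String param, String fieldTypeDefault) {
        String perField = requestParams.get("f." + field + "." + param);
        if (perField != null) {
            return perField;
        }
        String global = requestParams.get(param);
        return global != null ? global : fieldTypeDefault;
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        // HYPOTHETICAL request parameter, used only to illustrate the naming scheme
        params.put("f.title.scoreOverlaps", "pick_best");

        System.out.println(resolve(params, "title", "scoreOverlaps", "as_same_term"));
        System.out.println(resolve(params, "body", "scoreOverlaps", "as_same_term"));
    }
}
```

With this ordering the schema keeps a sensible default per field type, and a request can still override it for a single field.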

          githubbot ASF GitHub Bot added a comment -

          Github user dsmiley commented on a diff in the pull request:

          https://github.com/apache/lucene-solr/pull/275#discussion_r153388267

          — Diff: solr/core/src/test/org/apache/solr/search/TestExtendedDismaxParser.java —
          @@ -1794,6 +1798,37 @@ public void testOperatorsAndMultiWordSynonyms() throws Exception

          { ); }

          + public void testOverlapTermScoringQueries() throws Exception {
          — End diff –

          I see. Nonetheless I think it belongs in TestSolrQueryParser. I'd rather edismax tests stick to testing edismax and not LuceneQParser/SolrQueryParser stuff unless it's incidental.

          githubbot ASF GitHub Bot added a comment -

          Github user softwaredoug commented on the issue:

          https://github.com/apache/lucene-solr/pull/275

          Updated from your review @dsmiley, let me know what you think of the name change to synonymQueryStyle, specifically let me know [how this reads](https://github.com/o19s/lucene-solr/blob/configurable-synonym-query-behavior/solr/core/src/test-files/solr/collection1/conf/schema12.xml#L171). I think the name is better, but I wonder with "synonymQueryStyle" if we should call the values something else? I may be overthinking it

          githubbot ASF GitHub Bot added a comment -

          Github user dsmiley commented on a diff in the pull request:

          https://github.com/apache/lucene-solr/pull/275#discussion_r154456181

          — Diff: solr/core/src/test/org/apache/solr/search/TestSolrQueryParser.java —
          @@ -1057,7 +1057,35 @@ public void testShingleQueries() throws Exception

          { , "/response/numFound==1" ); }
          +
          +
          + public void testSynonymQueryStyle() throws Exception {
          + ModifiableSolrParams edismaxParams = params("qf", "t_pick_best_foo");
          +
          + QParser qParser = QParser.getParser("tabby", "edismax", req(edismaxParams));
          — End diff –

          Why not the default/lucene query parser? That's what TestSolrQueryParser tests.

          githubbot ASF GitHub Bot added a comment -

          Github user dsmiley commented on a diff in the pull request:

          https://github.com/apache/lucene-solr/pull/275#discussion_r154457666

          — Diff: solr/core/src/java/org/apache/solr/parser/SolrQueryParserBase.java —
          @@ -539,6 +591,27 @@ protected Query newRegexpQuery(Term regexp)

          { return query; }

          + @Override
          + protected Query newSynonymQuery(Term terms[]) {
          + switch (synonymQueryStyle) {
          + case PICK_BEST:
          + List<Query> currPosnClauses = new ArrayList<Query>(terms.length);
          + for (Term term : terms) {
          + currPosnClauses.add(newTermQuery(term));
          + }
          + DisjunctionMaxQuery dm = new DisjunctionMaxQuery(currPosnClauses, 0.0f);
          + return dm;
          + case AS_DISTINCT_TERMS:
          + BooleanQuery.Builder builder = new BooleanQuery.Builder();
          + for (Term term : terms) {
          + builder.add(newTermQuery(term), BooleanClause.Occur.SHOULD);
          + }

          + return builder.build();
          + default:
          — End diff –

          What I meant to say in my previous review here is that you would have a case statement for AS_SAME_TERM and then, to satisfy Java, add a default that throws an assertion error. This way we see all 3 enum values with their own case, which I think is easier to understand and maintain. Oh, are you doing this to handle "null"? Hmm. Maybe put the AS_SAME_TERM case immediately before your current "default"? Or prevent null in the first place? Either, I guess... nulls are unfortunate; I like to avoid them. Notice that TextField has primitives for some of its other settings; it'd be nice if we likewise had a non-null value for TextField.synonymQueryStyle.
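
The shape suggested in this review (one explicit case per enum value, plus a default that throws) might look roughly like the plain-Java sketch below. It builds illustrative query strings rather than real Lucene queries, so the class name, the `describe` method, and the `Synonym(...)` string form are assumptions made for the example. Note that a Java switch on a null enum reference throws a NullPointerException before any case, including the default, is evaluated, which is why the sketch rejects null up front.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Toy version of the switch layout discussed in the review; builds strings, not Lucene queries. */
class SynonymStyleSwitchSketch {

    enum SynonymQueryStyle { AS_SAME_TERM, PICK_BEST, AS_DISTINCT_TERMS }

    static String describe(SynonymQueryStyle style, String field, String... terms) {
        // A switch on null would throw before reaching any case, so fail early and clearly.
        Objects.requireNonNull(style, "synonymQueryStyle must not be null");

        List<String> clauses = new ArrayList<>(terms.length);
        for (String term : terms) {
            clauses.add(field + ":" + term);
        }

        switch (style) {
            case AS_SAME_TERM:
                // illustrative string form only
                return "Synonym(" + String.join(" ", clauses) + ")";
            case PICK_BEST:
                return "(" + String.join(" | ", clauses) + ")";
            case AS_DISTINCT_TERMS:
                return "(" + String.join(" ", clauses) + ")";
            default:
                // only reachable if a new enum value is added without a matching case
                throw new AssertionError("unrecognized synonymQueryStyle: " + style);
        }
    }

    public static void main(String[] args) {
        for (SynonymQueryStyle style : SynonymQueryStyle.values()) {
            System.out.println(style + " -> " + describe(style, "text", "tabby", "cat", "animal"));
        }
    }
}
```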

          githubbot ASF GitHub Bot added a comment -

          Github user dsmiley commented on a diff in the pull request:

          https://github.com/apache/lucene-solr/pull/275#discussion_r154458145

          — Diff: solr/core/src/java/org/apache/solr/parser/SolrQueryParserBase.java —
          @@ -78,6 +81,39 @@
          static final int MOD_NOT = 10;
          static final int MOD_REQ = 11;

          + protected SynonymQueryStyle synonymQueryStyle = AS_SAME_TERM;
          +
          + /**
          + * Query strategy when analyzed query terms overlap the same position (ie synonyms)
          + * consider if pants and khakis are query time synonyms
          + *
          + * <li> {@link #AS_SAME_TERM} </li>
          + * <li> {@link #PICK_BEST} </li>
          + * <li> {@link #AS_DISTINCT_TERMS} </li>
          + */
          + public static enum SynonymQueryStyle {
          — End diff –

          I like the new name, and thanks for improving the javadocs. BTW that "li" HTML list is missing the "<ul>" wrapper. Or better, IMO, simply drop this list; it has no value I think.

          githubbot ASF GitHub Bot added a comment -

          Github user dsmiley commented on a diff in the pull request:

          https://github.com/apache/lucene-solr/pull/275#discussion_r154456952

          — Diff: solr/core/src/test/org/apache/solr/search/TestSolrQueryParser.java —
          @@ -1057,7 +1057,35 @@ public void testShingleQueries() throws Exception {
          , "/response/numFound==1"
          );
          }
          +
          +
          + public void testSynonymQueryStyle() throws Exception {
          + ModifiableSolrParams edismaxParams = params("qf", "t_pick_best_foo");
          — End diff –

          Just a minor point here but you needn't have a SolrParams based variable; you could simply inline it at each invocation. This makes it easier to read each test request. If you were trying to share some common params across test invocations then I could understand.

          githubbot ASF GitHub Bot added a comment -

          Github user softwaredoug commented on a diff in the pull request:

          https://github.com/apache/lucene-solr/pull/275#discussion_r154483628

          — Diff: solr/core/src/java/org/apache/solr/parser/SolrQueryParserBase.java —
          @@ -539,6 +591,27 @@ protected Query newRegexpQuery(Term regexp) {
          return query;
          }

          + @Override
          + protected Query newSynonymQuery(Term terms[]) {
          + switch (synonymQueryStyle) {
          + case PICK_BEST:
          + List<Query> currPosnClauses = new ArrayList<Query>(terms.length);
          + for (Term term : terms) {
          + currPosnClauses.add(newTermQuery(term));
          + }
          + DisjunctionMaxQuery dm = new DisjunctionMaxQuery(currPosnClauses, 0.0f);
          + return dm;
          + case AS_DISTINCT_TERMS:
          + BooleanQuery.Builder builder = new BooleanQuery.Builder();
          + for (Term term : terms) {
          + builder.add(newTermQuery(term), BooleanClause.Occur.SHOULD);
          + }
          + return builder.build();
          + default:
          — End diff –

          I don't think synonymQueryStyle should ever be null (should default to AS_SAME_TERM)

          githubbot ASF GitHub Bot added a comment -

          Github user softwaredoug commented on a diff in the pull request:

          https://github.com/apache/lucene-solr/pull/275#discussion_r154483649

          — Diff: solr/core/src/test/org/apache/solr/search/TestSolrQueryParser.java —
          @@ -1057,7 +1057,35 @@ public void testShingleQueries() throws Exception {
          , "/response/numFound==1"
          );
          }
          +
          +
          + public void testSynonymQueryStyle() throws Exception {
          + ModifiableSolrParams edismaxParams = params("qf", "t_pick_best_foo");
          +
          + QParser qParser = QParser.getParser("tabby", "edismax", req(edismaxParams));
          — End diff –

          whoops, good catch

          githubbot ASF GitHub Bot added a comment -

          Github user softwaredoug commented on the issue:

          https://github.com/apache/lucene-solr/pull/275

          Ascii docs updated, though I was not able to build the docs locally. Thanks @dsmiley

          githubbot ASF GitHub Bot added a comment -

          Github user ctargett commented on a diff in the pull request:

          https://github.com/apache/lucene-solr/pull/275#discussion_r154534656

          — Diff: solr/solr-ref-guide/src/field-type-definitions-and-properties.adoc —
          @@ -87,6 +87,13 @@ For multivalued fields, specifies a distance between multiple values, which prev

          `autoGeneratePhraseQueries`:: For text fields. If `true`, Solr automatically generates phrase queries for adjacent terms. If `false`, terms must be enclosed in double-quotes to be treated as phrases.

          +`synonymQueryStyle`::
          +Query used to combine scores of overlapping query terms (ie synonyms). Consider a search for "blue tee" with query-time synonyms `tshirt,tee`.
          — End diff –

          Our convention is to use "i.e.," instead of just "ie".

          githubbot ASF GitHub Bot added a comment -

          Github user ctargett commented on a diff in the pull request:

          https://github.com/apache/lucene-solr/pull/275#discussion_r154534686

          — Diff: solr/solr-ref-guide/src/field-type-definitions-and-properties.adoc —
          @@ -87,6 +87,13 @@ For multivalued fields, specifies a distance between multiple values, which prev

          `autoGeneratePhraseQueries`:: For text fields. If `true`, Solr automatically generates phrase queries for adjacent terms. If `false`, terms must be enclosed in double-quotes to be treated as phrases.

          +`synonymQueryStyle`::
          +Query used to combine scores of overlapping query terms (ie synonyms). Consider a search for "blue tee" with query-time synonyms `tshirt,tee`.
          ++
          +Use `as_same_term` (default) to blend terms, ie `SynonymQuery(tshirt,tee)` where each term will be treated as equally important. Use `pick_best` to select the most significant synonym when scoring `Dismax(tee,tshirt)`. Use `as_distinct_terms` to bias scoring towards the most significant synonym `(pants OR slacks)`.
          ++
          +`as_same_term` is appropriatte when terms are true synonyms (television, tv). `pick_best` and `as_distinct_terms` are appropriatte when synonyms are expanding to hyponyms (q=jeans w/ jeans=>jeans,pants) and you want exact to come before parent and sibling concepts. See this http://opensourceconnections.com/blog/2017/11/21/solr-synonyms-mea-culpa/[blog article].
          — End diff –

          Is "appropriate" spelled wrong (with an extra 't')? It's done twice so I'm not sure if I'm perhaps misunderstanding the context.

          githubbot ASF GitHub Bot added a comment -

          Github user softwaredoug commented on a diff in the pull request:

          https://github.com/apache/lucene-solr/pull/275#discussion_r154540898

          — Diff: solr/solr-ref-guide/src/field-type-definitions-and-properties.adoc —
          @@ -87,6 +87,13 @@ For multivalued fields, specifies a distance between multiple values, which prev

          `autoGeneratePhraseQueries`:: For text fields. If `true`, Solr automatically generates phrase queries for adjacent terms. If `false`, terms must be enclosed in double-quotes to be treated as phrases.

          +`synonymQueryStyle`::
          +Query used to combine scores of overlapping query terms (ie synonyms). Consider a search for "blue tee" with query-time synonyms `tshirt,tee`.
          ++
          +Use `as_same_term` (default) to blend terms, ie `SynonymQuery(tshirt,tee)` where each term will be treated as equally important. Use `pick_best` to select the most significant synonym when scoring `Dismax(tee,tshirt)`. Use `as_distinct_terms` to bias scoring towards the most significant synonym `(pants OR slacks)`.
          ++
          +`as_same_term` is appropriatte when terms are true synonyms (television, tv). `pick_best` and `as_distinct_terms` are appropriatte when synonyms are expanding to hyponyms (q=jeans w/ jeans=>jeans,pants) and you want exact to come before parent and sibling concepts. See this http://opensourceconnections.com/blog/2017/11/21/solr-synonyms-mea-culpa/[blog article].
          — End diff –

          Thanks @ctargett, this is one of those words I consistently misspell. Github spellchecking failed me, so I brought it down and double checked/fixed the spelling.

          dsmiley David Smiley added a comment -

          Cool; I think this is ready to go, albeit with a couple of changes I noted while running tests & precommit.

          • calling toUpperCase requires Locale.ROOT. precommit caught this but so did simply running tests if you get "lucky" with an odd locale (I did).
          • => in asciidocs must be escaped with '\' (unless it's in a source block). precommit caught this.
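The first point can be illustrated with a small standalone sketch. The helper name `parse` and the class `LocaleSafeEnumParse` are hypothetical and not taken from the patch; the sketch only shows why upper-casing a config value with the default locale can break an enum lookup, using Turkish casing rules as the "odd locale".

```java
import java.util.Locale;

public class LocaleSafeEnumParse {
  enum SynonymQueryStyle { AS_SAME_TERM, PICK_BEST, AS_DISTINCT_TERMS }

  // Hypothetical helper: maps a schema attribute value such as
  // "pick_best" to the enum, independent of the JVM's default locale.
  static SynonymQueryStyle parse(String value) {
    return SynonymQueryStyle.valueOf(value.toUpperCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    Locale turkish = Locale.forLanguageTag("tr");
    // Under Turkish casing rules, 'i' upper-cases to the dotted capital I
    // (U+0130), so the result no longer equals the enum constant's name.
    String turkishUpper = "pick_best".toUpperCase(turkish);
    System.out.println(turkishUpper.equals("PICK_BEST"));
    // Locale.ROOT gives the locale-neutral mapping the enum lookup needs.
    System.out.println(parse("pick_best"));
  }
}
```

A no-argument `toUpperCase()` uses the JVM's default locale, which is why the failure only shows up on some machines or under randomized test locales.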
          jira-bot ASF subversion and git services added a comment -

          Commit 83753d0a2ae5bdd00649f43e355b5a43c6709917 in lucene-solr's branch refs/heads/master from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=83753d0 ]

          SOLR-11662: synonymQueryStyle option for FieldType used by query parser

          jira-bot ASF subversion and git services added a comment -

          Commit ee896ec6ac103220c311421147d290124ab3df74 in lucene-solr's branch refs/heads/branch_7x from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ee896ec ]

          SOLR-11662: synonymQueryStyle option for FieldType used by query parser

          (cherry picked from commit 83753d0)

          dsmiley David Smiley added a comment -

          Thanks Doug!

          BTW I have a question on the practical use of this option. In the docs you mention the default as_same_term is good for real synonyms and that the others are good for hyponyms. Let's say the synonyms file has a mix of both (typical). It seems impossible to use both since the QueryBuilder passes no context other than the terms to build the query. Do you recommend different analyzer chains, one with regular synonyms and another with hypernyms via perhaps SOLR-11698? Of course that'd be less efficient than one query with the right type of query per synonym clause; but that's elusive without some custom query parser that detects the types and handles it (not leveraging QueryBuilder as it's not hackable).

          softwaredoug Doug Turnbull added a comment -

          Thanks for helping with the change David!

          I would probably personally do something like that. However, I tend to restructure most synonyms into a taxonomy. Many people aren't aware of hypernymy/hyponymy. It's not uncommon to see a synonyms file at an e-commerce client, for example, with a line like `pants,khakis` and another line that's `pants,jeans`, which of course creates an unintentional equivalence between jeans and khakis. Even when these are mixed in with true synonyms, I tend to restructure the whole thing as a taxonomy.

          For example, some people avoid this at query time by expanding the query and expecting the "as_distinct_terms" behavior, which biases towards exact match:

          pants => jeans,pants,khakis
          jeans => jeans,pants
          khakis => jeans,khakis

          A search for pants here shows a mix of different kinds of pants (khakis and jeans roughly equal)
          A search for jeans puts jeans first (low doc freq), followed by various kinds of pants (high doc freq)
          A search for khakis puts khakis first, followed by various kinds of non-jean pants

          I tend to think of synonyms as hyponyms of a canonical name for an idea. So jeans for example, I might expand that to

          blue_jeans => blue_jeans,jeans,pants
          denim_jeans => denim_jeans,jeans,pants

          With multiple analyzer chains, I might recommend using each chain to control how loose the search is. For example, one could see forcing a strong boost for conceptually similar items. Or limiting the semantic expansion so that blue_jeans, for example, only expands up to the jeans level.

          There's quite a lot of "it depends". The example above presupposes that pants have a higher doc freq than jeans, which may not be the case without a similar index-time expansion.
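To make the doc-frequency caveat concrete, here is a toy sketch of how the two non-default styles combine per-term scores for a single document. It rests on simplifying assumptions: a matching term contributes only a BM25-style IDF weight (term frequency and length normalization are ignored), the doc freqs are invented, and the class and method names are hypothetical. `pickBest` takes the maximum, as a DisjunctionMaxQuery with a 0.0 tie-breaker does, and `asDistinctTerms` takes the sum, as a BooleanQuery of SHOULD clauses does.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class OverlapScoringSketch {
  // BM25-style IDF: rarer terms (lower docFreq) get a higher weight.
  static double weight(int docFreq, int numDocs) {
    return Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
  }

  // pick_best: score of the best-scoring matching term (dismax, tie = 0).
  static double pickBest(Map<String, Double> termWeights, Set<String> docTerms) {
    double best = 0.0;
    for (Map.Entry<String, Double> e : termWeights.entrySet()) {
      if (docTerms.contains(e.getKey())) {
        best = Math.max(best, e.getValue());
      }
    }
    return best;
  }

  // as_distinct_terms: sum of the scores of all matching terms.
  static double asDistinctTerms(Map<String, Double> termWeights, Set<String> docTerms) {
    double sum = 0.0;
    for (Map.Entry<String, Double> e : termWeights.entrySet()) {
      if (docTerms.contains(e.getKey())) {
        sum += e.getValue();
      }
    }
    return sum;
  }

  public static void main(String[] args) {
    int numDocs = 1000; // invented corpus size
    // Query "jeans" expanded via jeans => jeans,pants; doc freqs are invented.
    Map<String, Double> query = new LinkedHashMap<>();
    query.put("jeans", weight(50, numDocs));  // rarer, more specific term
    query.put("pants", weight(400, numDocs)); // common, more general term

    Set<String> jeansDoc = new HashSet<>(Arrays.asList("jeans"));
    Set<String> pantsDoc = new HashSet<>(Arrays.asList("pants"));
    Set<String> bothDoc = new HashSet<>(Arrays.asList("jeans", "pants"));

    System.out.println("pick_best         jeans-only: " + pickBest(query, jeansDoc));
    System.out.println("pick_best         pants-only: " + pickBest(query, pantsDoc));
    System.out.println("pick_best         both:       " + pickBest(query, bothDoc));
    System.out.println("as_distinct_terms both:       " + asDistinctTerms(query, bothDoc));
  }
}
```

Under these assumptions the jeans-only document outranks the pants-only one in either style, but only because "jeans" was given the lower doc freq. Swapping the two doc freqs reverses the order, which is the dependency on index statistics noted above.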


            People

            • Assignee:
              dsmiley David Smiley
              Reporter:
              softwaredoug Doug Turnbull
            • Votes:
              4
              Watchers:
              9
