Solr / SOLR-3390

Highlighting issue with multi-word synonyms causes to highlight the wrong terms

    Details

      Description

      I am using Solr 3.6, and when I have multi-word synonyms, the highlighting results have the wrong word highlighted.

      Suppose I have the following entry in the synonyms file:
      dns, domain name system

      and I index something like: "A sample dns entry explaining the details".

      Searching for "name" (without quotes), the highlight results/snippets I get are: "A sample dns <em>entry</em> explaining the details". (The token "entry" overlaps with the token "name" in analysis.jsp.)

      Searching for "system" (without quotes), the highlight results/snippets I get are: "A sample dns entry <em>explaining</em> the details". (The token "explaining" overlaps with the token "system" in analysis.jsp.)
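
To make the overlap concrete, here is a small stand-alone Java sketch. It is not Lucene code: `StackedSynonymDemo`, `analyze`, and `highlight` are hypothetical names, and it assumes (based only on the behavior described above) that each word of the expanded synonym is stacked on the next position and inherits the character offsets of whatever original word sits there.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class StackedSynonymDemo {
    // A token as a highlighter might see it: term, position, character offsets.
    static final class Token {
        final String term;
        final int position, start, end;
        Token(String term, int position, int start, int end) {
            this.term = term; this.position = position; this.start = start; this.end = end;
        }
    }

    // Whitespace-tokenize `text`; when a word equals `key`, stack the synonym
    // words on consecutive positions, borrowing the offsets of the original
    // word at each of those positions (assumed behavior, see lead-in).
    static List<Token> analyze(String text, String key, String[] synonymWords) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && text.charAt(i) == ' ') i++;
            int start = i;
            while (i < text.length() && text.charAt(i) != ' ') i++;
            if (i > start) spans.add(new int[] {start, i});
        }
        List<Token> tokens = new ArrayList<>();
        for (int pos = 0; pos < spans.size(); pos++) {
            int[] s = spans.get(pos);
            String word = text.substring(s[0], s[1]);
            tokens.add(new Token(word, pos, s[0], s[1]));
            if (word.equalsIgnoreCase(key)) {
                for (int j = 0; j < synonymWords.length; j++) {
                    // Past the end of the text, fall back to the key's own offsets.
                    int[] o = (pos + j < spans.size()) ? spans.get(pos + j) : s;
                    tokens.add(new Token(synonymWords[j], pos + j, o[0], o[1]));
                }
            }
        }
        return tokens;
    }

    // Wrap the offsets of every token matching `queryTerm` in <em> tags.
    static String highlight(String text, List<Token> tokens, String queryTerm) {
        TreeMap<Integer, Integer> hits = new TreeMap<>();
        for (Token t : tokens) {
            if (t.term.equalsIgnoreCase(queryTerm)) hits.put(t.start, t.end);
        }
        StringBuilder sb = new StringBuilder(text);
        // Insert from right to left so earlier offsets stay valid.
        for (Map.Entry<Integer, Integer> e : hits.descendingMap().entrySet()) {
            sb.insert(e.getValue(), "</em>");
            sb.insert(e.getKey(), "<em>");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "A sample dns entry explaining the details";
        List<Token> tokens = analyze(text, "dns", new String[] {"domain", "name", "system"});
        System.out.println(highlight(text, tokens, "name"));
        System.out.println(highlight(text, tokens, "system"));
    }
}
```

Under that assumption, "name" shares position and offsets with "entry", and "system" with "explaining", which reproduces the two snippets above.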

      Here is my schema field Type:
      <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
          <charFilter class="solr.HTMLStripCharFilterFactory"/>
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.PorterStemFilterFactory"/>
        </analyzer>
        <analyzer type="query">
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.PorterStemFilterFactory"/>
        </analyzer>
      </fieldType>

        Activity

        Jan Høydahl added a comment -

        This is due to how the multi-word synonym is inserted at the same position as the original term. We have no way to tell whether you matched the synonym or the original term, since that information is lost after analysis.

        This case would be solved by encoding term positions as a graph in such a way that the synonym node "domain name system" would occupy the same position as the original node "dns". This, however, would be a major change.

        Michael McCandless added a comment -

        This is a hard problem to solve (indexing a graph).

        We've made some recent baby steps towards solving it, though: token streams can now include the PositionLengthAttribute, indicating how many positions an "alternate path" spans. SynonymFilter now sets this attribute only in certain cases (when the inserted syn is a single token). Still, we then drop this attr during indexing...

        Handling the case when the inserted syn is multi-word is tricky... I think dns would have to be changed to have posLen=3.
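
A rough sketch of what such a graph could look like, in plain Java rather than Lucene's attribute API. Everything here is hypothetical (`PosLenGraphDemo`, the `Token` class, the hard-coded token list); it assumes "dns" gets posLen=3, the three synonym words form a parallel path of posLen=1 tokens, and every synonym token keeps the character offsets of "dns".

```java
import java.util.Arrays;
import java.util.List;

class PosLenGraphDemo {
    static final class Token {
        final String term;
        final int position;        // node this token starts from
        final int positionLength;  // number of positions it spans
        final int start, end;      // character offsets in the original text
        Token(String term, int position, int positionLength, int start, int end) {
            this.term = term; this.position = position;
            this.positionLength = positionLength; this.start = start; this.end = end;
        }
        int endNode() { return position + positionLength; }
    }

    static final String TEXT = "A sample dns entry explaining the details";

    // Hypothetical token graph: "dns" runs from node 2 to node 5, and the path
    // domain -> name -> system covers the same nodes one step at a time.
    static List<Token> graph() {
        return Arrays.asList(
            new Token("a", 0, 1, 0, 1),
            new Token("sample", 1, 1, 2, 8),
            new Token("dns", 2, 3, 9, 12),
            new Token("domain", 2, 1, 9, 12),
            new Token("name", 3, 1, 9, 12),
            new Token("system", 4, 1, 9, 12),
            new Token("entry", 5, 1, 13, 18),
            new Token("explaining", 6, 1, 19, 29),
            new Token("the", 7, 1, 30, 33),
            new Token("details", 8, 1, 34, 41));
    }

    // Highlight the first token whose term equals queryTerm, using its offsets.
    static String highlight(String queryTerm) {
        for (Token t : graph()) {
            if (t.term.equals(queryTerm)) {
                return TEXT.substring(0, t.start) + "<em>"
                    + TEXT.substring(t.start, t.end) + "</em>" + TEXT.substring(t.end);
            }
        }
        return TEXT;
    }

    public static void main(String[] args) {
        System.out.println(highlight("name"));
        System.out.println(highlight("entry"));
    }
}
```

In this model both paths end at the same node, "entry" keeps its own position after the synonym, and a hit on "name" or "system" maps back to the text of "dns" instead of a neighboring word.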

        Rahul Babulal added a comment -

        Thank you for the details.
        For now, I am setting the luceneMatchVersion to LUCENE_33. This sort of** fixes the highlighting issue. I am still testing to see if there are any other side effects of that. Do you guys know of any problems with setting the luceneMatchVersion to LUCENE_33?

        I will keep an eye on this issue.

        **The reason I say it sort of works is that when I search for "name" (in my case, with the synonym entry dns, domain name system), it matches the document which has "dns" in its index. That's because I have expand set to true.

        Okke Klein added a comment -

        Using multi-word synonyms works a lot better with LUCENE_33 because of the way SlowSynonymFilter handles them. Is there a way to get the same behavior with the new filter?

        Angelo Quaglia added a comment -

        There is also issue LUCENE-3668.
        Is <luceneMatchVersion>LUCENE_33</luceneMatchVersion> with Solr 3.6.1 an officially supported solution?
        Is that good for a production system?

        Jonathan Cummins added a comment -

        I think you can fix it by using a "custom" synonym filter factory, without setting the "luceneMatchVersion" to "LUCENE_33" in solrconfig.xml.

        You can just do something like:

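        package your.package.name;

        import java.util.Map;

        import org.apache.lucene.util.Version;
        import org.apache.solr.analysis.SynonymFilterFactory; // Solr 3.x package

        public class CustomSynonymFilterFactory extends SynonymFilterFactory {

          @Override
          public void init(Map<String,String> args) {
            // Pin the match version for this factory only, then let the
            // parent initialize itself with the usual arguments.
            this.setLuceneMatchVersion(Version.LUCENE_33);
            super.init(args);
          }

        }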

        And then, in your schema, you can do something like this:

        <filter class="your.package.name.CustomSynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>

        That will let it use the "SlowSynonymFilter" from Solr 3.3 for just the synonyms, without changing the luceneMatchVersion in solrconfig.xml. It basically works by "tricking" the SynonymFilterFactory class into thinking the Lucene version is 3.3 without it actually being 3.3.
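
As a toy illustration of that trick (not the real Solr classes): `MatchVersion` and `chooseSynonymFilter` below are made-up stand-ins, and the cut-over at 3.4 is my assumption about where the factory switches implementations.

```java
class VersionGateDemo {
    // Hypothetical stand-in for org.apache.lucene.util.Version.
    enum MatchVersion { LUCENE_33, LUCENE_34, LUCENE_36 }

    // Assumed gate: match versions from 3.4 on get the new filter,
    // earlier ones get the old SlowSynonymFilter.
    static String chooseSynonymFilter(MatchVersion v) {
        return v.compareTo(MatchVersion.LUCENE_34) >= 0 ? "SynonymFilter" : "SlowSynonymFilter";
    }

    public static void main(String[] args) {
        MatchVersion global = MatchVersion.LUCENE_36;       // what solrconfig.xml says
        MatchVersion perFactory = MatchVersion.LUCENE_33;   // what the custom factory pins
        System.out.println("other factories: " + chooseSynonymFilter(global));
        System.out.println("custom factory:  " + chooseSynonymFilter(perFactory));
    }
}
```

Only the synonym factory sees the older version, so the rest of the analysis chain keeps the behavior of the configured luceneMatchVersion.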

        Hope that helps out!


          People

          • Assignee: Unassigned
          • Reporter: Rahul Babulal
          • Votes: 3
          • Watchers: 8