Solr / SOLR-2845

Adding extra highlighting term to a synonym


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.4
    • Fix Version/s: 3.4
    • Component/s: highlighter
    • Labels: None

    Description

      I noticed strange highlighting behaviour when highlighting a synonym term in the 3.4.0 release. The same setup works fine in 1.4.1. Using the Solr example core, here are the steps to reproduce the problem.

      1) In schema.xml, change the text_general fieldType definition to use the synonym filter at index time, and remove the filter from the query analyzer.

      <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
      
          <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="true"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
        <analyzer type="query">
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
          <!-- <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> -->
      
          <filter class="solr.LowerCaseFilterFactory"/>
      
        </analyzer>
      </fieldType>
      

      2) Define a new field 'test_field1'.

        <field name="test_field1" type="text_general" indexed="true" stored="true" multiValued="true"/>
      

      3) Copy this field to the 'text' field.

        <copyField source="test_field1" dest="text"/>
      

      4) In exampledocs/ipod_video.xml, add a new field to the doc.

        <field name="test_field1">Heart Failure</field>
      

      5) In solr/conf/index_synonyms.txt, add the following entry (all on one line).

      heart failure, failure\, heart, cardiac failure, cardiac insufficiency, failure heart, failure\, cardiac, heart failure (nos), insufficiency cardiac, insufficiency\, cardiac, hf - heart failure
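
      For readers unsure how many synonyms this line defines, here is a minimal Python sketch that splits it on unescaped commas. It is not Solr's own parser: the function name `split_synonym_line` is hypothetical, and the sketch assumes only the `\,` escape matters (it ignores `=>` mappings and other escapes).

```python
import re

def split_synonym_line(line):
    """Split a synonyms.txt-style line on commas that are not backslash-escaped.

    Simplified illustration only; this is not Solr's parser.
    """
    # A comma preceded by a backslash belongs to the term itself.
    parts = re.split(r"(?<!\\),", line)
    # Trim surrounding whitespace and turn "\," back into a plain comma.
    return [part.strip().replace("\\,", ",") for part in parts]

LINE = (
    r"heart failure, failure\, heart, cardiac failure, cardiac insufficiency, "
    r"failure heart, failure\, cardiac, heart failure (nos), insufficiency cardiac, "
    r"insufficiency\, cardiac, hf - heart failure"
)

entries = split_synonym_line(LINE)
for entry in entries:
    print(entry)
```

      Under these assumptions the line yields ten entries, three of which contain a literal comma, and the last of which contains the hyphen.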
      

      6) Reindex the exampledocs/*.xml files and request the following URL.

      http://localhost:8983/solr/select?q=heart&indent=on&hl=on&hl.fl=*

      This is what I get in the highlighting section of the response.

        <lst name="highlighting">
          <lst name="MA147LL/A">
            <arr name="test_field1">
              <str>&lt;em&gt;Heart&lt;/em&gt;&lt;em&gt;Heart Failure&lt;/em&gt;</str>
            </arr>
          </lst>
        </lst>
      

      The actual value of the field is Heart Failure, but the highlighter returns it as HeartHeart Failure.
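
      The corruption can be checked mechanically. The following Python sketch strips the highlight markers from a returned fragment and tests whether the remaining text still occurs in the stored value. The helper names are hypothetical, and it assumes the default `<em>`/`</em>` markers and a fragment copied from the XML response with its entities still escaped.

```python
import html

def strip_highlight_tags(fragment, pre="<em>", post="</em>"):
    """Unescape XML entities, then remove the highlight markers."""
    text = html.unescape(fragment)
    return text.replace(pre, "").replace(post, "")

def is_corrupted(fragment, stored_value):
    """A sound fragment, once stripped, must be a substring of the stored value."""
    return strip_highlight_tags(fragment) not in stored_value

# Fragment as it appears in the highlighting response above.
fragment = "&lt;em&gt;Heart&lt;/em&gt;&lt;em&gt;Heart Failure&lt;/em&gt;"
stripped = strip_highlight_tags(fragment)

print(stripped)                                 # HeartHeart Failure
print(is_corrupted(fragment, "Heart Failure"))  # True
```

      A correct highlight such as `<em>Heart</em> Failure` passes the same check, so it could serve as a regression test when comparing 1.4.1 and 3.4.0 output.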

      Apparently the synonym entries have something to do with the problem. The synonym terms above are a minimal extract from a longer line, reduced just far enough to still reproduce the problem. Notice the hyphen in the last term. If I remove the hyphen, highlighting works, even with the longer line of entries. Keeping the hyphen and removing insufficiency\, cardiac also works. So both the length of the line and the hyphen seem to be at play here.

      Using large and complicated synonym lists is very important to our application. The 3.4 release announced major improvements to the memory footprint and performance of the synonym filter. For this reason we are eager to move to 3.4.0, but this problem is a showstopper for us. I would appreciate any suggestions for a workaround or a quick fix.

      Regards,
      -Ajay

People

    • Assignee: Unassigned
    • Reporter: Ajay Kanduru (akanduru)
    • Votes: 1
    • Watchers: 2