Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-57

Highlighter does not work with HTML content that's passed through HTMLStrip*Tokenizer

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: search
    • Labels:
      None
    • Environment:

      Red Hat Linux 9, Tomcat 5.5.20

      Description

      I have a fieldtype with the following definition:
      <fieldtype name="htmltext" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
      <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
      <filter class="solr.StandardFilterFactory" />
      <filter class="solr.LowerCaseFilterFactory" />
      <filter class="solr.StopFilterFactory" />
      <filter class="solr.EnglishPorterFilterFactory" />
      <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
      <filter class="solr.ISOLatin1AccentFilterFactory" />
      </analyzer>
      </fieldtype>

      When fields with that definition are included in the list of fields to be highlighted, the highlighted term is always offset because it does not take into account the HTML tags before it, so you end up with something like this for the highlighted snipplet:

      Does your comptuer meet the <a href="http:/<em>/www.example</em>.com/system_requirements.shtml">minimum system requirements</a>?

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                Unassigned
                Reporter:
                hya Ho Yin Au
              • Votes:
                0 Vote for this issue
                Watchers:
                0 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: