Solr / SOLR-42

Highlighting problems with HTMLStripWhitespaceTokenizerFactory


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: highlighter
    • Labels: None

    Description

      Indexing content that contains HTML markup causes problems with highlighting when HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).

      Example title field:

      <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia

      When searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place: 22 characters to the left of where they should be (i.e. the sum of the lengths of the stripped tags).
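
The 22-character shift can be reproduced with a small standalone sketch. It is not Solr code: the class name OffsetShiftDemo and its helper methods are hypothetical, and the regex-based stripTags is only a simplistic stand-in for the real HTML stripping. It assumes the stripped tags are deleted outright, which is what the reported shift suggests.

```java
class OffsetShiftDemo {

    // Simplistic stand-in for HTML stripping (an assumption, not Solr's
    // implementation): deletes anything that looks like a tag.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    // Wraps the range [start, end) of text in <em> tags, roughly as a
    // highlighter would once it has been handed token offsets.
    static String highlight(String text, int start, int end) {
        return text.substring(0, start)
                + "<em>" + text.substring(start, end) + "</em>"
                + text.substring(end);
    }

    public static void main(String[] args) {
        String original = "<SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of "
                + "mylonitic fabrics in a polyorogenic terrane of NW Iberia";
        String stripped = stripTags(original);
        String term = "fabrics";

        // Offset of the term as a tokenizer working on stripped text sees it.
        int strippedStart = stripped.indexOf(term);
        // Offset of the term in the stored (unstripped) field value.
        int originalStart = original.indexOf(term);

        System.out.println("offset in stripped text: " + strippedStart);
        System.out.println("offset in original text: " + originalStart);
        System.out.println("shift: " + (originalStart - strippedStart));

        // Applying the stripped-text offsets to the original text puts the
        // <em> tags too far to the left.
        System.out.println(highlight(original, strippedStart,
                strippedStart + term.length()));
    }
}
```

The two <SUP> tags (5 characters each) and two </SUP> tags (6 characters each) account for the difference of 22 between the two offsets.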

      Response from Yonik on the solr-user mailing-list:

      HTMLStripWhitespaceTokenizerFactory works in two phases...
      HTMLStripReader removes the HTML and passes the result to
      WhitespaceTokenizer... at that point, Tokens are generated, but the
      offsets will correspond to the text after HTML removal, not before.

      I did it this way so that HTMLStripReader could go before any
      tokenizer (like StandardTokenizer).

      Can you open a JIRA bug for this? The fix would be a special version
      of HTMLStripReader integrated with a WhitespaceTokenizer to keep
      offsets correct.
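
The proposed fix, stripping integrated with whitespace tokenization so that offsets stay correct, might look roughly like the following. This is a minimal sketch under stated assumptions, not the patch attached to this issue: the names OffsetPreservingTokenizerSketch, Token and tokenize are hypothetical, it uses no Lucene or Solr classes, and it treats every '<' ... '>' run as a tag without handling entities, comments or scripts.

```java
import java.util.ArrayList;
import java.util.List;

class OffsetPreservingTokenizerSketch {

    // A token plus its [start, end) offsets in the ORIGINAL, unstripped text.
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Skips tags and splits on whitespace in a single pass, so every offset
    // is recorded against the original input rather than the stripped text.
    static List<Token> tokenize(String html) {
        List<Token> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int start = -1;
        int end = -1;
        boolean inTag = false;

        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {
                if (c == '>') {
                    inTag = false;   // end of tag; resume normal scanning
                }
                continue;            // tag characters never reach a token
            }
            if (c == '<') {
                inTag = true;
                continue;
            }
            if (Character.isWhitespace(c)) {
                if (current.length() > 0) {
                    tokens.add(new Token(current.toString(), start, end));
                    current.setLength(0);
                }
                continue;
            }
            if (current.length() == 0) {
                start = i;           // first visible character of the token
            }
            current.append(c);
            end = i + 1;             // one past the last visible character
        }
        if (current.length() > 0) {
            tokens.add(new Token(current.toString(), start, end));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String title = "<SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of "
                + "mylonitic fabrics in a polyorogenic terrane of NW Iberia";
        for (Token t : tokenize(title)) {
            if (t.text.equals("fabrics")) {
                // Offsets index into the original text, so the <em> tags
                // wrap the matched word itself.
                System.out.println(title.substring(0, t.start)
                        + "<em>" + title.substring(t.start, t.end) + "</em>"
                        + title.substring(t.end));
            }
        }
    }
}
```

One consequence of this design is that a token with markup inside it, such as 40Ar/39Ar in the example title, gets an offset range that spans the embedded tags, so a highlighter using these offsets would wrap the tags along with the text.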

      Attachments

        1. htmlStripReaderTest.html (13 kB, Grant Ingersoll)
        2. HTMLStripReaderTest.java (2 kB, Grant Ingersoll)
        3. HtmlStripReaderTestXmlProcessing.patch (2 kB, Chris Harris)
        4. HtmlStripReaderTestXmlProcessing.patch (1 kB, Chris Harris)
        5. SOLR-42.patch (18 kB, Grant Ingersoll)
        6. SOLR-42.patch (3 kB, Grant Ingersoll)
        7. SOLR-42.patch (15 kB, Grant Ingersoll)
        8. SOLR-42.patch (5 kB, Grant Ingersoll)
        9. TokenPrinter.java (2 kB, Chris Harris)

          People

            Assignee: Steven Rowe (sarowe)
            Reporter: Andrew May (amayingenta)
            Votes: 4
            Watchers: 9