[SOLR-42] Highlighting problems with HTMLStripWhitespaceTokenizerFactory - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.6, 4.0-ALPHA
Component/s: highlighter
Labels: None

Description

Indexing content that contains HTML markup causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).

Example title field:

40Ar/39Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia

When searching for title:fabrics with highlighting on, the highlight tags in the returned title are in the wrong place: 22 characters to the left of where they should be (i.e. the sum of the lengths of the HTML tags in the field).
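The 22-character shift can be reproduced outside Solr with a small sketch. It assumes the stored title wrapped the isotope numbers in <sup> tags (the markup is not visible in the title as shown above, so this is a guess that happens to match the reported figure), and it uses a deliberately naive regular expression in place of HTMLStripReader:

```python
import re

# Assumption: the stored title contained <sup> tags around "40" and "39".
# The actual markup is not shown in the issue text.
raw = ("<sup>40</sup>Ar/<sup>39</sup>Ar laserprobe dating of mylonitic "
       "fabrics in a polyorogenic terrane of NW Iberia")

# Naive tag matcher, standing in for the real HTML stripper.
TAG = re.compile(r"<[^>]+>")

# Phase 1: strip the markup. Phase 2 (tokenizing) would then see only this text.
stripped = TAG.sub("", raw)

# Total number of characters removed by stripping.
tag_chars = sum(len(m.group()) for m in TAG.finditer(raw))

# Offset of the search term in the stripped text vs. in the stored text.
start_in_stripped = stripped.index("fabrics")
start_in_raw = raw.index("fabrics")
shift = start_in_raw - start_in_stripped

# Applying the stripped-text offset to the stored text selects the wrong span.
wrong_span = raw[start_in_stripped:start_in_stripped + len("fabrics")]

print(tag_chars, shift, repr(wrong_span))
```

Under that assumption, both `tag_chars` and `shift` come out to 22: every tag that precedes the term moves the highlight further to the left.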

Response from Yonik on the solr-user mailing-list:

HTMLStripWhitespaceTokenizerFactory works in two phases...
HTMLStripReader removes the HTML and passes the result to
WhitespaceTokenizer... at that point, Tokens are generated, but the
offsets will correspond to the text after HTML removal, not before.

I did it this way so that HTMLStripReader could go before any
tokenizer (like StandardTokenizer).

Can you open a JIRA bug for this? The fix would be a special version
of HTMLStripReader integrated with a WhitespaceTokenizer to keep
offsets correct.
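One way to picture the proposed fix is a stripper that remembers, for every character it keeps, where that character sat in the original input, so that token offsets can be mapped back. The following is only a sketch of that idea, not the code from the attached patches: the function names `strip_with_offsets` and `whitespace_tokens` are hypothetical, the tag matcher is naive (no entities, comments or scripts), and the sample title again assumes <sup> markup.

```python
import re

TAG = re.compile(r"<[^>]+>")  # naive tag matcher, for illustration only


def strip_with_offsets(raw):
    """Return (stripped, offsets), where offsets[i] is the index in raw
    of the character stripped[i]."""
    kept, offsets = [], []
    pos = 0
    for m in TAG.finditer(raw):
        # Keep everything between the previous tag and this one.
        for i in range(pos, m.start()):
            kept.append(raw[i])
            offsets.append(i)
        pos = m.end()  # skip over the tag itself
    for i in range(pos, len(raw)):
        kept.append(raw[i])
        offsets.append(i)
    return "".join(kept), offsets


def whitespace_tokens(raw):
    """Split the stripped text on whitespace, but report each token's
    start and end offsets relative to the original (unstripped) input."""
    stripped, offsets = strip_with_offsets(raw)
    tokens = []
    for m in re.finditer(r"\S+", stripped):
        start = offsets[m.start()]
        end = offsets[m.end() - 1] + 1  # exclusive end in the original
        tokens.append((m.group(), start, end))
    return tokens


# Assumed markup; the issue text does not show the original tags.
raw = ("<sup>40</sup>Ar/<sup>39</sup>Ar laserprobe dating of mylonitic "
       "fabrics in a polyorogenic terrane of NW Iberia")
tokens = whitespace_tokens(raw)
fabrics = next(t for t in tokens if t[0] == "fabrics")
print(fabrics, raw[fabrics[1]:fabrics[2]])
```

With offsets expressed against the original input, a highlighter that slices the stored field value lands on the matching word. A token that spans markup, such as the first one here, still gets a span that includes the tags between its first and last characters.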

Attachments

htmlStripReaderTest.html, 04/Jan/08 23:31, 13 kB, Grant Ingersoll
HTMLStripReaderTest.java, 03/Jan/08 14:15, 2 kB, Grant Ingersoll
HtmlStripReaderTestXmlProcessing.patch, 19/Feb/08 17:49, 2 kB, Chris Harris
HtmlStripReaderTestXmlProcessing.patch, 14/Feb/08 18:37, 1 kB, Chris Harris
SOLR-42.patch, 09/Jan/08 12:50, 18 kB, Grant Ingersoll
SOLR-42.patch, 07/Jan/08 19:21, 3 kB, Grant Ingersoll
SOLR-42.patch, 05/Jan/08 13:55, 15 kB, Grant Ingersoll
SOLR-42.patch, 03/Jan/08 16:20, 5 kB, Grant Ingersoll
TokenPrinter.java, 14/Feb/08 18:37, 2 kB, Chris Harris

Issue Links

is duplicated by: SOLR-57 Highlighter does not work with HTML content that's passed through HTMLStrip*Tokenizer (Closed)
is superseded by: LUCENE-3690 JFlex-based HTMLStripCharFilter replacement (Closed)

People

Assignee: Steven Rowe
Reporter: Andrew May
Votes: 4
Watchers: 9

Dates

Created: 28/Jul/06 20:43
Updated: 10/May/13 10:41
Resolved: 24/Jan/12 15:55