[SOLR-1394] HTML stripper is splitting tokens - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.4
Fix Version/s: 1.4
Component/s: Schema and Analysis
Labels: None

Description

The Solr HTML stripper replaces any HTML it removes with whitespace. This keeps offsets correct for highlighting.

However, as was already pointed out in SOLR-42, this means that any token containing an HTML entity will be split into several tokens. That makes the HTML stripper completely unreliable for international text (and any text is potentially international).
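The splitting can be illustrated with a small standalone sketch. It does not use any Solr classes; the class `EntitySplitDemo`, its methods, and the tiny entity table are hypothetical, and it assumes the stripper emits the decoded character and pads the rest of the entity's source length with spaces:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Solr.
final class EntitySplitDemo {
    // Simplified: only named entities, and only the few listed below.
    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z]+);");
    private static final Map<String, Character> ENTITIES =
            Map.of("uuml", '\u00fc', "eacute", '\u00e9', "amp", '&');

    /** Decodes each known entity and pads with spaces so the output length equals the input length. */
    static String stripWithPadding(String input) {
        StringBuilder out = new StringBuilder();
        Matcher m = ENTITY.matcher(input);
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            last = m.end();
            Character decoded = ENTITIES.get(m.group(1));
            if (decoded == null) {
                out.append(m.group()); // unknown entity: keep it verbatim
                continue;
            }
            out.append(decoded);
            // Assumed padding behavior: one space per remaining source character.
            for (int i = 1; i < m.end() - m.start(); i++) {
                out.append(' ');
            }
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    /** Stand-in for a whitespace tokenizer. */
    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // The single word "M&uuml;ller" comes out as two tokens.
        System.out.println(tokenize(stripWithPadding("M&uuml;ller")));
    }
}
```

Under these assumptions the padded output has the same length as the input, so offsets line up, but the padding lands in the middle of the word and the tokenizer sees two tokens.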

The current code is actually deficient for BOTH highlighting and indexing, where the previous incarnation (that did not insert spaces) only had problems with highlighting.

The only workaround is to not use entities at all, which is impossible in some situations and inconvenient in most. If the client is required to transform entities before handing the text to Solr, it might as well be required to also strip tags, and then the HTML stripper would not be needed at all.

Today, we have a better solution that can be used: offset correction. We can then avoid inserting extra whitespace, but still get correct offsets. The attached patch implements just that.
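As a rough illustration of the offset-correction idea (this is not the attached patch), the sketch below strips markup without padding and records, for each removal, the output position and the cumulative number of characters dropped so far. The class `OffsetCorrectionSketch`, its methods, and the entity table are hypothetical, and the parsing is deliberately simplified to plain tags and a few named entities:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the code from the attached patch.
final class OffsetCorrectionSketch {
    // Simplified markup: any <...> tag, or a named entity captured in group 1.
    private static final Pattern MARKUP = Pattern.compile("<[^>]*>|&([A-Za-z]+);");
    private static final Map<String, Character> ENTITIES =
            Map.of("uuml", '\u00fc', "eacute", '\u00e9', "amp", '&');

    private final StringBuilder out = new StringBuilder();
    // Each entry is {outputOffset, cumulativeDiff}, in increasing outputOffset order.
    private final List<int[]> corrections = new ArrayList<>();

    OffsetCorrectionSketch(String input) {
        Matcher m = MARKUP.matcher(input);
        int last = 0;
        int diff = 0; // total source characters dropped so far
        while (m.find()) {
            out.append(input, last, m.start());
            last = m.end();
            int removed = m.end() - m.start();
            if (m.group(1) != null) {
                Character decoded = ENTITIES.get(m.group(1));
                if (decoded == null) {
                    out.append(m.group()); // unknown entity: keep verbatim, no correction
                    continue;
                }
                out.append(decoded);
                removed -= 1; // one output character stands in for the entity
            }
            diff += removed;
            corrections.add(new int[] {out.length(), diff});
        }
        out.append(input, last, input.length());
    }

    /** The stripped text, with no padding inserted. */
    String text() {
        return out.toString();
    }

    /** Maps an offset in the stripped text back to an offset in the original input. */
    int correctOffset(int outputOffset) {
        int diff = 0;
        for (int[] c : corrections) {
            if (c[0] > outputOffset) {
                break;
            }
            diff = c[1];
        }
        return outputOffset + diff;
    }

    public static void main(String[] args) {
        OffsetCorrectionSketch s = new OffsetCorrectionSketch("M&uuml;ller rocks");
        System.out.println(s.text());
        System.out.println(s.correctOffset(0) + ".." + s.correctOffset(6));
    }
}
```

With the input `M&uuml;ller rocks`, the stripped text is a single unsplit word followed by `rocks`, and the token at output offsets 0 to 6 maps back to input offsets 0 to 11, which covers the whole `M&uuml;ller` span. One simplification to be aware of: an end offset that falls exactly where a tag was removed maps to the position after that tag.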

Attachments

- hex-entity.patch (2 kB), 16/Oct/09 22:27, Anders Melchiorsen
- SOLR-1394.patch (15 kB), 12/Oct/09 18:00, Anders Melchiorsen
- SOLR-1394.patch (7 kB), 29/Aug/09 22:18, Anders Melchiorsen

People

Assignee: Unassigned

Reporter: Anders Melchiorsen

Votes: 2

Watchers: 1

Dates

Created: 29/Aug/09 14:25

Updated: 22/Sep/18 07:57

Resolved: 16/Oct/09 22:22