[LUCENE-5734] HTMLStripCharFilter end offset should be left of closing tags - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: modules/analysis
Labels: None
Lucene Fields: New

Description

Consider this simple input:

<em>hello</em>

to be analyzed by HTMLStripCharFilter and WhitespaceTokenizer.
You get back one token for "hello". Good. The start offset of this token is at the position of 'h', which is also good. But the end offset, surprisingly, is one past the end of the adjacent </em>. I argue that it should be one past the last character of the token (immediately following 'o').

FYI, it behaves as I expect when hello is followed by a character entity, as in this example:

hello&nbsp;

The end offset immediately follows the 'o'.
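
To make the two candidate end offsets concrete, here is a minimal, self-contained Java sketch. It is a toy model of cumulative offset correction, not Lucene's HTMLStripCharFilter; the class OffsetSketch and the methods correct and correctEndLeft are hypothetical names introduced only for illustration. It assumes that tags are dropped, that an entity is replaced by a single space, and that a correction is looked up by the stripped-text offset.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: NOT Lucene's HTMLStripCharFilter. All names are hypothetical.
class OffsetSketch {
    // Text after stripping tags and replacing entities.
    final StringBuilder out = new StringBuilder();
    // Each entry is {offsetInStrippedText, cumulativeCharsRemovedSoFar}.
    final List<int[]> corrections = new ArrayList<>();

    OffsetSketch(String input) {
        int diff = 0;
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            int close = (c == '<') ? input.indexOf('>', i) : -1;
            int semi = (c == '&') ? input.indexOf(';', i) : -1;
            if (close >= 0) {
                // Drop the whole tag and record how many chars were removed.
                diff += close - i + 1;
                corrections.add(new int[] {out.length(), diff});
                i = close + 1;
            } else if (semi >= 0) {
                // Replace the entity with one space (an assumption of this sketch).
                out.append(' ');
                diff += semi - i; // entity length minus the one emitted char
                corrections.add(new int[] {out.length(), diff});
                i = semi + 1;
            } else {
                out.append(c);
                i++;
            }
        }
    }

    // Maps a stripped-text offset to an input offset using the last
    // correction recorded at or before that offset.
    int correct(int off) {
        int diff = 0;
        for (int[] c : corrections) {
            if (c[0] <= off) diff = c[1];
        }
        return off + diff;
    }

    // Alternative for end offsets: correct the last character of the token,
    // then add one, so trailing markup is not swallowed.
    int correctEndLeft(int end) {
        return end == 0 ? correct(0) : correct(end - 1) + 1;
    }

    public static void main(String[] args) {
        OffsetSketch tag = new OffsetSketch("<em>hello</em>");
        // The token "hello" spans [0, 5) in the stripped text.
        System.out.println(tag.correct(0));        // 4  (position of 'h')
        System.out.println(tag.correct(5));        // 14 (after </em>)
        System.out.println(tag.correctEndLeft(5)); // 9  (right after 'o')

        OffsetSketch entity = new OffsetSketch("hello&nbsp;");
        System.out.println(entity.correct(5));        // 5
        System.out.println(entity.correctEndLeft(5)); // 5
    }
}
```

In this model the closing tag's correction is recorded at the same stripped-text offset as the token's end, so a plain lookup lands at 14, while correcting the last character and adding one lands at 9. For the entity input both approaches agree on 5, which matches the behavior described above.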

Issue Links

relates to: LUCENE-6595 CharFilter offsets correction is wonky (Open)

People

Assignee: Unassigned

Reporter: David Smiley

Votes: 2

Watchers: 9

Dates

Created: 04/Jun/14 16:20

Updated: 28/Aug/22 14:09