Details
- Type: Bug
- Status: Open
- Priority: Minor
- Resolution: Unresolved
- Lucene Fields: New
Description
Consider this simple input:
<em>hello</em>
to be analyzed by HTMLStripCharFilter and WhitespaceTokenizer.
You get back one token for "hello", which is good. The start offset of this token is at the position of 'h', which is also good. But the end offset is, surprisingly, one past the end of the adjacent </em> tag. I argue that it should be one past the last character of the token, i.e. immediately following 'o'.
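To make the two candidate end offsets concrete, here is a minimal sketch that computes them by plain string arithmetic on the input above. It does not call Lucene at all; the class and method names are hypothetical, and the "after closing tag" value reflects my reading of the reported behavior rather than output captured from HTMLStripCharFilter.

```java
public class OffsetExample {
    // The raw input from the report, before any HTML stripping.
    static final String INPUT = "<em>hello</em>";
    // The single token the analysis chain is expected to produce.
    static final String TOKEN = "hello";
    static final String CLOSING_TAG = "</em>";

    // Offset of 'h' in the original (unstripped) text.
    static int startOffset() {
        return INPUT.indexOf(TOKEN);
    }

    // The end offset argued for in the report: immediately after 'o'.
    static int expectedEndOffset() {
        return startOffset() + TOKEN.length();
    }

    // The end offset as described in the report (assumption): just past the closing tag.
    static int endOffsetAfterClosingTag() {
        return INPUT.indexOf(CLOSING_TAG) + CLOSING_TAG.length();
    }

    public static void main(String[] args) {
        System.out.println("start offset:            " + startOffset());              // 4
        System.out.println("expected end offset:     " + expectedEndOffset());        // 9
        System.out.println("end offset after </em>:  " + endOffsetAfterClosingTag()); // 14
    }
}
```

With the expected end offset, INPUT.substring(4, 9) yields exactly "hello"; with the other value, the same call with 14 would yield "hello</em>", which is why the difference matters for anything that slices the original text by offsets, such as highlighting.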
FYI, it behaves as I expect if hello is followed by an XML entity, as in this example:
hello
The end offset immediately follows the 'o'.
Issue Links
- relates to LUCENE-6595 CharFilter offsets correction is wonky (Open)