Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 6.5, 7.0
Description
WordDelimiterFilter produces an incorrect start offset when its input comes through HTMLStripCharFilter. WordDelimiterGraphFilter does not have this problem.
Test code:
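import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class TestTokenizationIssue2 {

    public static void main(String... args) throws IOException {
        HTMLStripCharFilter charFilter = new HTMLStripCharFilter(getText());
        WhitespaceTokenizer whitespaceTokenizer = new WhitespaceTokenizer();
        whitespaceTokenizer.setReader(charFilter);

        // Swap in the graph filter to compare the offsets:
        // WordDelimiterGraphFilter wdgf = new WordDelimiterGraphFilter(whitespaceTokenizer,
        //         WordDelimiterGraphFilter.GENERATE_WORD_PARTS, CharArraySet.EMPTY_SET);
        WordDelimiterFilter wdgf = new WordDelimiterFilter(whitespaceTokenizer,
                WordDelimiterFilter.GENERATE_WORD_PARTS, CharArraySet.EMPTY_SET);

        wdgf.reset();
        while (wdgf.incrementToken()) {
            CharTermAttribute charTermAttribute = wdgf.getAttribute(CharTermAttribute.class);
            OffsetAttribute offsetAttribute = wdgf.getAttribute(OffsetAttribute.class);
            // Print each term with its start and end offsets.
            System.out.println(charTermAttribute.toString() + " - "
                    + offsetAttribute.startOffset() + ',' + offsetAttribute.endOffset());
        }
    }

    private static Reader getText() {
        // The opening quote is written as a numeric character reference (&#...;) in the
        // input; it is displayed here as the decoded character.
        return new StringReader("“Risk");
    }
}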
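WordDelimiterFilter produces the offsets 1,10, whereas WordDelimiterGraphFilter produces 0,10. The offsets should be 0,10, because this is the original text:
“Risk
Offset 1 falls between the ampersand and the hash of the character reference that encodes the opening quote, so it does not point at the start of anything in the original text.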
Inside WordDelimiterFilter, I believe the conditional branch "if (isSingleWord && startOffset <= savedEndOffset)" is invalid. The filter should always use the saved start and end offsets here, because it cannot assume that the iterator's current and end positions are reliable markers into the original text once a char filter has changed the text length.
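The difference between the two behaviours can be sketched without Lucene on the classpath. This is only an illustrative model, not the filter's actual code: the class `OffsetSketch`, its methods, and the parameter `partStart` are hypothetical names. It assumes that the candidate start offset is the saved start offset plus the part's position inside the term, and that the leading quote is skipped as a delimiter so the part starts at position 1. Both assumptions are consistent with the 1,10 result reported above.

```java
public class OffsetSketch {

    /**
     * Models the branch quoted in this issue: for a single-word token whose
     * candidate start does not exceed the saved end offset, the start offset
     * is shifted by the part's position inside the term (assumption).
     */
    static int[] currentBehavior(int savedStartOffset, int savedEndOffset,
                                 int partStart, boolean isSingleWord) {
        int startOffset = savedStartOffset + partStart;
        if (isSingleWord && startOffset <= savedEndOffset) {
            return new int[] {startOffset, savedEndOffset};
        }
        return new int[] {savedStartOffset, savedEndOffset};
    }

    /** Models the proposed fix: always fall back to the saved offsets. */
    static int[] proposedBehavior(int savedStartOffset, int savedEndOffset) {
        return new int[] {savedStartOffset, savedEndOffset};
    }

    public static void main(String[] args) {
        // Saved offsets 0,10 come from the char filter's offset correction;
        // the word part starts at position 1 of the 5-character term.
        int[] current = currentBehavior(0, 10, 1, true);
        int[] proposed = proposedBehavior(0, 10);
        System.out.println("current:  " + current[0] + "," + current[1]);
        System.out.println("proposed: " + proposed[0] + "," + proposed[1]);
    }
}
```

In this model the shifted start offset is a position in the filtered term, not in the original text, which is why adding it to the saved start offset gives a value that points into the middle of the character reference.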