[LUCENE-7795] WordDelimiterFilter produces invalid offsets in single word case - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 6.5, 7.0
Fix Version/s: None
Component/s: None
Labels:
None

Lucene Fields:

New

Description

This problem is not present in WordDelimiterGraphFilter, but it is present in WordDelimiterFilter's interaction with HTMLStripCharFilter.

Test code:

public class TestTokenizationIssue2 {
    public static void main(String... args) throws IOException {
        HTMLStripCharFilter charFilter = new HTMLStripCharFilter(getText());
        WhitespaceTokenizer whitespaceTokenizer = new WhitespaceTokenizer();
        whitespaceTokenizer.setReader(charFilter);
       // WordDelimiterGraphFilter wdgf = new WordDelimiterGraphFilter(whitespaceTokenizer,
        //       WordDelimiterGraphFilter.GENERATE_WORD_PARTS, CharArraySet.EMPTY_SET);

        WordDelimiterFilter wdgf = new WordDelimiterFilter(whitespaceTokenizer,
               WordDelimiterFilter.GENERATE_WORD_PARTS, CharArraySet.EMPTY_SET);
        wdgf.reset();

        while (wdgf.incrementToken()) {
            CharTermAttribute charTermAttribute = wdgf.getAttribute(CharTermAttribute.class);
            OffsetAttribute offsetAttribute = wdgf.getAttribute(OffsetAttribute.class);

            System.out.println(charTermAttribute.toString() + " - " + offsetAttribute.startOffset() + ',' + offsetAttribute.endOffset());
        }
    }

    private static Reader getText() {
        return new StringReader("&#x93;Risk");
    }
}

The offsets produced by the WordDelimiterFilter are 1,10. With WordDelimiterGraphFilter the offsets produced are 0,10. It should be 0,10 as this is the original text:

&#x93;Risk

- and 1 is between the ampersand and hash.

Inside WordDelimiterFilter, I believe the conditional branch from "if (isSingleWord && startOffset <= savedEndOffset) " is invalid and it should always use the saved start and end offsets because it can't make the assertion that the iterator's current and end are reliable markers.

Attachments

Issue Links

links to

GitHub Pull Request #194

Activity

People

Assignee:: Unassigned

Reporter:: Michael Braun

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 20/Apr/17 21:15

Updated:: 28/Aug/22 15:14