[LUCENE-3717] Add fake charfilter to BaseTokenStreamTestCase to find offsets bugs - ASF JIRA

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.6, 4.0-ALPHA
Component/s: None
Labels: None

Lucene Fields: New

Description

Many issues involving broken offsets have been fixed recently, but it would be nice to improve the
test coverage and verify that offsets are correct across the board (especially with charfilters).

In BaseTokenStreamTestCase.checkRandomData, we can sometimes pass the analyzer a reader wrapped
in a "MockCharFilter" (the one in the patch sometimes doubles characters). If the analyzer does
not call correctOffset() or does incorrect "offset math" (LUCENE-3642, etc.), then eventually
this will produce inconsistent offsets and the test will fail.
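
To illustrate why doubled characters expose offset bugs, here is a minimal sketch in plain Java. It is not the MockCharFilter from the patch and does not use Lucene classes: the class name DoublingSketch, its deterministic "double every Nth character" rule, and its correctOffset method are assumptions made for illustration only. The idea is that a component which reports offsets measured in the filtered text, without mapping them back, ends up pointing at the wrong span of the original text.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the idea behind a character-doubling filter; not the patch's class. */
final class DoublingSketch {
    private final String output;
    // Positions in the filtered text at which a duplicated character was inserted.
    private final List<Integer> insertedAt = new ArrayList<>();

    /** Doubles each character of input whose index is a multiple of every (every must be > 0). */
    DoublingSketch(String input, int every) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            sb.append(c);
            if (i % every == 0) {
                insertedAt.add(sb.length()); // where the duplicate lands in the filtered text
                sb.append(c);
            }
        }
        output = sb.toString();
    }

    String output() {
        return output;
    }

    /** Maps an offset in the filtered text back to an offset in the original text. */
    int correctOffset(int filteredOffset) {
        int extra = 0;
        for (int pos : insertedAt) {
            if (pos < filteredOffset) {
                extra++; // one inserted character lies before this offset
            }
        }
        return filteredOffset - extra;
    }

    public static void main(String[] args) {
        DoublingSketch s = new DoublingSketch("abcd", 2);
        // "b" occupies [2,3) in the filtered text "aabccd" but [1,2) in the original "abcd".
        System.out.println(s.output() + " " + s.correctOffset(2) + " " + s.correctOffset(3));
    }
}
```

In this sketch, a tokenizer that emitted the raw span [2,3) for "b" would highlight "c" in the original text; only the corrected span [1,2) is right. A final offset reported without correction would likewise be 6 instead of 4, exceeding the original length.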

Apart from test bugs, this found 2 real bugs: ICUTokenizer did not call correctOffset() in its end(),
and ThaiWordFilter did incorrect offset math.

Attachments

LUCENE-3717_more.patch, 23/Jan/12 03:52, 39 kB, Robert Muir
LUCENE-3717_ngram.patch, 24/Jan/12 09:48, 22 kB, Robert Muir
LUCENE-3717.patch, 22/Jan/12 23:30, 14 kB, Robert Muir

People

Assignee: Unassigned
Reporter: Robert Muir

Dates

Created: 22/Jan/12 23:29
Updated: 28/Aug/22 13:06
Resolved: 24/Jan/12 10:52