After investigating: it's difficult to prevent negative offsets, even after fixing the term vectors writer (LUCENE-3739).
At first I tried a simple assert in BaseTokenStreamTestCase:
assertTrue("offsets must not go backwards", offsetAtt.startOffset() >= lastStartOffset);
lastStartOffset = offsetAtt.startOffset();
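Standalone, the check just walks the stream and remembers the previous start offset. A minimal self-contained sketch (plain Java over (start, end) pairs rather than a real TokenStream, since the attribute plumbing is elided here):

```java
// Hypothetical standalone version of the BaseTokenStreamTestCase check:
// verify offsets are never negative, never inverted, and never go backwards.
public class OffsetCheck {
  public static boolean offsetsOk(int[] startOffsets, int[] endOffsets) {
    int lastStartOffset = 0;
    for (int i = 0; i < startOffsets.length; i++) {
      int start = startOffsets[i];
      int end = endOffsets[i];
      if (start < 0 || end < start) return false; // negative or inverted span
      if (start < lastStartOffset) return false;  // offsets went backwards
      lastStartOffset = start;
    }
    return true;
  }

  public static void main(String[] args) {
    // normal case: offsets move forward
    System.out.println(offsetsOk(new int[] {0, 4}, new int[] {3, 8}));       // true
    // buggy filter: third token's start offset jumps back to 0
    System.out.println(offsetsOk(new int[] {0, 4, 0}, new int[] {3, 8, 8})); // false
  }
}
```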
Then these analyzers failed:
- MockCharFilter itself had a bug, but that's easy to fix.
- SynonymFilter failed sometimes (LUCENE-3742) because it wrote zeros for offsets in situations like "a -> b c"
- (Edge)NGram tokenizers failed, because ngrams(1,2) of "ABCD" are not A, AB, B, BC, C, CD, D but instead A, B, C, D, AB, BC, CD, ...
- (Edge)NGram filters failed for similar reasons.
- WordDelimiterFilter failed, because it doesn't break "AB" into A, AB, B but instead A, B, AB
- TrimFilter failed when offset-changing is enabled, because if you have " rob" and "robert" as synonyms, it trims the first and then the second token's offsets "go backwards"
These are all bugs.
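The ngram case is easy to see concretely. A hedged sketch (not the real NGramTokenizer, just reproducing the size-then-position emission order described above) of ngrams(1,2) over "ABCD":

```java
import java.util.ArrayList;
import java.util.List;

// Reproduces the emission order described above: all 1-grams first, then all
// 2-grams, so start offsets jump back to 0 when the gram size increases.
public class NGramOrderDemo {
  public static List<int[]> gramOffsets(String s, int minGram, int maxGram) {
    List<int[]> offsets = new ArrayList<>(); // each entry: {start, end}
    for (int size = minGram; size <= maxGram; size++) {
      for (int start = 0; start + size <= s.length(); start++) {
        offsets.add(new int[] {start, start + size});
      }
    }
    return offsets;
  }

  public static boolean startOffsetsMonotonic(List<int[]> offsets) {
    int last = 0;
    for (int[] o : offsets) {
      if (o[0] < last) return false;
      last = o[0];
    }
    return true;
  }

  public static void main(String[] args) {
    List<int[]> offsets = gramOffsets("ABCD", 1, 2);
    for (int[] o : offsets) {
      System.out.println("ABCD".substring(o[0], o[1]) + " [" + o[0] + "," + o[1] + ")");
    }
    // last 1-gram is "D" at start 3, but the first 2-gram "AB" starts at 0:
    System.out.println(startOffsetsMonotonic(offsets)); // false
  }
}
```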
In general I think offsets, once set, should not be changed: filters don't have access to any CharFilter's
offset correction (correctOffset()) anyway, so they shouldn't be mucking with offsets.
So really, only the creator of tokens should set the offsets. And if that's a filter, it should happen in a standard way:
only inherited from existing offsets, not 'offset mathematics', and not A, AB, B in some places and A, B, AB in others.
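To see why 'offset mathematics' in a filter is unsafe: only the tokenizer sees the CharFilter's correctOffset() mapping. A simulated example (plain Java, with a hypothetical correction table standing in for a CharFilter that rewrote "a&amp;b" to "a&b" upstream):

```java
// Toy scenario: a CharFilter rewrote "a&amp;b" (7 chars) to "a&b" (3 chars).
// correctOffset() maps offsets in the filtered text back to the original.
public class OffsetMathDemo {
  // Hypothetical correction table for "a&amp;b" -> "a&b":
  // filtered 0 -> 0 ('a'), 1 -> 1 ('&amp;'), 2 -> 6 ('b'), 3 -> 7 (end)
  public static int correctOffset(int filteredOffset) {
    int[] map = {0, 1, 6, 7};
    return map[filteredOffset];
  }

  public static void main(String[] args) {
    // The tokenizer emits one token "a&b" with *corrected* offsets [0,7).
    int tokenStart = correctOffset(0), tokenEnd = correctOffset(3);
    System.out.println("token [" + tokenStart + "," + tokenEnd + ")");

    // A downstream filter splitting that token and doing arithmetic on the
    // corrected offsets computes 'b' at [tokenStart+2, tokenStart+3) = [2,3)...
    int naiveStart = tokenStart + 2, naiveEnd = tokenStart + 3;
    // ...but 'b' really sits at [6,7) in the original text, and the filter
    // has no way to call correctOffset() to find that out.
    int trueStart = correctOffset(2), trueEnd = correctOffset(3);
    System.out.println("naive [" + naiveStart + "," + naiveEnd
        + ") vs true [" + trueStart + "," + trueEnd + ")");
  }
}
```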
Really I think we need to step it up if we want highlighting to be a first-class citizen in Lucene. Nothing checks the offsets anywhere at all,
not even to assert that they aren't negative, and there are few tests: all we have is some newish stuff in BaseTokenStreamTestCase and
a few trivial test cases.
On the other hand, for example, the position increment's impl actually throws an exception if you give it something like a negative number...
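An analogous fail-fast guard for offsets might look like the following sketch (this is not Lucene's actual OffsetAttributeImpl, which performs no such check at this point; it just mirrors the spirit of the position-increment validation):

```java
// Hypothetical offset attribute that validates its inputs up front, the way
// the position increment implementation rejects negative increments.
public class CheckedOffsetAttribute {
  private int startOffset;
  private int endOffset;

  public void setOffset(int startOffset, int endOffset) {
    if (startOffset < 0 || endOffset < startOffset) {
      throw new IllegalArgumentException(
          "startOffset must be non-negative and <= endOffset; got startOffset="
              + startOffset + ", endOffset=" + endOffset);
    }
    this.startOffset = startOffset;
    this.endOffset = endOffset;
  }

  public int startOffset() { return startOffset; }
  public int endOffset() { return endOffset; }
}
```

With a check like this, the broken analyzers above would fail loudly at analysis time instead of silently corrupting term vectors.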