ThaiWordFilter is an offender in TestRandomChains because it creates positions and updates offsets.
Fix analyzer bugs documented in TestRandomChains
SmartChineseAnalyzer got wrong matched offset
I think this should be a tokenizer.
Good point, I'll update the patch to create a ThaiTokenizer so that we can just completely deprecate this filter.
Tokenizing from a BreakIterator can get a little tricky.
We had some support for this (it should be re-reviewed) in the initial Kuromoji integration (SegmentingTokenizerBase.java and its test).
But we ended up adding a streaming Viterbi search, so we didn't need it anymore.
I cut this over to ThaiTokenizer with that base class restored from Kuromoji. The tokenizer itself is simpler now. I think we can use the same approach with SmartChinese.
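For readers unfamiliar with the approach, here is a minimal sketch of tokenizing from a BreakIterator using only the JDK. It is not the code in the patch: the class `BreakIteratorSketch`, its `Token` type, and the `tokenize` method are hypothetical names, and the real tokenizer builds on Lucene's attribute API rather than returning a list. The point it illustrates is that each token's offsets come straight from the boundaries of the original text, so no later stage has to adjust them.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration, not the ThaiTokenizer from the patch.
class BreakIteratorSketch {

    /** A token plus its character offsets into the original text. */
    static final class Token {
        final String text;
        final int startOffset;
        final int endOffset;

        Token(String text, int startOffset, int endOffset) {
            this.text = text;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Splits text at word boundaries and records offsets from the boundaries themselves. */
    static List<Token> tokenize(String text, Locale locale) {
        List<Token> tokens = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);

        int start = words.first();
        int end = words.next();
        while (end != BreakIterator.DONE) {
            // The iterator also reports runs of whitespace as segments; skip those.
            if (!isWhitespaceOnly(text, start, end)) {
                tokens.add(new Token(text.substring(start, end), start, end));
            }
            start = end;
            end = words.next();
        }
        return tokens;
    }

    private static boolean isWhitespaceOnly(String text, int start, int end) {
        for (int i = start; i < end; ) {
            int cp = text.codePointAt(i);
            if (!Character.isWhitespace(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }
}
```

With the JDK's default rules, `BreakIteratorSketch.tokenize("Hello world.", Locale.ROOT)` yields three tokens: `Hello` at [0, 5), `world` at [6, 11), and `.` at [11, 12). Because the tokenizer sees the original text, positions and offsets are decided in one place, which is why a tokenizer is a better fit here than a filter that rewrites them downstream.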
Updated patch: I also cut over SmartChinese to use this same approach while we are here.
+1, patch lgtm
Is fixing Smart Chinese to not emit punctuation as simple as hardcoding the list of punctuation characters and skipping them in something like incrementWord()?
It's even simpler than that. But I wanted to do that in a followup issue. 4.8 is a good time to fix it, as it's easy with this tokenizer!
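As a rough illustration of the idea in the question above, and not the fix planned for the followup issue, punctuation can be recognised from Unicode general categories instead of a hardcoded character list. The class `PunctuationSketch` and its methods are hypothetical names; a real tokenizer would apply such a check while producing each word rather than filtering a list afterwards.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not code from smartcn.
final class PunctuationSketch {

    /** True if the code point is in one of the Unicode punctuation categories. */
    static boolean isPunctuation(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    /** True if the word is non-empty and consists only of punctuation. */
    static boolean isAllPunctuation(String word) {
        if (word.isEmpty()) {
            return false;
        }
        for (int i = 0; i < word.length(); ) {
            int cp = word.codePointAt(i);
            if (!isPunctuation(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    /** Returns the words with pure-punctuation entries removed. */
    static List<String> dropPunctuation(List<String> words) {
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            if (!isAllPunctuation(word)) {
                kept.add(word);
            }
        }
        return kept;
    }
}
```

Using general categories covers both ASCII punctuation and CJK marks such as the ideographic full stop (U+3002) without maintaining a list by hand.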
I really like the base class! The patch LGTM +1 to commit
Commit 1579846 from Robert Muir in branch 'dev/trunk'
[ https://svn.apache.org/r1579846 ]
LUCENE-4984: Fix ThaiWordFilter, smartcn WordTokenFilter
Commit 1579853 from Robert Muir in branch 'dev/trunk'
[ https://svn.apache.org/r1579853 ]
LUCENE-4984: actually pass down the AttributeFactory to superclass
Commit 1579855 from Robert Muir in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1579855 ]
Close issue after release of 4.8.0