DM, I really appreciate your review. You have brought up some good ideas that I haven't yet thought about.
All I see is a bit of JavaDoc and an extraneous unused variable (ICUTokenizer: private PositionIncrementAttribute posIncAtt).
Yeah, there are some TODOs, and cleanup needed on the tokenstreams and the API in general. It's not easy to customize the way it's supposed to be: where you as a user can actually supply BreakIterator impls to the tokenizer and say "use these rules/dictionary/whatever for tokenizing XYZ script only".
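To illustrate the kind of hook I mean, something like the sketch below. The config class and method name are made up for this sketch; RuleBasedBreakIterator and UScript are the real ICU4J classes.

    import com.ibm.icu.lang.UScript;
    import com.ibm.icu.text.BreakIterator;
    import com.ibm.icu.text.RuleBasedBreakIterator;

    // Hypothetical extension point: hand the tokenizer a BreakIterator per script.
    public class CustomScriptConfig {
        private final BreakIterator defaultBreaker = BreakIterator.getWordInstance();
        private final BreakIterator thaiBreaker;

        public CustomScriptConfig(String thaiRules) {
            // user-supplied break rules (ICU rule syntax), used for Thai text only
            this.thaiBreaker = new RuleBasedBreakIterator(thaiRules);
        }

        public BreakIterator getBreakIterator(int script) {
            return script == UScript.THAI ? thaiBreaker : defaultBreaker;
        }
    }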
I'm wondering whether it would make sense to have multiple representations of a token at the same position in the index, specifically transliterations and case-foldings. That is, the one is a "synonym" for the other. Is that possible, and does it make sense? I'm imagining a use case where an end user enters a Latin-script transliteration of Greek, "uios", as a search request, but might also enter "υιος".
Yeah, this is something to consider. I don't think it makes sense for the case folding filter, but maybe for the transform filter? Will have to think about it.
There are use cases here like the one you mentioned, and also real-world ones involving Serbian-Latin or something, where you want users to search in either writing system and there actually is a clearly defined transformation.
I guess on the other hand, you could always use a separate field (with different analysis/transforms) for each and search both.
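Mechanically, the stacking itself is straightforward: a filter can emit the second form at the same position by setting its position increment to 0. Here is a rough sketch of such a filter, written against a newer Lucene attribute API rather than the 2.9/Java 1.4 code here; the filter itself is made up, but Transliterator is ICU4J's real transliteration API.

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
    import com.ibm.icu.text.Transliterator;

    // Emits a Greek-Latin transliteration of each token at the same
    // position (posIncrement=0), so "υιος" and "uios" both match.
    public final class TransliterationSynonymFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final PositionIncrementAttribute posIncAtt =
            addAttribute(PositionIncrementAttribute.class);
        private final Transliterator translit = Transliterator.getInstance("Greek-Latin");
        private State saved;    // captured state of the original token
        private String stacked; // transliterated form still to emit

        public TransliterationSynonymFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (stacked != null) {
                restoreState(saved);               // keep the original's offsets etc.
                termAtt.setEmpty().append(stacked);
                posIncAtt.setPositionIncrement(0); // same position as the original
                stacked = null;
                return true;
            }
            if (!input.incrementToken()) {
                return false;
            }
            String form = translit.transliterate(termAtt.toString());
            if (!form.contentEquals(termAtt)) {    // only stack if it actually differs
                saved = captureState();
                stacked = form;
            }
            return true;
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            saved = null;
            stacked = null;
        }
    }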
The other question on my mind: given a text containing German, Greek, and Hebrew (three distinct scripts), does it make sense to apply stop words to them based on script? And should stop words be normalized on load with the ICUNormalizationFilter, or is it a given that they work as-is?
You could put them all in one list with the regular stopfilter now. They won't clash, since they are different Unicode strings. Obviously I would normalize this list with the same stuff (normalization form/case folding/whatever) that your analyzer uses.
I don't put any stopwords in this, because that's language-dependent; I'm trying to stick with language-independent stuff (either things that apply to Unicode as a whole, or to specific writing systems, which can be accurately detected).
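For example, building one normalized multi-script list might look like this sketch. It assumes ICU4J's Normalizer2 with NFKC plus case folding, matching what an ICU folding/normalization filter would do; the stopwords themselves are just illustrative.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import com.ibm.icu.text.Normalizer2;

    public class MultiScriptStopwords {
        // Normalize the list the same way as the analysis chain.
        public static Set<String> build(List<String> raw) {
            Normalizer2 norm = Normalizer2.getNFKCCasefoldInstance();
            Set<String> out = new HashSet<String>();
            for (String w : raw) {
                out.add(norm.normalize(w));
            }
            return out;
        }

        public static void main(String[] args) {
            // German, Greek, and Hebrew entries share one set without clashing,
            // since they are distinct Unicode strings. Example words only.
            System.out.println(build(Arrays.asList("und", "der", "και", "το", "של", "את")));
        }
    }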
Can all this integrate with stemmers, and if so, how?
Right, this is just supposed to be what "StandardTokenizer"-type stuff does, and you would add stemming on top of it. The idea is that you would use this even if you think you only have English text, maybe then applying your Porter English stemmer. But if it happens to stumble upon some CJK or Thai or something along the way, everything will be OK.
In all honesty, I probably put 90% of the work into the Khmer, Myanmar, Lao, etc. cases. Good tokenization, I think, is what makes a usable search engine; for a lot of languages stemming is only a bonus.
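The chain would look something like this sketch. It uses the class names and packages from the later Lucene ICU module (ICUTokenizer, ICUFoldingFilter) and a modern Analyzer API, so treat the exact names as assumptions rather than this patch's API.

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.en.PorterStemFilter;
    import org.apache.lucene.analysis.icu.ICUFoldingFilter;
    import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;

    public class ScriptAwareEnglishAnalyzer {
        public static Analyzer create() {
            return new Analyzer() {
                @Override
                protected TokenStreamComponents createComponents(String fieldName) {
                    Tokenizer tok = new ICUTokenizer();         // unicode-aware segmentation
                    TokenStream ts = new ICUFoldingFilter(tok); // normalization + case folding
                    ts = new PorterStemFilter(ts);              // English stemming on top;
                                                                // non-English tokens mostly pass through
                    return new TokenStreamComponents(tok, ts);
                }
            };
        }
    }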
However, one thing it also does is put the script value in the flags for each token. This can work pretty well: if it's Greek script, it's probably Greek language; but if it's Hebrew script, well, it could be Yiddish too. If it's Latin script, it could be English, German, etc. It's intended only to make life easier, since the information is already available... but I don't know yet how to make use of it in a nice way.
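Downstream, a consumer could at least read the script back out, along these lines (a sketch assuming the tokenizer stores a UScript constant in FlagsAttribute, which is this patch's convention; the helper itself is made up):

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.FlagsAttribute;
    import com.ibm.icu.lang.UScript;

    public class ScriptDump {
        // Prints each token with its script name, read back from the flags.
        static void dump(TokenStream ts) throws IOException {
            FlagsAttribute flagsAtt = ts.addAttribute(FlagsAttribute.class);
            CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(termAtt + " script=" + UScript.getName(flagsAtt.getFlags()));
            }
            ts.end();
            ts.close();
        }
    }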
Again, many thanks! (Btw, special thanks for making this work with Lucene 2.9 and Java 1.4!)
Yeah, I haven't updated it to Java 5/Lucene 3.x yet; I started working on it, but kinda forgot about that so far. I guess this is a good thing, so you can play with it if you want.