DM, thanks, I see exactly where you are coming from.
I see your point: previously it was much easier to take something like SimpleAnalyzer and 'adapt' it to a given language based on things like Unicode properties.
In fact, that's exactly what we did in the cases here (Arabic, Persian, Hindi, etc).
But now we can actually tokenize "correctly" for more languages with JFlex, thanks to its improved Unicode support, and it's superior to these previous hacks.
To try to answer some of your questions (all my opinion):
Is there a point to having SimpleAnalyzer?
I guess so; a lot of people can use this if they have English-only content and are probably happy with discarding numbers, etc. It's not a big loss to me if it goes, though.
Shouldn't UAX29Tokenizer be moved to core? (What is core anyway?)
In trunk (the 4.x codeline) there is no core, contrib, or Solr for analyzer components any more; they are all combined into modules/analysis.
In branch_3x (the 3.x codeline) we did not make this rather disruptive refactor: there, UAX29Tokenizer is in fact in Lucene core.
Would there be a way to plug in ICUTokenizer as a replacement for UAX29Tokenizer into StandardTokenizer, such that all Analyzers using StandardTokenizer would get the alternate implementation?
Personally, I would prefer that we move towards a factory model where these supplied "language analyzers" are actually xml/json/properties snippets.
In other words, they would just be example configurations that build your analyzer, like Solr does.
This is nice, because then you don't have to write code to customize how your analyzer works.
I think we have been making slow steps towards this, doing basic things like moving stopword lists to .txt files.
But I think the next step would be LUCENE-2510, where we already have factories/config attribute parsers written for all these analysis components.
Then we could have support for declarative analyzer specification via xml/json/.properties/whatever, and move all these Analyzers to that.
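For a concrete idea of what I mean, here is a hypothetical Solr-style snippet (the fieldType name and filter choices are just for illustration): the analysis chain is assembled from factory declarations, so swapping StandardTokenizerFactory for, say, ICUTokenizerFactory is a one-line config change with no recompilation.

```xml
<!-- Hypothetical example configuration: the analyzer is declared, not coded.
     Each factory class here corresponds to an existing analysis component. -->
<fieldType name="text_general" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
  </analyzer>
</fieldType>
```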
I still think you should be able to code up your own analyzer, but in my opinion this approach is much easier and preferable for the ones we supply.
Also, I think this would solve a lot of analyzer backwards-compatibility problems, because then our supplied analyzers are really just example configuration files,
and we can change our examples however we want... someone can use their old config file (and hopefully their old analysis module jar file!) to guarantee
exactly the same behavior if they want.
Finally, most of the benefits of ICUTokenizer are actually in the UAX29 support... the tokenizers are pretty close, with some minor differences:
- the JFlex-based implementation is faster, and better in my opinion.
- the ICU-based implementation allows tailoring, and supplies tailored tokenization for several complex scripts (JFlex doesn't have this... yet)
- the ICU-based implementation works with all of Unicode; at the moment JFlex is limited to the basic multilingual plane.
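To illustrate why that last limitation matters, here is a small self-contained Java sketch (plain JDK, not Lucene or JFlex code): a supplementary code point outside the basic multilingual plane is stored as a surrogate pair, so a scanner that works char-by-char over the BMP sees two units where there is really one character.

```java
// Illustration only: why BMP-only tokenization loses supplementary characters.
public class SupplementaryDemo {
    public static void main(String[] args) {
        String bmp = "\u4e2d"; // U+4E2D, a CJK ideograph inside the BMP

        // U+2070E is a CJK Extension B ideograph outside the BMP;
        // Java represents it as a surrogate pair of two chars.
        String supplementary = new String(Character.toChars(0x2070E));

        System.out.println(bmp.length());           // 1 char, 1 code point
        System.out.println(supplementary.length()); // 2 chars...
        System.out.println(
            supplementary.codePointCount(0, supplementary.length())); // ...but 1 code point
    }
}
```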
In my opinion the last two points will probably be resolved eventually... I could see our ICUTokenizer possibly becoming obsolete down the road
thanks to better JFlex support, though that support would probably need hooks into ICU for the complex-script handling (so we get it for free from ICU).