Naming will require some thought, though - I don't like EnglishTokenizer or EuropeanTokenizer - both seem to exclude valid constituencies.
What valid constituencies are you referring to?
Well, we can't call it English/EuropeanTokenizer (maybe EnglishAndEuropeanAnalyzer? seems too long), and calling it either only English or only European seems to leave the other out. Americans, e.g., don't consider themselves European, maybe not even linguistically (however incorrect that might be).
In general, the acronym/company/possessive stuff here is very English/Euro-specific.
Right, I agree. I'm just looking for a name that unambiguously covers the languages of interest. WesternTokenizer? (But: "I live east of the Rockies - can I use WesternTokenizer?"...) Maybe EuropeanLanguagesTokenizer? The difficulty, as I see it, is the messy intersection among political, geographic, and linguistic boundaries.
Bugs get opened in JIRA if it doesn't do this stuff right for English, but it doesn't work at all for a lot of languages. Personally, I think it's great to rip this stuff out of what should be a "default" language-independent tokenizer based on standards (StandardTokenizer) and put it into the language-specific package where it belongs. Otherwise we have to worry about this sort of thing overriding and screwing up the UAX#29 rules for words in real languages.
I assume you don't mean to say that English and European languages are not real languages.
What do you think about adding tailorings for Thai, Lao, Myanmar, Chinese, and Japanese? (Are there others like these that aren't well served by UAX#29 without customizations?)
It gets a little tricky: we should be careful about how we interpret what is "reasonable" for a language-independent default tokenizer. I think it's "enough" to output the best indexing unit that is possible and relatively unambiguous to identify. This is a shortcut we can take because we are tokenizing for information retrieval, not for other purposes. The approach for Lao, Myanmar, Khmer, CJK, etc. in ICUTokenizer is to just output syllables as the indexing unit, since words are ambiguous to identify. Thai is based on words, not syllables, in ICUTokenizer, which is inconsistent with this, but we get it for free, so it's just a laziness thing.
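To make that concrete, here's a minimal sketch that feeds text through ICUTokenizer and prints the emitted terms. This assumes the lucene-analyzers-icu module is on the classpath; the constructor/setReader details differ across Lucene versions, and the Thai sample string is just for illustration:

import java.io.StringReader;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class IcuSegmentationDemo {
  public static void main(String[] args) throws Exception {
    // Thai gets dictionary-based word segmentation "for free" from ICU;
    // scripts like Lao, Myanmar, and Khmer come out as syllables instead.
    ICUTokenizer tokenizer = new ICUTokenizer();
    tokenizer.setReader(new StringReader("ภาษาไทยง่ายนิดเดียว"));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString());
    }
    tokenizer.end();
    tokenizer.close();
  }
}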
I think that StandardTokenizer should contain tailorings for CJK, Thai, Lao, Myanmar, and Khmer, then - it should be able to do reasonable things for all languages/scripts, to the greatest extent possible.
The English/European tokenizer can then extend StandardTokenizer (conceptually, not in the Java sense).
I'm thinking of leaving UAX29Tokenizer as-is, and adding tailorings as separate classes - what do you think?
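For what it's worth, here's a minimal sketch of how that layering could look against a recent Lucene API. EnglishTailoredAnalyzer and PossessiveStripFilter are invented names, not existing classes, and the exact packages and createComponents signature vary across Lucene versions:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// "Extends conceptually, not in the Java sense": compose the language-independent
// tokenizer with English-specific filters instead of subclassing its grammar.
public final class EnglishTailoredAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();       // standards-based UAX#29 core
    TokenStream sink = new LowerCaseFilter(source);   // language-independent
    sink = new PossessiveStripFilter(sink);           // English-only tailoring, sketched further down
    return new TokenStreamComponents(source, sink);
  }
}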
Well, either way, I again strongly feel this logic should be tied into the "Standard" tokenizer, so that it has better Unicode behavior. I think it makes sense for us to have a reasonable, language-independent, standards-based tokenizer that works well for most languages. It also makes sense to have English/Euro-centric stuff that's language-specific sitting in the analysis.en package, just like we do with other languages.
I agree that stuff like giving "O'Reilly's" the <APOSTROPHE> type, to enable the so-called StandardFilter to strip out the trailing /'s/, is stupid for all non-English languages.
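To show how small that tailoring actually is once it's pulled out of the grammar, here's a sketch of it as a standalone filter. PossessiveStripFilter is an invented name (compare Lucene's real EnglishPossessiveFilter in analysis.en, which works along these lines):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Strips a trailing 's from each token: "O'Reilly's" -> "O'Reilly".
// Pure post-processing, so it never touches the tokenizer's UAX#29 rules.
public final class PossessiveStripFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public PossessiveStripFilter(TokenStream in) {
    super(in);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    int len = termAtt.length();
    char[] buf = termAtt.buffer();
    // Also handle the typographic apostrophe U+2019.
    if (len >= 2
        && (buf[len - 2] == '\'' || buf[len - 2] == '\u2019')
        && (buf[len - 1] == 's' || buf[len - 1] == 'S')) {
      termAtt.setLength(len - 2);
    }
    return true;
  }
}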
It might be confusing, though, for a user of (e.g.) Greek to have to go look in the analysis.en package to get reasonable handling of her language.
Maybe an EnglishTokenizer, and separately a EuropeanAnalyzer? Is that what you've been driving at all along??? (Silly me.... Sigh.)