Here is a short explanation of what I figure might be the controversial part: adding all the language-specific analyzers.
I think it's too difficult for a non-English user to use Lucene.
Let's take the Romanian case. Sure, it's supported by SnowballAnalyzer, but:
- Where are the stopwords? If the user is smart enough they can google this and find Savoy's list... but it contains some stray nouns that should not be in there, and will they get the encoding correct?
- For some languages (French, Dutch, Turkish) we already want to do something different. For French we need the elision filter to tokenize correctly; for Dutch, the special dictionary-based exclusions (I have been told by some that any stemmer that does not handle "fiets" correctly is useless); for Turkish, the special lowercasing.
- For other languages (German, Swedish, ...) I think we REALLY want to implement decompounding support in the future. For German at least, there is a public domain wordlist just itching to be used for this.
- Oh yeah, and all the javadocs are in English, so writing your own analyzer is another barrier to entry.
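To make the French and Turkish points concrete, here is a rough plain-JDK sketch of the two quirks. It does not use Lucene at all: the class LanguageQuirks, its methods, and the short elision list are hypothetical and made up for illustration. It only shows why a generic lowercase-and-split pipeline gets these languages wrong.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Lucene.
class LanguageQuirks {
    private static final Locale TURKISH = Locale.forLanguageTag("tr");

    // Assumed, incomplete list of French elided prefixes (illustrative only).
    private static final List<String> ELISIONS =
            Arrays.asList("l", "d", "j", "m", "t", "s", "n", "c", "qu");

    // Locale-neutral lowercasing: "I" becomes "i".
    static String lowerGeneric(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Turkish lowercasing: "I" becomes U+0131 (dotless i), U+0130 becomes "i".
    static String lowerTurkish(String s) {
        return s.toLowerCase(TURKISH);
    }

    // Drops an elided prefix such as "l'" so "l'avion" is indexed as "avion".
    // Only the ASCII apostrophe is handled in this sketch.
    static String stripElision(String token) {
        int apos = token.indexOf('\'');
        if (apos > 0) {
            String prefix = token.substring(0, apos).toLowerCase(Locale.ROOT);
            if (ELISIONS.contains(prefix)) {
                return token.substring(apos + 1);
            }
        }
        return token;
    }

    public static void main(String[] args) {
        // The same input lowercases differently depending on the locale rules.
        System.out.println(lowerGeneric("ISI").equals(lowerTurkish("ISI")));
        System.out.println(stripElision("l'avion"));
        System.out.println(stripElision("aujourd'hui"));
    }
}
```

A user who only knows the generic analyzers gets the lowerGeneric behaviour and unstripped "l'avion" tokens, with nothing telling them either is wrong for their language.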
So I think instead it's best to have a "recommended default" organized by language, preferably one we have relevance tested or that is already published. Many of the existing Snowball stemmers have published relevance results available already, thus my bias towards them. Sure, it won't meet everyone's needs, and users should still think about using them as a template. But digging up your own stoplist, writing your own analyzer, and figuring out that your language support is really buried in Snowball, combined with documentation not in your native tongue: I think this adds up to a barrier to entry that is simply too high.