How would this work? E.g. many contribs depend on the common-analyzers module. Removing this dependency would almost certainly make the contribs non-functional.
The dependency is mostly bogus. Here are the contribs in question:
For example the ant IndexTask only depends on this so it can make this hashmap:
I think we could remove this: it already has reflection code to build the analyzer, so if you supply "Xyz", why not just look for XyzAnalyzer as a fallback?
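A rough sketch of that fallback, to make the idea concrete. Everything here is hypothetical: the nested Analyzer interface and XyzAnalyzer class are stand-ins so the example compiles without Lucene, and resolve()/create() are illustrative names, not the IndexTask's actual methods.

```java
import java.util.Optional;

/** Sketch: resolve a user-supplied name to a class, trying name + "Analyzer" as a fallback. */
final class AnalyzerLookup {

    /** Hypothetical stand-in for an analyzer type; this is not Lucene's Analyzer. */
    interface Analyzer {}

    /** Hypothetical analyzer that exists only to exercise the lookup. */
    static final class XyzAnalyzer implements Analyzer {
        public XyzAnalyzer() {}
    }

    private AnalyzerLookup() {}

    /**
     * Tries prefix + name first, then prefix + name + "Analyzer".
     * Returns empty if neither resolves to an Analyzer implementation.
     */
    static Optional<Class<? extends Analyzer>> resolve(String prefix, String name) {
        for (String candidate : new String[] {name, name + "Analyzer"}) {
            try {
                Class<?> clazz = Class.forName(prefix + candidate);
                if (Analyzer.class.isAssignableFrom(clazz)) {
                    return Optional.of(clazz.asSubclass(Analyzer.class));
                }
            } catch (ClassNotFoundException e) {
                // Not found under this name; fall through to the next candidate.
            }
        }
        return Optional.empty();
    }

    /** Instantiates the resolved class through its no-argument constructor. */
    static Analyzer create(String prefix, String name) throws ReflectiveOperationException {
        Class<? extends Analyzer> clazz = resolve(prefix, name)
            .orElseThrow(() -> new ClassNotFoundException("No analyzer found for: " + name));
        return clazz.getDeclaredConstructor().newInstance();
    }
}
```

With this, both "Xyz" and "XyzAnalyzer" resolve to the same class, so no hardcoded name-to-class map is needed. In the real task the prefix would presumably be a package name such as the analysis package; here it is the outer class name plus "$" because the stand-in is a nested class.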
The lucli code has 'StandardAnalyzer' as a default. I think it's best not to have a default analyzer at all. I would have fixed this already, but this contrib module has no tests, which makes it hard to want to get in there and clean up.
The misc code mostly supplies an Analyzer inside embedded tools that don't actually analyze anything. We could add a pkg-private NullAnalyzer that throws UOE from its tokenStream(). Since these tools shouldn't be analyzing anything, that seems reasonable to do?
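Roughly what I have in mind for the NullAnalyzer, as a sketch only. The Analyzer and TokenStream types below are hypothetical stand-ins so this compiles on its own; the real class would extend Lucene's Analyzer and override whatever tokenStream signature that version declares.

```java
import java.io.Reader;

/** Sketch of an analyzer for tools that must never analyze text. */
final class NullAnalyzerSketch {

    /** Hypothetical stand-in for a token stream type; this is not Lucene's TokenStream. */
    interface TokenStream {}

    /** Hypothetical stand-in for the analyzer base class; this is not Lucene's Analyzer. */
    abstract static class Analyzer {
        public abstract TokenStream tokenStream(String fieldName, Reader reader);
    }

    /** Package-private: fails loudly if anything ever asks it to analyze. */
    static final class NullAnalyzer extends Analyzer {
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            throw new UnsupportedOperationException(
                "NullAnalyzer cannot analyze field: " + fieldName);
        }
    }

    private NullAnalyzerSketch() {}
}
```

The point of throwing rather than returning an empty stream is that a tool which unexpectedly starts analyzing gets an immediate failure instead of silently producing no tokens.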
The spellchecker code has a hardcoded WhitespaceAnalyzer... why is this? Seems like the whole spellchecking n-gramming is wrong anyway. Spellchecker uses a special form of n-gramming that depends upon the word length. Currently it does this in Java code and indexes with WhitespaceAnalyzer (creating a lot of garbage in the process, e.g. lots of Field objects), but it seems this could all be cleaned up so that the spellchecker uses its own SpellCheckNgramAnalyzer, for better performance to boot.
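To illustrate what "n-gramming that depends upon the word length" means, here is a minimal sketch. The length thresholds in minGram()/maxGram() are assumptions chosen for illustration, and the class and method names are hypothetical; a SpellCheckNgramAnalyzer would emit these grams as tokens instead of building a list.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: n-gram sizes chosen from the length of the word being indexed. */
final class LengthDependentNgrams {

    private LengthDependentNgrams() {}

    /** Smallest gram size for a word of this length (thresholds are assumed). */
    static int minGram(int wordLength) {
        if (wordLength > 5) return 3;
        if (wordLength == 5) return 2;
        return 1;
    }

    /** Largest gram size for a word of this length (thresholds are assumed). */
    static int maxGram(int wordLength) {
        if (wordLength > 5) return 4;
        if (wordLength == 5) return 3;
        return 2;
    }

    /** Returns every n-gram of the word, for each size from minGram to maxGram. */
    static List<String> ngrams(String word) {
        List<String> grams = new ArrayList<>();
        int length = word.length();
        for (int n = minGram(length); n <= maxGram(length); n++) {
            for (int start = 0; start + n <= length; start++) {
                grams.add(word.substring(start, start + n));
            }
        }
        return grams;
    }
}
```

Under these assumed thresholds a short word like "cat" gets 1-grams and 2-grams, while a longer word like "lucene" gets only 3-grams and 4-grams.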
The swing code defaults to a WhitespaceAnalyzer... in my opinion, again, it's best not to have a default analyzer and to make the user somehow specify one.
The wordnet code uses StandardAnalyzer for indexing the WordNet database. It also includes a very limited SynonymTokenFilter. In my opinion, now that we have merged the SynonymFilter from Solr, which supports multi-word synonyms etc. (which this wordnet module DOES NOT!), we should nuke this whole thing.
Instead, we should make the synonym-loading process more flexible, so that one can produce the SynonymMap from various formats (such as the existing Solr format, a relational database, WordNet's format, or the OpenOffice thesaurus format, among others). We could have parsers for these various formats. This would give us a much more powerful synonym capability that works nicely regardless of format. We could then look at other improvements, such as allowing SynonymFilter to use a more RAM-conscious data structure for its synonym mappings (e.g. an FST), and everyone would see the benefits.
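A sketch of the parser-per-format idea, under heavy assumptions. SynonymParser and SolrStyleParser are hypothetical names, and a plain Map stands in for SynonymMap so the example is self-contained. The parser handles only the two basic Solr-style line forms (comma-separated equivalents, and "=>" explicit mappings) and ignores escaping, the expand option, and multi-word tokenization.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: one parser interface, one implementation per synonym file format. */
final class SynonymParsers {

    /** Hypothetical format-neutral parser; the Map stands in for a SynonymMap. */
    interface SynonymParser {
        Map<String, List<String>> parse(Reader in) throws IOException;
    }

    /** Handles two Solr-style line forms: "a, b, c" and "a, b => c, d". */
    static final class SolrStyleParser implements SynonymParser {
        @Override
        public Map<String, List<String>> parse(Reader in) throws IOException {
            Map<String, List<String>> map = new LinkedHashMap<>();
            BufferedReader reader = new BufferedReader(in);
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("#")) {
                    continue; // skip blank lines and comments
                }
                int arrow = line.indexOf("=>");
                List<String> inputs;
                List<String> outputs;
                if (arrow >= 0) {
                    // Explicit mapping: left-hand terms map to right-hand terms.
                    inputs = split(line.substring(0, arrow));
                    outputs = split(line.substring(arrow + 2));
                } else {
                    // Equivalent terms: each term maps to the whole group.
                    inputs = split(line);
                    outputs = inputs;
                }
                for (String input : inputs) {
                    map.computeIfAbsent(input, k -> new ArrayList<>()).addAll(outputs);
                }
            }
            return map;
        }

        /** Splits a comma-separated term list, dropping empty entries. */
        private static List<String> split(String terms) {
            List<String> result = new ArrayList<>();
            for (String term : terms.split(",")) {
                String trimmed = term.trim();
                if (!trimmed.isEmpty()) {
                    result.add(trimmed);
                }
            }
            return result;
        }
    }

    private SynonymParsers() {}
}
```

A WordNet or OpenOffice thesaurus parser would be another implementation of the same interface, so the filter and its underlying data structure never need to know which format the synonyms came from.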
So hopefully this entire contrib could be deprecated.