Some of the analyzers allow for null to be specified for the stop word list. Others require an empty set/file/reader. Those deriving from StopawareAnalyzer allow null.
That is true: StopawareAnalyzer uses an empty set if you pass null.
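A minimal sketch of the null handling described above, not the code from the patch. The class `StopSetHolder` and its members are hypothetical names used only for illustration, and no Lucene types are involved.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a base class that owns the stop set.
class StopSetHolder {
    private final Set<String> stopwords;

    // A null argument is treated as "no stop words" instead of an error.
    StopSetHolder(Set<String> stopwords) {
        this.stopwords = (stopwords == null)
                ? Collections.<String>emptySet()
                : Collections.unmodifiableSet(new HashSet<String>(stopwords));
    }

    Set<String> getStopwordSet() {
        return stopwords;
    }

    boolean isStopword(String term) {
        return stopwords.contains(term);
    }
}
```

With this shape, callers never need to build an empty set, file, or reader just to say "no stop words".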
I'd like to see the ability to use null to follow through the rest of the analyzers.
Some of the analyzers are cluttered with stopword list processing.
The analyzers in this patch are a proof of concept rather than a complete list. Eventually all analyzers that use stopwords will extend StopawareAnalyzer, which is the reason this class exists. This and some other issues aim at a consistent way of processing everything related to stopwords. We will also remove all the setters and keep only Set<?> constructors for consistency.
If not, how about adding public static Set<?> getDefaultStopSet() to StopawareAnalyzer?
The problem is that it is static, and it should be static. That's why we define it in each analyzer that uses stopwords. I would like to have it generalized, but this seems to be the ideal solution. We could have something like getDefaultStopSet(Class<? extends StopawareAnalyzer>), but I like the expressiveness of getDefaultStopSet() much better.
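As a rough illustration of why the method ends up in each analyzer: a static method cannot be overridden, so a single getDefaultStopSet() on the base class could not return a language-specific set. The sketch below assumes a hypothetical `ExampleGermanAnalyzer` with a made-up four-word list; it is not one of the analyzers in the patch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical analyzer; only the stop-set plumbing is shown.
class ExampleGermanAnalyzer {
    // Initialization-on-demand holder: the set is built once, on first access.
    private static final class DefaultSetHolder {
        static final Set<String> DEFAULT_STOP_SET = Collections.unmodifiableSet(
                new HashSet<String>(Arrays.asList("der", "die", "das", "und")));
    }

    // Static, so callers can read the defaults without creating an analyzer.
    static Set<String> getDefaultStopSet() {
        return DefaultSetHolder.DEFAULT_STOP_SET;
    }

    private final Set<String> stopwords;

    // The no-arg constructor falls back to this analyzer's own defaults.
    ExampleGermanAnalyzer() {
        this(getDefaultStopSet());
    }

    // Set-only constructor; null means "no stop words".
    ExampleGermanAnalyzer(Set<String> stopwords) {
        this.stopwords = (stopwords == null)
                ? Collections.<String>emptySet()
                : stopwords;
    }

    Set<String> getStopwordSet() {
        return stopwords;
    }
}
```

A caller would write `ExampleGermanAnalyzer.getDefaultStopSet()`, which reads more directly than passing a class token to a shared lookup method.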
How about splitting out the stop words to their own class?
What do you mean by that? Can you elaborate?
There are some TODOs in the code to make this or that private or final. If this is going to wait for 3.1, shouldn't they change?
They should actually go away, but I kept them in there because they are somewhat unrelated to this particular issue. Once this is in, we will work on removing the deprecated stuff and making analyzers final (at least in contrib).
In WordListLoader the return types are not Set or Map, but HashSet and HashMap. What's up with that? Should anyone care what the particular implementation is?
That is one thing I hate about WordListLoader. +1 towards Uwe working on them!
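To make the point about return types concrete, here is a small sketch of a loader that exposes only the interface type. `ExampleWordListLoader` is a hypothetical class written for this illustration, and the one-word-per-line format is an assumption, not a description of the real WordListLoader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashSet;
import java.util.Set;

// Hypothetical loader; assumes one word per line.
class ExampleWordListLoader {
    // Declared as Set, so the HashSet stays an implementation detail
    // that can change later without breaking callers.
    static Set<String> getWordSet(Reader reader) throws IOException {
        Set<String> result = new HashSet<String>();
        BufferedReader br = new BufferedReader(reader);
        String line;
        while ((line = br.readLine()) != null) {
            String word = line.trim();
            if (!word.isEmpty()) {
                result.add(word); // blank lines are skipped
            }
        }
        return result;
    }
}
```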
I'm trying to figure out a way to specify a tokenizer/filter chain. (I've been trying to figure it out for a while, but not with much effort or success).
This has been discussed already, without much success. I cannot remember the issue (Robert, can you remember the factory issue?), but it was basically based on a factory pattern. This would also be my approach. That way we could get rid of almost every analyzer. I use such a pattern myself, and it works quite well.
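One way such a factory-based chain could look, purely as a sketch: the names `ExampleFilterFactory`, `LowerCaseFactory`, `StopFactory` and `ChainAnalyzer` are hypothetical, tokens are plain strings split on whitespace, and none of this is the design from the earlier issue or any Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for a filter stage; not a Lucene type.
interface ExampleFilterFactory {
    List<String> apply(List<String> tokens);
}

class LowerCaseFactory implements ExampleFilterFactory {
    public List<String> apply(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }
}

class StopFactory implements ExampleFilterFactory {
    private final Set<String> stopwords;

    StopFactory(Set<String> stopwords) {
        this.stopwords = stopwords;
    }

    public List<String> apply(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        for (String t : tokens) {
            if (!stopwords.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }
}

// One generic analyzer configured by a list of stages,
// instead of one analyzer class per combination.
class ChainAnalyzer {
    private final List<ExampleFilterFactory> chain;

    ChainAnalyzer(ExampleFilterFactory... chain) {
        this.chain = Arrays.asList(chain);
    }

    List<String> analyze(String text) {
        // Whitespace split stands in for a real tokenizer.
        List<String> tokens = new ArrayList<String>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        // Stages run in the order they were given.
        for (ExampleFilterFactory f : chain) {
            tokens = f.apply(tokens);
        }
        return tokens;
    }
}
```

The order of stages matters here: lowercasing before stop filtering lets a lowercase stop set match "The" as well as "the".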
DM, I think we can have both? A method to get the default stopword list, but then they also happen to be in text files too?
+1 for having those words in files. Nevertheless, we will still have a default stopword list.