Where should the configuration of the highlighter we use for clustering come from?
We already have all the code hooked in for it; we're just ignoring the output.
To avoid confusion and questions along the lines of "why don't the clusters match the (highlighted) documents I'm seeing?", I'd suggest a slightly more elaborate scenario for the clustering highlighter configuration:
1. If main Solr highlighting is disabled, use the clustering component's highlighter settings.
2. If main Solr highlighting is enabled, use the main highlighter's configuration as the defaults and let the clustering-specific highlighter configuration override the defaults.
If we do it this way, we'll minimize the chance of users accidentally clustering text that is highlighted differently from the text they actually see in the results.
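To make the two cases concrete, here is a minimal sketch of the proposed resolution order. It is only an illustration, not the component's actual code: the class name, the method name and the plain-map representation of the settings are all hypothetical, and the keys "fragsize" and "snippets" are stand-ins for whatever highlighter parameters end up being shared.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Solr or Carrot2.
public class ClusteringHighlightConfig {

    /**
     * Computes the highlighter settings used for clustering.
     * Case 1: main highlighting disabled, so only the clustering settings apply.
     * Case 2: main highlighting enabled, so the main settings act as defaults
     *         and the clustering-specific settings override them.
     */
    static Map<String, String> resolve(boolean mainHighlightingEnabled,
                                       Map<String, String> mainParams,
                                       Map<String, String> clusteringParams) {
        Map<String, String> effective = new HashMap<>();
        if (mainHighlightingEnabled) {
            // Start from the main highlighter's configuration (the defaults).
            effective.putAll(mainParams);
        }
        // Clustering-specific values always win over the defaults.
        effective.putAll(clusteringParams);
        return effective;
    }

    public static void main(String[] args) {
        Map<String, String> main = new HashMap<>();
        main.put("fragsize", "100");   // assumed key, for illustration only
        main.put("snippets", "3");     // assumed key, for illustration only

        Map<String, String> clustering = new HashMap<>();
        clustering.put("fragsize", "200");

        // Main highlighting on: fragsize is overridden, snippets is inherited.
        System.out.println(resolve(true, main, clustering));
        // Main highlighting off: only the clustering settings are used.
        System.out.println(resolve(false, main, clustering));
    }
}
```

With this ordering, a user who configures only the main highlighter gets clustering on the same fragments they see, and the clustering-specific settings are needed only when the two should deliberately differ.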
It would be great if Carrot2 could also just use the analysis that Lucene/Solr produces; that way it would be much easier to configure stopwords, HTML stripping, etc.
This one would require some larger changes to Carrot2 internals. We do use Lucene infrastructure for preprocessing (currently for tokenization), and I can investigate whether we can extend that further. One potential problem is that the set of stopwords you use for document retrieval often does not work equally well for clustering. I've filed a Carrot2-specific issue for it and will try to come up with something.