SOLR-2450

Carrot2 clustering should use both its own and Solr's stop words

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.2, 4.0-ALPHA
    • Component/s: contrib - Clustering
    • Labels:
      None

      Description

      While using only Solr's stop words for clustering isn't a good idea (compared to indexing, clustering needs more aggressive stop word removal to get reasonable cluster labels), it would be good if Carrot2 used both its own and Solr's stop words.

      I'm not sure what the best way to implement this would be, though. My first thought was to simply load stopwords.txt from the Solr config dir and merge its contents with Carrot2's stop words. But then, maybe a better approach would be to get the stop words from the StopFilter being used? Ideally, we should also consider the per-field stop filters configured on the fields used for clustering.
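A minimal sketch of the first idea above (merging the words from a stop word file with the clustering engine's own list), assuming the file holds one word per line and treats lines starting with `#` as comments. `StopWordMerger` and its methods are hypothetical names, not Solr or Carrot2 API, and lower-casing every word is an illustrative choice rather than something the issue specifies:

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration; not part of Solr or Carrot2. */
final class StopWordMerger {
    private StopWordMerger() {}

    /** Parses stop word file lines: trims each line, skips blanks and '#' comments. */
    static Set<String> parse(List<String> lines) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : lines) {
            String word = line.trim();
            if (word.isEmpty() || word.startsWith("#")) {
                continue; // assumption: '#' starts a comment line
            }
            words.add(word.toLowerCase(Locale.ROOT));
        }
        return words;
    }

    /** Returns the union of both lists; neither input collection is modified. */
    static Set<String> merge(Collection<String> builtIn, Collection<String> fromSolr) {
        Set<String> merged = new LinkedHashSet<>();
        for (String word : builtIn) {
            merged.add(word.toLowerCase(Locale.ROOT));
        }
        for (String word : fromSolr) {
            merged.add(word.toLowerCase(Locale.ROOT));
        }
        return merged;
    }
}
```

Keeping the result as a union means the clustering side never loses its own, more aggressive list, which is the concern raised in the description.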

      1. SOLR-2450.patch
        14 kB
        Stanislaw Osinski

          Activity

          Hoss Man added a comment -

          People can name their stopwords file anything they want, and that's just with the default StopFilterFactory; it doesn't even account for the possibility of other filter factories that implement similar functionality.

          One thing you could probably do, assuming you only want to worry about the stock StopFilterFactory, is query the IndexSchema for the analyzers of the fieldTypes you are interested in (presumably via some configured list of field names). Then test whether each analyzer is analysis-chain based; if it is, check whether the chain contains a StopFilterFactory; and if it does, THEN you can get the list of words (or at the very least, the file the words came from).

          AnalysisRequestHandlerBase should have an example of walking an analysis chain to see what factories are in it.
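The walk described above could look roughly like the following sketch. To stay self-contained it does not use Solr classes: `FieldAnalyzer`, `ChainAnalyzer`, `FilterFactory` and `StopWordSource` are hypothetical stand-ins for the schema's analyzers, a chain-based analyzer, a filter factory, and a factory that can report its stop words. Real code would go through IndexSchema and the actual factory classes instead.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of walking per-field analysis chains; not Solr API. */
final class FieldStopWords {
    private FieldStopWords() {}

    /** Stand-in for whatever analyzer a field type is configured with. */
    interface FieldAnalyzer {}

    /** Stand-in for a filter factory in an analysis chain. */
    interface FilterFactory {}

    /** Stand-in for a factory that can report the stop words it was configured with. */
    interface StopWordSource extends FilterFactory {
        Set<String> stopWords();
    }

    /** Stand-in for a chain-based analyzer that exposes its filter factories. */
    static final class ChainAnalyzer implements FieldAnalyzer {
        final List<FilterFactory> filters;

        ChainAnalyzer(List<FilterFactory> filters) {
            this.filters = filters;
        }
    }

    /** Collects stop words from every stop-word-aware factory on the given fields. */
    static Set<String> collect(Map<String, FieldAnalyzer> schema, List<String> fieldNames) {
        Set<String> result = new LinkedHashSet<>();
        for (String field : fieldNames) {
            FieldAnalyzer analyzer = schema.get(field);
            if (!(analyzer instanceof ChainAnalyzer)) {
                continue; // unknown field, or an analyzer that is not chain based
            }
            for (FilterFactory factory : ((ChainAnalyzer) analyzer).filters) {
                if (factory instanceof StopWordSource) {
                    result.addAll(((StopWordSource) factory).stopWords());
                }
            }
        }
        return result;
    }
}
```

Fields whose analyzers are not chain based are silently skipped here; whether to skip, warn, or fail in that case is a design choice the comment leaves open.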

          Robert Muir added a comment -

          Just to expand on hossman's point, there are a variety of ways someone could be setting up stop words:

          • With StopFilterFactory.
          • By configuring their analyzer with <analyzer class=....>, where the Analyzer uses a stop word list internally. In this case, if it's a supplied Lucene analyzer, you can check if (analyzer instanceof StopwordAnalyzerBase) ... and then invoke getStopwordSet() on the analyzer. It's true, though, that someone could write a custom analyzer that uses stop words but extends Analyzer directly.
          • By using stop-word-like components such as CommonGramsFilter, which still have the concept of stop words but work differently.
          • By using a custom filter/analyzer of their own that acts like StopFilter.
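The analyzer-level check from the second bullet can be sketched as follows. `FieldAnalyzer` and `StopwordAware` are hypothetical stand-ins (the latter loosely modelled on the role StopwordAnalyzerBase plays in the comment) so that the example stays self-contained; the point it illustrates is that a custom analyzer gives no answer at all, which is different from an empty stop word list.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Optional;
import java.util.Set;

/** Hypothetical sketch of an analyzer-level stop word lookup; not Lucene API. */
final class StopWordLookup {
    private StopWordLookup() {}

    /** Stand-in for any configured analyzer. */
    interface FieldAnalyzer {}

    /** Stand-in for an analyzer that exposes its internal stop word set. */
    interface StopwordAware extends FieldAnalyzer {
        Set<String> getStopwordSet();
    }

    /**
     * Returns the analyzer's stop words if it can report them, or an empty
     * Optional when the analyzer is opaque (for example a custom analyzer
     * that uses stop words internally but does not expose them).
     */
    static Optional<Set<String>> stopWordsOf(FieldAnalyzer analyzer) {
        if (analyzer instanceof StopwordAware) {
            Set<String> copy = new LinkedHashSet<>(((StopwordAware) analyzer).getStopwordSet());
            return Optional.of(Collections.unmodifiableSet(copy));
        }
        return Optional.empty();
    }
}
```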
          Stanislaw Osinski added a comment -

          Patch that uses stop words from the field's StopFilterFactory and CommonGramsFilterFactory in addition to Carrot2's built-in stop words.

          Requires the SOLR-2448 and SOLR-2449 patches to be applied.

          Stanislaw Osinski added a comment -

          Committed to trunk and branch_3x.

          Robert Muir added a comment -

          Bulk close for 3.2


            People

            • Assignee: Stanislaw Osinski
            • Reporter: Stanislaw Osinski