Lucene - Core / LUCENE-5211

StopFilterFactory docs do not advertise/explain the "format" option

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.2
    • Fix Version/s: 4.6, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      StopFilterFactory supports a "format" option that controls whether "getWordSet" or "getSnowballWordSet" is used to parse the file, but this option is not advertised. People can be confused by the example stopword files included in the releases (some of which are in the snowball format with "|" comments): they try to use them without explicitly specifying format="snowball" and silently get useless stopwords that include the "| comments" as literal portions of the stopwords.

      We need to better document the use of "format" and consider updating all of the example stopword files we ship that are in the snowball format with a note about the need to use format="snowball" with those files.

      Initial Bug Report

      The StopFilterFactory builds a CharArraySet directly from the raw lines of the supplied words file. This causes a problem when using the stop word files supplied with the Solr/Lucene distribution. In particular, the comments in those files get added to the CharArraySet. A line like this...

      ceci | this

      Should result in the string "ceci" being added to the CharArraySet, but "ceci | this" is what actually gets added.
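
      The difference between the two parsing styles can be illustrated with a small, self-contained sketch. This is not Lucene's actual implementation: the class and method names below (StopwordFormatSketch, parsePlain, parseSnowball) are hypothetical, and the rules are simplified to "#" whole-line comments for the default format and "|" trailing comments for the snowball format.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the code used by StopFilterFactory.
public class StopwordFormatSketch {

    // Default-style parsing (simplified): skip blank lines and lines that
    // start with "#", and keep every other line verbatim after trimming.
    public static Set<String> parsePlain(List<String> lines) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            words.add(trimmed);
        }
        return words;
    }

    // Snowball-style parsing (simplified): everything from the first "|"
    // onward is a comment; the remaining whitespace-separated tokens are words.
    public static Set<String> parseSnowball(List<String> lines) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : lines) {
            int bar = line.indexOf('|');
            String content = (bar >= 0 ? line.substring(0, bar) : line).trim();
            if (content.isEmpty()) {
                continue;
            }
            words.addAll(Arrays.asList(content.split("\\s+")));
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(" | French stop words", "ceci | this", "cela | that");
        System.out.println(parsePlain(lines));    // [| French stop words, ceci | this, cela | that]
        System.out.println(parseSnowball(lines)); // [ceci, cela]
    }
}
```

      Under these assumptions, reading a snowball-format file with the default rules yields entries such as "ceci | this", which never match a real token, so the stop filter silently does nothing.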

      Workaround: Remove all comments from stop word files you are using.

      Suggested fix: The StopFilterFactory should strip any comments, then strip trailing whitespace. The stop word files supplied with the distribution should be edited to conform to the supported comment format.

        Activity

        Hoss Man added a comment -

        The StopFilterFactory supports two different "formats" of stop word files. The default format, supported since day #1, allows comments using "#"; more recently, support was added for the "snowball" stopword format, which is what is used in the stopwords_fr.txt file you seem to be referring to.

        The example usage of stopwords_fr.txt in Solr explicitly configures the StopFilterFactory so that it knows the file is in the "snowball" format...

        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fr.txt" format="snowball" />
        

        So there doesn't seem to be any functionality bug here, just a documentation issue: when support was added for the "snowball" format, it appears that nothing was added to the class javadocs of the factory to make this clear.

        If no one beats me to it, I'll clean this up next week.

        Hayden Muhl added a comment -

        Ah, very good. I was a bit shocked when my French stop words weren't working. This seemed like too big of a functionality bug to be easily missed.

        Hoss Man added a comment -

        Two patches to make it easier to review...

        • patch that improves the StopFilterFactory javadocs to mention format, and improves the error handling of the format param (includes tests)
        • patch that updates all the snowball-formatted files with a comment pointing out the need to use format="snowball" with those files.
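
        The patch itself is not reproduced in this thread. As a rough idea of what stricter handling of the format param could look like, here is a hypothetical sketch; the class, enum, and method names are invented for illustration, and the name "wordset" for the default format is an assumption not stated in this issue.

```java
// Hypothetical sketch of strict validation for a "format" argument;
// not the code from the actual patch.
public class StopFormatCheckSketch {

    public enum Format { WORDSET, SNOWBALL }

    // Returns the parsed format, falling back to the default when the
    // argument is absent, and rejecting unknown values outright instead
    // of silently treating them as the default.
    public static Format parseFormat(String value) {
        if (value == null) {
            return Format.WORDSET; // assumed default format name: "wordset"
        }
        switch (value) {
            case "wordset":
                return Format.WORDSET;
            case "snowball":
                return Format.SNOWBALL;
            default:
                throw new IllegalArgumentException("Unknown 'format' specified: " + value);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseFormat("snowball"));
        System.out.println(parseFormat(null));
    }
}
```

        Failing fast on a misspelled value such as "snoball" surfaces the configuration mistake at startup rather than leaving a stop filter that quietly matches nothing.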

        FWIW: the second patch was generated by the following perl script...

        #!/usr/bin/perl -i -n
        
        my $msg = q{NOTE: To use this file with StopFilterFactory, you must specify format="snowball"};
        print $_;
        if (m/This notice was added./) {
            print " |\n | $msg\n";
        }
        

        Run as...
        find . -name '*.txt' | xargs grep -l "This notice was added" | xargs ~/tmp/lucene5211.note.in.snowballfiles.pl

        ASF subversion and git services added a comment -

        Commit 1524809 from hossman@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1524809 ]

        LUCENE-5211: Better javadocs and error checking of 'format' option in StopFilterFactory, as well as comments in all snowball formatted files about specifying format option

        ASF subversion and git services added a comment -

        Commit 1524848 from hossman@apache.org in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1524848 ]

        LUCENE-5211: Better javadocs and error checking of 'format' option in StopFilterFactory, as well as comments in all snowball formatted files about specifying format option (merge r1524809)


          People

          • Assignee: Hoss Man
          • Reporter: Hayden Muhl
          • Votes: 0
          • Watchers: 2
