Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.1
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: Schema and Analysis
    • Labels: None

      Description

      Currently Solr makes it easy to use English stopwords for StopFilter or CommonGramsFilter.
      Recently in Lucene, we added stopword lists (mostly, but not all, from Snowball) to all the language analyzers.

      So it would be nice if a user could easily specify that they want to use a French stopword list, and use it for StopFilter or CommonGrams.

      The ones from Snowball, however, are formatted differently than the others (although in Lucene we have parsers to deal with this).
      Additionally, we abstract this from Lucene users by adding a static getDefaultStopSet method to all analyzers.

      There are two approaches; I think I prefer the first, but I'm not sure it matters as long as we have good examples (maybe a foreign-language example schema?).

      1. The user would specify something like:

      <filter class="solr.StopFilterFactory" fromAnalyzer="org.apache.lucene.analysis.FrenchAnalyzer" .../>
      This would just grab the CharArraySet from the FrenchAnalyzer's getDefaultStopSet method; who cares where it comes from or how it's loaded.
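
      To illustrate approach 1: the factory could resolve the class named by fromAnalyzer and invoke that static method reflectively. A minimal sketch, assuming only that the named analyzer exposes a no-arg static getDefaultStopSet (the fromAnalyzer attribute itself is the proposal here, not an existing API):

      import java.lang.reflect.Method;
      import java.util.Set;

      public class FromAnalyzerSketch {
          // Resolve the analyzer class by name and call its static getDefaultStopSet().
          // The concrete return type in Lucene is a CharArraySet; Set<?> keeps the
          // sketch free of version-specific imports.
          public static Set<?> loadDefaultStopSet(String analyzerClassName) throws Exception {
              Class<?> clazz = Class.forName(analyzerClassName);
              Method m = clazz.getMethod("getDefaultStopSet"); // static, takes no arguments
              return (Set<?>) m.invoke(null);                  // null receiver because the method is static
          }
      }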

      2. We add support for snowball-formatted stopword lists, and the user could specify something like:

      <filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/snowball/french_stop.txt" format="snowball" ... />
      The disadvantage to this is they have to know where the list is, what format it's in, etc. For example: Snowball doesn't provide Romanian or Turkish
      stopword lists to go along with their stemmers, so we had to add our own.
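
      For reference, the snowball lists differ from Solr's plain one-word-per-line files in two ways: a vertical bar begins a comment, and one line may hold several whitespace-separated words. A fragment in that style (the entries shown are illustrative):

                 | A French stop word list in snowball format.
      au         | a + le
      aux        | a + les
      avec       | with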

      Let me know what you guys think, and I will create a patch.

      Attachments

      1. SOLR-1860.patch (9 kB, Robert Muir)
      2. SOLR-1860.patch (6 kB, Robert Muir)

        Activity

        Robert Muir added a comment -

        A third idea from Hoss Man:

        We should make it easy to edit these lists, like the English one.
        So an idea is to create an intl/ folder or similar under the example, with stopwords_fr.txt, stopwords_de.txt, etc.
        Additionally we could have a schema-intl.xml with example types 'text_fr', 'text_de', etc. set up for various languages.
        I like this idea best.
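
        A rough sketch of what one such type in schema-intl.xml could look like; the exact filter chain and the intl/ file path are illustrative, not a committed example:

        <fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
          <analyzer>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="intl/stopwords_fr.txt"/>
            <filter class="solr.SnowballPorterFilterFactory" language="French"/>
          </analyzer>
        </fieldType>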

        Yonik Seeley added a comment -

        How many languages are we talking?

        I like the idea of an export - it's transparent and neatly handles back compat concerns.
        To avoid clutter, putting them all in a separate directory seems like a good idea:
        /conf/stopwords/stopwords_en.txt
        /conf/stopwords/stopwords_fr.txt

        Or will there be other per-language files? If so, maybe
        /conf/lang/stopwords_en.txt
        /conf/lang/protected_en.txt
        /conf/lang/synonyms_en.txt

        As far as file format: I think we should also support the snowball stopword format.

        Not sure at this point if it makes more sense trying to put a text_fr, etc., in the normal schema.xml or in a separate schema_intl.xml. Partly depends on the number of text_<lang> types and resource usage, I guess... need to consider things like core load time, etc.
        We may want to think about lazy-loaded analyzers (but that could be another ball of wax since misconfigurations don't immediately fail).

        Hoss Man added a comment -

        I like the idea of an export - it's transparent and neatly handles back compat concerns.

        That's the same conclusion Robert and I came to on IRC ... being able to load directly sounds less redundant, but as soon as a user wants to customize (and let's face it: stop words can easily be domain-specific) we need a way of exporting that's convenient even for novice users who don't know anything about jars and wars.

        Not sure at this point if it makes more sense trying to put a text_fr, etc., in the normal schema.xml or in a separate schema_intl.xml.

        The idea Robert pitched on IRC was to create a new example solr-instance directory with a barebones solrconfig.xml file, and a schema.xml file that only demonstrates fields using various tricks for various languages. All the language-specific stopword files would then live in this new instancedir. The idea being that people interested in non-English fields could find a "recommended" fieldtype declaration in this schema.xml file and cut/paste it into their own schema.xml (probably copied from the main example).

        The key here is that we don't want an entire clone of the example (all the numeric fields, the multiple request handler declarations, etc.); this will just show the syntax for declaring all the various languages that we can provide suggestions for.

        As far as file format: I think we should also support the snowball stopword format.

        Agreed, but it's a trivially minor chicken/egg choice. Either we can set up a simple export and conversion to the format Solr currently supports, and if/when someone updates StopFilterFactory to support the new format, we can stop converting when we export; or we can modify StopFilter to support both formats first, and then set up the simple export without worrying about conversion.

        Frankly: If Robert's planning on doing the work either way, I'm happy to let him decide which approach makes the most sense.

        Robert Muir added a comment -

        Either we can set up a simple export and conversion to the format Solr currently supports, and if/when someone updates StopFilterFactory to support the new format, we can stop converting when we export

        Well, this isn't that big of a deal either way.

        In Lucene we have a helper class called WordListLoader that supports loading this format from an InputStream.

        One idea to consider: we could try merging some of what SolrResourceLoader does with this WordListLoader; then it's all tested and in one place.
        It appears there might be some duplication of effort here... e.g., how long until a Lucene user complains about UTF-8 BOM markers in their stoplists?

        We can still use Ant to keep the files in sync automatically from the Lucene copies.
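
        As a minimal illustration of what loading that format involves (this is not the actual WordListLoader code, just the parsing rules restated: '|' opens a comment and a line may hold several words):

        import java.io.BufferedReader;
        import java.io.IOException;
        import java.io.Reader;
        import java.util.HashSet;
        import java.util.Set;

        public class SnowballStopParserSketch {
            public static Set<String> parse(Reader input) throws IOException {
                Set<String> words = new HashSet<String>();
                BufferedReader reader = new BufferedReader(input);
                String line;
                while ((line = reader.readLine()) != null) {
                    int bar = line.indexOf('|');
                    if (bar >= 0) line = line.substring(0, bar); // strip trailing comment
                    for (String word : line.trim().split("\\s+")) {
                        if (word.length() > 0) words.add(word);  // a line may hold several words
                    }
                }
                return words;
            }
        }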

        Robert Muir added a comment -

        I'd still like to fix all this duplication between WordListLoader etc., but for now I will add the snowball stopword support and introduce examples that use the stopwords embedded in the jar files.

        And as discussed on SOLR-2015, if we are going to lay down traps for other languages, like autogenerating phrase queries, then these should be in the main schema.xml, not tucked away.

        Robert Muir added a comment -

        Here is a first step: two of the analyzers (Brazilian, Czech) use embedded stopword sets.
        I think this was an oversight; this patch moves them to .txt files like the rest.

        Robert Muir added a comment -

        Committed this as rev 986612 (and 3.x rev 986615).

        Lance Norskog added a comment -

        This is a nice piece of work. One thing I've learned is that configurations should be as flat and transparent as possible. Pushing all of these word lists out of the classes and into files is a great improvement. The Greek Analyzer, for example, is (was) nothing but a default list of stopwords.

        But having the stopwords as text files runs smack into character-encoding wackiness (why, yes, I do use Windows). Can the file format or importer at least support the XML or URL notations for Unicode characters? Maybe a list of words that includes protégé for protege?

        Robert Muir added a comment -

        The Greek Analyzer, for example, is (was) nothing but a default list of stopwords.

        This is no longer true; there is a stemmer, too.

        But having the stopwords as text files runs smack into character-encoding wackiness (why, yes, I do use Windows).

        What wackiness? The files are all Unicode UTF-8, which Windows supports too.

        Can the file format or importer at least support the XML or URL notations for Unicode characters?

        Only if we escaped ALL the English strings in all the files too. But I prefer things to be readable.

        Lance Norskog added a comment -

        What wackiness? The files are all Unicode UTF-8, which Windows supports too.

        'Supports' does not mean 'you can get it done without a pounding headache'. UTF-8 is not the default and you cannot make it the default. I'm guessing some Linux editors don't understand the funky binary starting bytes that mark a UTF-8 file. Having UTF-8 characters in the Java source blows up also. An XML file format would go a long way toward usability.

        Uwe Schindler added a comment -

        If it's documented to be UTF-8, it's clear what you have to provide (in Solr). If you use Lucene directly, the stopword file parser does not care about encodings at all; it simply takes a java.io.Reader.
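
        In other words, with Lucene the caller picks the encoding when constructing the Reader; something like this (a minimal illustration, not Lucene's own code):

        import java.io.FileInputStream;
        import java.io.InputStreamReader;
        import java.io.Reader;

        public class Utf8StopwordsReader {
            public static Reader open(String path) throws Exception {
                // The charset is chosen here, explicitly; the stopword parser
                // downstream only ever sees characters.
                return new InputStreamReader(new FileInputStream(path), "UTF-8");
            }
        }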

        Robert Muir added a comment -

        Lance, I don't know what your OS problems are, but the whole reason UTF-8 exists is so that things like these files can be viewable/editable in their own languages and not encoded in hex.

        So I don't plan on making life cryptic for people who use languages other than English because you are scared of UTF-8 or don't know how to configure your computer.

        Robert Muir added a comment -

        Now that Simon has cleaned up WordListLoader, this is easy.

        Attached is a patch to support the snowball format (format="snowball") in StopFilterFactory and the common-grams factories.

        Along with something like the Ant task in SOLR-3097, we should be able to move forward with having some default configurations for other languages out of the box.
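
        With the patch applied, the description's second option becomes plain configuration; a declaration along these lines should work (the ignoreCase attribute is just the usual StopFilterFactory option, added here for completeness):

        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="org/apache/lucene/analysis/snowball/french_stop.txt"
                format="snowball"/>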

        Robert Muir added a comment -

        I committed this.

        I'll open up a new issue (related to SOLR-3097) to provide setups for other languages.


          People

          • Assignee: Robert Muir
          • Reporter: Robert Muir
          • Votes: 0
          • Watchers: 1
