Solr / SOLR-4565

Extend NorwegianMinimalStemFilter to handle "nynorsk"

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.4, 6.0
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      Norway has two official languages, both called "Norwegian", namely Bokmål (nb_NO) and Nynorsk (nn_NO).

      NorwegianMinimalStemFilter and NorwegianLightStemFilter currently work only with the more widely used of the two, Bokmål.

      I propose adding "nn" support through a new "variant" config option:

      • variant="nb" or not configured -> Bokmål as today
      • variant="nn" -> Nynorsk only
      • variant="no" -> Stem both nb and nn

      Attachments

      1. SOLR-4565.patch
        27 kB
        Erlend Garåsen
      2. SOLR-4565.patch
        27 kB
        Robert Muir
      3. SOLR-4565.patch
        29 kB
        Erlend Garåsen
      4. SOLR-4565-schema-comments.patch
        1 kB
        Jan Høydahl

        Activity

        Jan Høydahl added a comment -

        Erlend Garåsen, what do you think about this way of supporting nn in the same stemmer? I think it's better than adding two more classes to Lucene/Solr. I'm not sure about the name of the config param. We could also use "language" as Snowball does, but nb and nn are language variants rather than separate languages.

        Erlend Garåsen added a comment -

        I think this is a reasonable approach. I will create a patch within a week or so.

        > Not sure about the name of the config param.

        "variant" is a good suggestion.

        Erlend Garåsen added a comment - - edited

        There are not so many differences between the stemming rules for these two languages. The only difference is that you must skip some rules for Nynorsk if you have configured the stemmer to only use Bokmål.

        Both Nynorsk and Bokmål have endings with "-ene", for instance many feminine definite nouns in plural form such as "jentene" (same for both languages). For such plural forms, you only need to skip stemming of words ending with "-ane" when the stemmer is configured for Bokmål.

        The same rules apply to masculine indefinite nouns in plural form for Nynorsk, i.e. endings with "-ar". The stemmer must skip those endings as long as only Bokmål is used.
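
        The rule gating described above could be sketched roughly as follows. This is a simplified illustration, not the code from the patch: the class name, the `stripPlural` helper, and the minimum-length checks are all assumptions made for the example.

```java
// Illustrative sketch only; names and length thresholds are hypothetical,
// not taken from the Lucene stemmer.
final class VariantSuffixSketch {
    // Strips one plural ending. Nynorsk-only endings ("-ane", "-ar") are
    // skipped unless Nynorsk stemming is enabled.
    static String stripPlural(String word, boolean bokmaal, boolean nynorsk) {
        // "-ene" occurs in both variants, so it is removed whenever either is enabled.
        if ((bokmaal || nynorsk) && word.length() > 5 && word.endsWith("ene")) {
            return word.substring(0, word.length() - 3);
        }
        // "-ane" is a Nynorsk ending; left untouched in Bokmål-only mode.
        if (nynorsk && word.length() > 5 && word.endsWith("ane")) {
            return word.substring(0, word.length() - 3);
        }
        // "-ar" is a Nynorsk ending; left untouched in Bokmål-only mode.
        if (nynorsk && word.length() > 4 && word.endsWith("ar")) {
            return word.substring(0, word.length() - 2);
        }
        return word;
    }
}
```

        With only Bokmål enabled, a word such as "gutar" passes through unchanged, while enabling Nynorsk lets the "-ar" rule apply.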

        Erlend Garåsen added a comment -

        This patch includes Nynorsk support for both NorwegianMinimalStemFilter and NorwegianLightStemFilter. Their test classes have been extended accordingly.

        For backward-compatibility the stemmers will only use Bokmål if no variant is configured. Otherwise, the following variants are valid: no (for both Bokmål and Nynorsk), nn (for only Nynorsk) and nb (for only Bokmål).
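
        As a rough illustration of the defaulting behaviour described above, the option could be resolved along these lines. The `VariantOption` class, its `Variant` enum, and the strict rejection of unknown values are assumptions for this sketch; the patch itself may handle the option differently.

```java
// Illustrative sketch only; VariantOption and Variant are hypothetical names.
final class VariantOption {
    enum Variant { NB, NN, NO }

    // An unset option falls back to Bokmål, preserving the old behaviour.
    static Variant parse(String value) {
        if (value == null || value.isEmpty()) {
            return Variant.NB;
        }
        switch (value) {
            case "nb": return Variant.NB;
            case "nn": return Variant.NN;
            case "no": return Variant.NO;
            default:
                // Assumption: unknown values are rejected rather than ignored.
                throw new IllegalArgumentException("Unknown variant: " + value);
        }
    }

    // "no" enables both variants, so each check is true for two of the three values.
    static boolean useBokmaal(Variant v) { return v != Variant.NN; }
    static boolean useNynorsk(Variant v) { return v != Variant.NB; }
}
```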

        Robert Muir added a comment -

        This looks nice: maybe instead of using String as a parameter, the stemmer can take enum or int flags? The latter seems simplest to me as "both" is allowed, and then we wouldn't need the useNynorsk() or useBokmaal() methods that do string comparisons.
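
        The int-flags idea might look something like the sketch below. The class name, constant values, and helper methods are assumptions chosen for illustration and are not quoted from any patch on this issue.

```java
// Illustrative sketch only; VariantFlags and its constants are hypothetical.
final class VariantFlags {
    static final int BOKMAAL = 1;  // bit 0
    static final int NYNORSK = 2;  // bit 1

    // A bit test replaces the string comparison; "both" is simply BOKMAAL | NYNORSK.
    static boolean stemsBokmaal(int flags) { return (flags & BOKMAAL) != 0; }
    static boolean stemsNynorsk(int flags) { return (flags & NYNORSK) != 0; }
}
```

        A caller wanting both variants would pass `VariantFlags.BOKMAAL | VariantFlags.NYNORSK`, so no third constant is needed for the combined case.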

        Erlend Garåsen added a comment - - edited

        I think we still need the two methods in order to determine whether we should stem for the two variants respectively, and for readability of course. I will try to explain below.

        For backward compatibility, only Bokmål should be used if no variant is defined. Therefore, the useBokmaal method will also return true if no variant is defined. The same applies if "no" is set as the variant: this means that both Nynorsk and Bokmål are enabled, and thus useBokmaal should return true as well. I encapsulated this in methods for readability.

        Robert Muir added a comment -

        Here's a patch showing what I mean...

        Also, some of the endings should be reviewed, because the tests didn't pass.

        I noticed -heten was configured for Nynorsk only, but it's expected to be removed according to the nb_light.txt test file.

        Erlend Garåsen added a comment -

        > Here's a patch showing what i mean...

        +1
        I can create another patch including these changes.

        > also some of the endings should be reviewed, because tests didnt pass.
        >
        > i noticed -heten was configured for Nynorsk-only, but its expected to be removed according to the nb_light.txt test file.

        The tests pass. -heten is handled correctly if you take a look in my first patch. -heten should only be configured for Bokmål, not Nynorsk:

        +         (endsWith(s, len, "heten") &&
        +          useBokmaal(variant)) ||  // general ending (hemmelig-heten -> hemmelig)
        

        The equivalent for this ending using Nynorsk is "-heita".

        My summer vacation starts tomorrow, so it might take a couple of weeks till I have another patch ready - unless I get some time to fulfill this task tomorrow.

        Erlend Garåsen added a comment -

        Here's another patch including flags and with some corrections regarding -heten endings.

        Robert Muir added a comment -

        Thanks for updating the patch before vacation, Erlend. I reviewed it, in particular comparing the Bokmål endings to the previous version, and it's completely backwards compatible. I'm going to commit soon.

        Robert Muir added a comment -

        Thanks Erlend and Jan!

        Jan Høydahl added a comment -

        Thanks Robert!

        Jan Høydahl added a comment -

        With this change, I also propose updating the comments in schema.xml to make the new "variant" param visible and understandable. See SOLR-4565-schema-comments.patch

        What do you think?

        Robert Muir added a comment -

        +1

        I think it would also be OK to make fieldtypes for the languages. For example, I noticed that the Nynorsk-specific stopwords are clearly marked in the file from Snowball. (Currently we just have a file for "no" that contains both.) We could always do that later, too.

        Jan Høydahl added a comment -

        Committed this simple change to trunk (r1498948) and branch_4x (r1498951) to get it in 4.4. Later we could add specific text_nn and text_nb fields. Especially if language detection is also extended to detect the variants...

        PS: I messed up the commit message with wrong JIRA number SOLR-4412, so had to remove the comments made by the very nice commit tag bot from there...

        Steve Rowe added a comment -

        Bulk close resolved 4.4 issues


          People

          • Assignee: Robert Muir
          • Reporter: Jan Høydahl
          • Votes: 0
          • Watchers: 3
