Create a NorwegianLightStemmer and NorwegianMinimalStemmer

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      We need a simple, lightweight stemmer for Norwegian, plus a minimal stemmer that handles only plural/singular forms.

      1. SOLR-2764.patch
        37 kB
        Jan Høydahl
      2. SOLR-2764.patch
        36 kB
        Jan Høydahl
      3. SOLR-2764.patch
        31 kB
        Christian Moen
      4. SOLR-2764.patch
        33 kB
        Jan Høydahl
      5. SOLR-2764.patch
        15 kB
        Jan Høydahl

        Activity

        Jan Høydahl added a comment -

        One idea is to try the Hunspell stemmer and modify the .aff file to only do plural/singular of nouns.

        Chris Male added a comment -

        I don't know much about Norwegian, but I think it's best to follow the same model as the other light / minimal stemmers. They are incredibly efficient, targeted and easy to understand.

        Jan Høydahl added a comment -

        Unfortunately, the rules for noun inflection are much more complex in Norwegian than in English, and there are many irregularities.

        Robert Muir added a comment -

        I would leave the irregularities out (e.g. just like our English one basically 'strips the s').
        Someone can always deal with exceptions with their own list: StemmerOverrideFilter etc.

        I don't know anything about Norwegian, but you can take the other languages as examples here,
        and create the ruleset for the most common nominal inflections... e.g. strip

        { -a, -ene, -en, -er, -et }


        or whatever.
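
A minimal sketch of the suggestion above, assuming plain longest-match suffix stripping over the listed endings with an exception list consulted first. The class name, the minimum stem length of 3, and the single override entry are illustrative assumptions; this is not the stemmer that was committed for this issue, and the override map only stands in for what StemmerOverrideFilter would do in a real analysis chain.

```java
import java.util.Map;

/** Hypothetical sketch of rule-based suffix stripping with an exception list. */
class SuffixStripSketch {
  // Longest suffix first, so "-ene" is tried before "-en".
  private static final String[] SUFFIXES = {"ene", "en", "er", "et", "a"};

  // Assumed minimum stem length, so very short words are left alone.
  private static final int MIN_STEM = 3;

  // Illustrative exception list (irregular plural), checked before the rules.
  private static final Map<String, String> OVERRIDES = Map.of("menn", "mann");

  static String stem(String word) {
    String override = OVERRIDES.get(word);
    if (override != null) {
      return override;
    }
    for (String suffix : SUFFIXES) {
      int stemLength = word.length() - suffix.length();
      if (word.endsWith(suffix) && stemLength >= MIN_STEM) {
        return word.substring(0, stemLength);
      }
    }
    return word; // no rule applied
  }

  public static void main(String[] args) {
    for (String w : new String[] {"bilene", "bilen", "biler", "huset", "menn"}) {
      System.out.println(w + " -> " + stem(w));
    }
  }
}
```

Keeping the irregular forms in a separate map leaves the rule list short and lets users extend the exceptions without touching the stemmer.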

        Jan Høydahl added a comment -

        First attempt at a NorwegianLightStemmer, adapted from the Swedish one.

        Can someone come up with a larger test corpus? Not sure if I got all the rules yet.

        Robert Muir added a comment -

        Looks nice to me actually. I can't tell from the test data what you are using already (binary file),
        but a few suggestions for testing (this was the process I used before):

        • the existing actual+expected testdata for the light stemmers were generated by running the C implementations against snowball vocabulary sets,
          I took the vocabulary files from snowball (the voc.txt in TestSnowballVocabData.zip), and ran the original implementations over them
          and created expected output. This is just a broad check that our implementation does the same thing as the original C one.
          I'm not sure how great of a vocabulary set that is for norwegian though.
        • in this case, you don't actually have an existing evaluated impl you are trying to conform to, so this test is not so useful,
          except to check for PorterStemmer-type JRE crashes and to ensure any future refactorings aren't changing the algorithm (breaking index back compat).
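
The vocabulary check described in the bullets above could be sketched roughly as follows. The tab-separated "word, expected stem" line format, the class name, and the stand-in stemmer are assumptions made for illustration; the real test data layout in the patch may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical sketch: compare a stemmer against recorded word/stem pairs. */
class VocabularyCheckSketch {
  /** Returns one message per line whose actual stem differs from the recorded one. */
  static List<String> mismatches(List<String> lines, UnaryOperator<String> stemmer) {
    List<String> failures = new ArrayList<>();
    for (String line : lines) {
      if (line.trim().isEmpty()) {
        continue; // skip blank lines
      }
      String[] cols = line.split("\t");
      if (cols.length != 2) {
        failures.add("malformed line: " + line);
        continue;
      }
      String actual = stemmer.apply(cols[0]);
      if (!actual.equals(cols[1])) {
        failures.add(cols[0] + ": expected " + cols[1] + " but got " + actual);
      }
    }
    return failures;
  }

  public static void main(String[] args) {
    // Stand-in stemmer that only drops a trailing "s"; swap in the real one.
    UnaryOperator<String> stripS =
        w -> w.endsWith("s") ? w.substring(0, w.length() - 1) : w;
    List<String> lines = Arrays.asList("bilens\tbilen", "hus\thus");
    mismatches(lines, stripS).forEach(System.out::println);
  }
}
```

In practice the lines would come from the recorded expected-output file, for example via `java.nio.file.Files.readAllLines`. Without a reference implementation, such a file mainly pins down current behaviour so that later refactorings cannot silently change the stems.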

        Personally at a glance this looks pretty conservative and nice, but I think since there is no published algorithm to refer to, it might be nice
        to add some notes to the Stemmer's java file describing some high level stuff, and also some individual tests that are just examples showing what it does.

        Take a look at Latvian (lv) for an example. In this case the algorithm is not exactly what was published in the referenced PhD thesis.
        I did actually implement the original algorithm, but my tests found it to be extremely aggressive... so it's similar to your case, I think.

        Christian Moen added a comment -

        Jan, could you attach nolighttestdata.zip? Many thanks.

        Jan Høydahl added a comment -

        Thanks for reviewing.

        You're right - we don't have a reference implementation or corpus that we can validate against, so forking the Swedish LightStemmer was perhaps a bit optimistic. So it needs some more love.

        Then I created a NorwegianMinimalStemmer only for nouns and -s endings. It is much simpler to hand-craft, and it works pretty well to my taste.

        Since the test dictionary is small, I changed it into plaintext rather than .zip, so now both dictionaries are in the patch.
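
To make the "nouns and -s endings" idea concrete, here is a rough sketch in the `int stem(char[] s, int len)` shape that the existing light stemmers use. The length thresholds and the exact rule set are assumptions for illustration, not a copy of the attached patch.

```java
/** Hypothetical sketch of a minimal noun stemmer; thresholds are assumed. */
class MinimalNounStemSketch {
  /** Returns the new length of the word held in s[0..len). */
  static int stem(char[] s, int len) {
    // The genitive -s comes last in the word, so drop it first.
    if (len > 4 && s[len - 1] == 's') {
      len--;
    }
    if (len > 5 && endsWith(s, len, "ene")) {  // plural definite
      return len - 3;
    }
    if (len > 4
        && (endsWith(s, len, "er")             // plural indefinite
            || endsWith(s, len, "en")          // masculine/feminine definite
            || endsWith(s, len, "et"))) {      // neuter definite
      return len - 2;
    }
    if (len > 3 && s[len - 1] == 'a') {        // feminine definite
      return len - 1;
    }
    return len;
  }

  private static boolean endsWith(char[] s, int len, String suffix) {
    int n = suffix.length();
    if (n > len) {
      return false;
    }
    for (int i = 0; i < n; i++) {
      if (s[len - n + i] != suffix.charAt(i)) {
        return false;
      }
    }
    return true;
  }

  /** Convenience wrapper for trying the rules on plain strings. */
  static String stem(String word) {
    char[] buffer = word.toCharArray();
    return new String(buffer, 0, stem(buffer, buffer.length));
  }
}
```

Working on a char buffer and returning a new length avoids allocating a string per token, which is presumably why the existing stemmers share that signature.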

        Christian Moen added a comment -

        I added a few entries to the tests, including some irregular ones, to validate and illustrate how the stemmer works in these cases. Jan, looks good to me. +1

        Jan Høydahl added a comment (edited) -

        Thanks Christian. I further refined stuff:

        • I think the MinimalStemmer is more or less good to go, it seems to do what it's supposed to
        • For LightStemmer, we now do "two-pass" removal for the -dom and -het endings. This means that the word "kristendom" will first be stemmed to "kristen", and then all the general rules apply, so it will be further stemmed to "krist". The effect is that "kristen", "kristendom", "kristendommen" and "kristendommens" will all be stemmed to "krist" (because -en is, incorrectly in this case, interpreted as the singular definite ending).
        • Added some more tests to highlight this

        What do you think, is this -dom -het thing a reasonable improvement or could there be side effects?

        Are there some other general rules that could easily be incorporated to catch semi-regular conjugations for the light stemmer?
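
The two-pass behaviour described in this comment can be sketched as below. The suffix lists, the minimum stem length, and the `twoPass` flag are assumptions chosen to reproduce the "kristendom" example; they are not taken from the patch.

```java
/** Hypothetical sketch contrasting one-pass and two-pass suffix removal. */
class TwoPassSketch {
  // Derivational endings, including inflected forms, longest first.
  private static final String[] DERIVATIONAL = {"dommen", "dommer", "heten", "dom", "het"};
  // General inflectional endings, longest first.
  private static final String[] GENERAL = {"ene", "en", "er", "et", "a"};
  // Assumed minimum stem length.
  private static final int MIN_STEM = 3;

  /** Strips the first matching suffix, or returns the word unchanged. */
  private static String strip(String word, String[] suffixes) {
    for (String suffix : suffixes) {
      int stemLength = word.length() - suffix.length();
      if (word.endsWith(suffix) && stemLength >= MIN_STEM) {
        return word.substring(0, stemLength);
      }
    }
    return word;
  }

  static String stem(String word, boolean twoPass) {
    String w = word;
    if (w.length() > 4 && w.endsWith("s")) {
      w = w.substring(0, w.length() - 1); // genitive -s
    }
    String afterFirst = strip(w, DERIVATIONAL);
    boolean derivationalRemoved = !afterFirst.equals(w);
    if (derivationalRemoved && !twoPass) {
      return afterFirst; // one pass: stop after the first removal
    }
    // Two passes (or nothing removed yet): apply the general rules.
    return strip(afterFirst, GENERAL);
  }

  public static void main(String[] args) {
    for (String w : new String[] {"kristen", "kristendom", "kristendommen", "kristendommens"}) {
      System.out.println(w + ": one pass = " + stem(w, false) + ", two passes = " + stem(w, true));
    }
  }
}
```

With two passes all four forms collapse to "krist"; with one pass the -dom forms stop at "kristen" while the bare adjective still becomes "krist", which is the mismatch the second pass is meant to remove.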

        Robert Muir added a comment -

        just some general suggestions:

        In a light stemmer, I would be wary of derivational endings.
        It seems that in the case of dom/het, because it's dealing with adj/noun, it's
        on the edge (maybe OK here), but if possible it would be more ideal to
        avoid multiple passes... this is the kind of thing that causes Snowball problems.

        Can you think of examples for dom/het where the meaning would be changed?

        for example: "freedom" is used the same way in english, but stemming this
        to "free" is very lossy, since free has a variety of meanings (such as costs nothing),
        some of which are incompatible with "freedom". This is the danger of stripping
        derivational suffixes...

        Jan Høydahl added a comment -

        When looking at words ending in -het and -dom in dictionaries (such as Ooo nb_NO.dic), the base word has the same meaning in the vast majority of cases. But of course there will be exceptions. Take the word "brennhet" (het as in hot): it will be stemmed to "brenn" -> "bren", which is kind of wrong, but then "bren" is not a valid word, so it won't cause errors. There may be cases where the final stem clashes with another word, but no more than with the base rules. For example, there is a Norwegian surname "Brenna" which will be stemmed to "brenn" by the "-a" rule, which takes it for a feminine definite ending, and then we get a clash with the verb "brenn" (burn). And the first name "Tore" (boy) or "Tora" (girl) will be stemmed to "Tor" (boy), which is another valid first name...

        My hunch is that the -dom/-het rules do more good than harm, mainly because in the majority of cases they lead to the base word and the -het/-dom word being stemmed to the same stem in cases where the "-en/-et/-a/-e/-n" rules are applied wrongly. Example:

        One pass                       Two passes
        forlegen        forleg         forlegen        forleg
        forlegenhet     forlegen       forlegenhet     forleg
        forlegenheten   forlegen       forlegenheten   forleg
        forlegenhetens  forlegen       forlegenhetens  forleg
        firkantet       firkant        firkantet       firkant
        firkantethet    firkantet      firkantethet    firkant
        firkantetheten  firkantet      firkantetheten  firkant
        

        But I think maybe the rules -dommer and -dommen should be removed, because the words "dommer" (judge) and "dommen" (the sentence) are both common words that also appear as the last part of compounds. For example, "linjedommer" (linesman) would be stemmed to "linje" (line), which is too aggressive.

        I see that it soon gets complicated to try to be clever. Should we go back to the one-pass again for the light stemmer? Christian?

        Robert Muir added a comment -

        Jan, I wasn't trying to be critical about these endings, because of course some of the existing light stemmers
        have a few selected derivational endings that are taken care of. And that's really what it's all about:
        when we are talking about something like adjective->noun, I didn't mean to say we shouldn't do it, because
        it sounds quite reasonable, but we should explore the options.

        For example, as an alternative to multi-pass, a 'less elegant to some' but really practical way to go about it
        can be to 'multiply through' and convert the possibilities to single-pass.

        E.g. the typical 'undrinkables' hunspell example: if i have the english inflectional plural ending -s and the
        derivational ending -able, instead of:

        • pass 1: remove inflectional endings (e.g. -s)
        • pass 2: remove derivational endings (e.g. -able)

        we just take all the pass 2 endings that are compatible with pass 1 endings and cross-multiply, to make a single
        pass algorithm. some won't be compatible, (so we won't combine -able + -s into -ables).

        I'm not sure if this is helpful for the norwegian case as I'm not as familiar with it, just an idea.
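
The "multiply through" idea above can be sketched as follows: combine each derivational ending with every compatible inflectional ending up front, then strip the longest match in a single pass. The class name, the suffix lists, and the compatibility rule (only -het takes further endings here) are invented for illustration and make no claim about Norwegian grammar or the final patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;

/** Hypothetical sketch: turn two suffix passes into one combined suffix list. */
class CrossMultiplySketch {
  /** Builds bare and combined suffixes, sorted longest first. */
  static List<String> combine(List<String> derivational, List<String> inflectional,
                              BiPredicate<String, String> compatible) {
    List<String> out = new ArrayList<>();
    out.addAll(derivational);
    out.addAll(inflectional);
    for (String d : derivational) {
      for (String i : inflectional) {
        if (compatible.test(d, i)) {
          out.add(d + i); // e.g. "het" + "en" -> "heten"
        }
      }
    }
    // Longest first, so a combined ending wins over its own tail.
    out.sort((a, b) -> Integer.compare(b.length(), a.length()));
    return out;
  }

  /** Single pass: strip the first (longest) matching suffix only. */
  static String stem(String word, List<String> suffixes, int minStem) {
    for (String suffix : suffixes) {
      int stemLength = word.length() - suffix.length();
      if (word.endsWith(suffix) && stemLength >= minStem) {
        return word.substring(0, stemLength);
      }
    }
    return word;
  }

  public static void main(String[] args) {
    List<String> suffixes = combine(
        Arrays.asList("het", "dom"),
        Arrays.asList("en", "er", "ene"),
        (d, i) -> d.equals("het")); // assumed: only -het combines further
    System.out.println(suffixes);
    for (String w : new String[] {"hemmelighet", "hemmeligheten", "hemmelighetene", "kristendom"}) {
      System.out.println(w + " -> " + stem(w, suffixes, 3));
    }
  }
}
```

With these inputs, "hemmelighet", "hemmeligheten" and "hemmelighetene" all reduce to "hemmelig" in one pass, while "kristendom" stops at "kristen" because nothing is stripped a second time. Incompatible pairs are simply never generated, which is how problematic combinations can be left out.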

        Jan Høydahl added a comment -

        Will try to prepare a new patch for this when time allows, with one-pass.

        Jan Høydahl added a comment -

        Updated patch with single-pass -het and -dom removal. Tests adjusted and passing. I think this is ready now.

        Jan Høydahl added a comment -

        Committed to trunk and branch_3x

        Robert Muir added a comment -

        Very nice work Jan!


          People

          • Assignee:
            Jan Høydahl
            Reporter:
            Jan Høydahl
          • Votes:
            1
            Watchers:
            1
