Lucene.Net / LUCENENET-466

Optimisation for the GermanStemmer.vb

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: Lucene.Net 2.9.4, Lucene.Net 2.9.4g, Lucene.Net 3.0.3
    • Fix Version/s: Lucene.Net 3.0.3
    • Component/s: Lucene.Net Contrib
    • Labels:
      None

      Description

      I have a small optimisation for the GermanStemmer.vb class (in
      Contrib.Analyzers). At the moment, the "Substitute" function converts
      the German umlauts "ä" to "a", "ö" to "o" and "ü" to "u". This is not
      the correct German transliteration; they must be converted to "ae",
      "oe" and "ue". The name "Björn" can be written as "Björn" or "Bjoern",
      but not as "Bjorn". With this change, a user can search for "Björn"
      and also find "Bjoern".

      Here is the optimized code snippet:

      else if ( buffer[c] == 'ä' )
      { buffer[c] = 'a'; buffer.Insert(c + 1, 'e'); }
      else if ( buffer[c] == 'ö' )
      { buffer[c] = 'o'; buffer.Insert(c + 1, 'e'); }
      else if ( buffer[c] == 'ü' )
      { buffer[c] = 'u'; buffer.Insert(c + 1, 'e'); }

      Thank You
      Björn
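
The substitution proposed in the description can be sketched as a small self-contained method. This is only an illustration, assuming the buffer is a lowercased System.Text.StringBuilder; the class and method names are hypothetical and this is not the actual GermanStemmer.Substitute method.

```csharp
using System.Text;

public static class Din2SubstituteSketch
{
    // Hypothetical helper (not part of Lucene.Net): expands the lowercase
    // umlauts ä, ö, ü to ae, oe, ue in place. Uppercase umlauts are not
    // handled; the sketch assumes the term was lowercased beforehand.
    public static void ExpandUmlauts(StringBuilder buffer)
    {
        for (int c = 0; c < buffer.Length; c++)
        {
            char replacement;
            switch (buffer[c])
            {
                case 'ä': replacement = 'a'; break;
                case 'ö': replacement = 'o'; break;
                case 'ü': replacement = 'u'; break;
                default: continue; // not an umlaut, move to the next character
            }
            buffer[c] = replacement;
            buffer.Insert(c + 1, 'e');
            c++; // skip the 'e' that was just inserted
        }
    }
}
```

With this sketch, a StringBuilder holding "björn" ends up holding "bjoern".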

      1. DIN2Stemmer.patch
        7 kB
        Christopher Currens

        Activity

        Björn added a comment - Please take a look at the whole conversation: http://mail-archives.apache.org/mod_mbox/lucene-lucene-net-dev/201201.mbox/%3C4F143C84.6030302@patorg.de%3E
        Christopher Currens added a comment -

        Since DIN-5007-1 and DIN-5007-2 are both valid ways of sorting, they should probably both be available as options. DIN-5007-1 is used for words and is what the current GermanStemmer class implements. DIN-5007-2 is a special sorting for lists of names (phone book sorting). Either way, I can see where it could be beneficial to have both. Since I don't want to diverge from the Java stemmer too much, I think it should probably just be an additional constructor on the GermanAnalyzer class that lets you pass a bool if you want to use DIN-5007-2.

        For reference:

        Letter   DIN-5007-1   DIN-5007-2
        ä        a            ae
        ö        o            oe
        ü        u            ue
        ß        ss           ss
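
As a rough illustration of the table above, the two mappings could be selected with a single bool, in the spirit of the constructor flag suggested in this comment. The class name, method name and the useDin2 parameter are assumptions made for this sketch; the actual GermanAnalyzer constructor signature is not shown in this issue.

```csharp
using System.Text;

public static class DinNormalizationSketch
{
    // Hypothetical helper (not the Lucene.Net API): applies the DIN-5007-1
    // mapping when useDin2 is false and the DIN-5007-2 mapping when true.
    public static string Normalize(string term, bool useDin2)
    {
        var result = new StringBuilder(term.Length + 4);
        foreach (char ch in term)
        {
            switch (ch)
            {
                case 'ä': result.Append(useDin2 ? "ae" : "a"); break;
                case 'ö': result.Append(useDin2 ? "oe" : "o"); break;
                case 'ü': result.Append(useDin2 ? "ue" : "u"); break;
                case 'ß': result.Append("ss"); break; // same in both variants
                default: result.Append(ch); break;
            }
        }
        return result.ToString();
    }
}
```

Keeping both mappings behind one flag leaves the DIN-5007-1 behaviour as the default and makes DIN-5007-2 an explicit opt-in.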
        Christopher Currens added a comment -

        I've added a new stemmer in trunk called GermanDIN2Stemmer. You can tell GermanAnalyzer to use it via some new constructors that take a bool indicating whether you want the DIN-5007-2 stemmer instead of the default DIN-5007-1 stemmer.

        This won't break compatibility with users who want to use the old default DIN1 stemmer, but enables anyone who wants to use the other.

        Björn added a comment -

        Hello,

        Maybe it's a good idea to combine the DIN1 and DIN2 algorithms. At the moment the DIN2 stemmer "destroys" the root of the word:

        Haus => Haus
        Häuser => Haeuser
        Haeuser => Haeuser

        DIN1 means:
        ä = a
        DIN2 means:
        ä = ae

        So we could implicitly say: ä = ae = a. This corrects the "root" problem:

        Haus => Haus
        Häuser => Hauser
        Haeuser => Hauser

        Greetings
        Björn
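
The combined rule proposed above (ä = ae = a, and likewise for ö and ü) could look roughly like the following. This is a minimal sketch of the normalization step only, with hypothetical names; it is not the patch attached to this issue, and it deliberately ignores words in which "ae", "oe" or "ue" is not an expanded umlaut.

```csharp
using System.Text;

public static class CombinedDinSketch
{
    // Hypothetical helper (not part of Lucene.Net): maps both the umlaut
    // and its two-letter expansion to the bare vowel, so that
    // ä = ae = a, ö = oe = o and ü = ue = u.
    public static string Normalize(string term)
    {
        var result = new StringBuilder(term.Length);
        for (int i = 0; i < term.Length; i++)
        {
            char ch = term[i];
            switch (ch)
            {
                case 'ä': result.Append('a'); break;
                case 'ö': result.Append('o'); break;
                case 'ü': result.Append('u'); break;
                case 'a':
                case 'o':
                case 'u':
                    result.Append(ch);
                    // Drop an 'e' that directly follows (ae, oe, ue).
                    if (i + 1 < term.Length && term[i + 1] == 'e') i++;
                    break;
                default: result.Append(ch); break;
            }
        }
        return result.ToString();
    }
}
```

With this sketch, "Häuser" and "Haeuser" both become "Hauser", and "Haus" is left unchanged.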

        Christopher Currens added a comment -

        I see what you're saying. I missed that in the original conversation that was linked to in an earlier comment.

        "ue" occurs pretty often as an infix (think of steuer): about 1.5%
        of the words of the German aspell dictionary are affected. "ae" and
        "oe" are rather seldom.

        Still, it may be worth a try, because the stemmer doesn't work
        morphologically anyway. It doesn't really matter if "steuer" is
        stemmed as "steur" or "steu" as long as it's consistent.

        As long as it is made clear that this behavior belongs to the second stemmer, I think this would be an okay change to make to that second option, so that it no longer breaks the root of the word.

        Christopher Currens added a comment -

        Björn,

        I've made this patch from the src/contrib/Analyzers folder, on top of the DIN2 changes already committed to trunk. Since the extent of my German is "danke!", I was hoping you could see if this stemmer is working properly before I commit it to trunk.

        These are the test cases I made, which should hopefully emulate the results of the normal DIN1 stemmer. The text left of the semicolon is the input word and the text to the right is the expected result.

        # Test cases for words with ae, ue, or oe in them
        Haus;hau
        Hauses;hau
        Haeuser;hau
        Haeusern;hau
        steuer;steur
        rueckwaerts;ruckwar
        geheimtuer;geheimtur
        

        With the last word in particular, it produces fairly different results in each stemmer, though I think they are expected, due to the different DIN.

        The DIN2 stemmer will also translate 'Häuser' and 'Häusern' properly (to hau), so there is support for both umlauts and the expanded 'ae', 'oe' and 'ue' forms.
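
The "word;result" test cases listed above could be read with a small parser along these lines. This is a hedged sketch: the class and method names are hypothetical, and the call into the stemmer itself is left out because its API is not shown in this issue.

```csharp
using System;
using System.Collections.Generic;

public static class StemTestCaseSketch
{
    // Hypothetical helper: parses "word;expectedStem" lines, skipping
    // blank lines and lines that start with '#'.
    public static List<KeyValuePair<string, string>> Parse(IEnumerable<string> lines)
    {
        var cases = new List<KeyValuePair<string, string>>();
        foreach (string raw in lines)
        {
            string line = raw.Trim();
            if (line.Length == 0 || line[0] == '#') continue;

            int sep = line.IndexOf(';');
            if (sep < 0) throw new FormatException("Missing ';' in: " + line);

            // Key is the input word, Value is the expected stem.
            cases.Add(new KeyValuePair<string, string>(
                line.Substring(0, sep), line.Substring(sep + 1)));
        }
        return cases;
    }
}
```

Each parsed pair can then be checked by stemming the Key and comparing the output with the Value.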

        Björn added a comment -

        The Code looks good. Thank You!

        Christopher Currens added a comment -

        Committed to trunk


          People

          • Assignee:
            Unassigned
            Reporter:
            Prescott Nasser