Lucene - Core / LUCENE-87

[PATCH] GermanAnalyzer problems with upper/lower case

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels:
      None
    • Environment:

      Operating System: All
      Platform: PC

      Description

      Hello!

      I noticed some strange problems with the German analyzer when using field search
      for texts consisting of more than one word. For example, I had two documents in
      the search index: one had a field set to "Anfrage von mir", the other had
      it set to "Ticket von mir". While the search for "fieldname:anfrage" returned
      the expected document, "fieldname:ticket" did not return its document. After
      removing the special treatment of upper-case words in the GermanStemmer, it
      worked properly.

      All the best
      Philipp
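
The mismatch reported above can be reproduced with a small sketch. The stemmer below is a toy written for illustration, not the actual GermanStemmer code: it assumes a single hypothetical rule that strips a trailing "t" (treated as a verb ending) only when the word is not capitalized. The class and method names are invented for this example.

```java
import java.util.Locale;

final class CaseSensitiveStemDemo {

    // Toy stemmer, NOT the real GermanStemmer. When caseSensitive is true,
    // a trailing "t" is kept for capitalized words and stripped otherwise.
    static String stem(String term, boolean caseSensitive) {
        if (term.isEmpty()) {
            return term;
        }
        boolean capitalized = Character.isUpperCase(term.charAt(0));
        String s = term.toLowerCase(Locale.GERMAN);
        if (s.length() > 3 && s.endsWith("t") && !(caseSensitive && capitalized)) {
            return s.substring(0, s.length() - 1);
        }
        if (s.length() > 3 && s.endsWith("e")) {
            return s.substring(0, s.length() - 1);
        }
        return s;
    }

    // A query term finds an indexed term only if both reduce to the same stem.
    static boolean matches(String indexed, String query, boolean caseSensitive) {
        return stem(indexed, caseSensitive).equals(stem(query, caseSensitive));
    }

    public static void main(String[] args) {
        System.out.println(matches("Anfrage", "anfrage", true));
        System.out.println(matches("Ticket", "ticket", true));
        System.out.println(matches("Ticket", "ticket", false));
    }
}
```

Under this assumed rule, "Anfrage" and "anfrage" share a stem, while the indexed "Ticket" keeps its final "t" and the query "ticket" loses it, so the two never meet in the index. Passing `false` for `caseSensitive` models removing the special treatment, after which both forms stem identically.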

        Activity

        Otis Gospodnetic added a comment -

        The start of sentence vs. noun comment - I see.

        I have made this change, although it breaks backwards compatibility of the
        GermanAnalyzer.

        Daniel Naber added a comment -

        Otis,

        the problem with uppercase is that any word at the beginning of a sentence starts with
        an uppercase character (just like in English). So unless you've got a sophisticated
        sentence boundary detection you cannot conclude that a word is a noun just because
        it starts with an uppercase character.

        Comment #2 had an example: "ähnelt" (a verb) vs. "Ähnelt" (the same verb, but
        appearing at the beginning of a sentence – which is okay).

        I didn't have a closer look at the Snowball stemmers, so I cannot comment on that.
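
Daniel's point can be illustrated with a minimal sketch. The heuristic below is hypothetical and assumes the simplest possible rule, "a capitalized word is a noun", with no sentence boundary detection. The class and method names are invented for this example, and the sample sentence uses the "Aehnelt" spelling from Mirko's comment.

```java
import java.util.ArrayList;
import java.util.List;

final class CapitalizationHeuristicDemo {

    // Naive rule: any word starting with an upper-case letter is taken for a noun.
    static boolean looksLikeNoun(String word) {
        return !word.isEmpty() && Character.isUpperCase(word.charAt(0));
    }

    // Returns every whitespace-separated word the naive rule classifies as a noun.
    static List<String> guessedNouns(String sentence) {
        List<String> result = new ArrayList<>();
        for (String word : sentence.trim().split("\\s+")) {
            if (looksLikeNoun(word)) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "Aehnelt" is a verb that is capitalized only because it opens the sentence.
        System.out.println(guessedNouns("Aehnelt das Ticket der Anfrage"));
    }
}
```

The rule flags "Aehnelt" together with the real nouns "Ticket" and "Anfrage", which is the false positive described above: without knowing where sentences start, capitalization alone cannot separate nouns from sentence-initial verbs.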

        Otis Gospodnetic added a comment -

        Daniel,
        Thanks for the patch. Before I apply it, could you please explain to me why it
        is okay to ignore upper/lower case characters for a German language stemmer?
        Nouns are upper-cased in German, so wouldn't the case have a special meaning to
        consider before stemming a word?

        Furthermore, would you happen to know whether this GermanStemmer is superior to,
        or different from, the two Snowball stemmers for German?

        Thanks.

        Daniel Naber added a comment -

        Created an attachment (id=11050)
        bug fix + other small enhancements, see my comment

        Daniel Naber added a comment -

        Here's a patch that fixes the bug and does a bit more, obsoleting all other attachments
        to this report. What it does:

        GermanAnalyzer.java:
        -use LowerCaseFilter
        -Hashtable -> HashSet, deprecate the old methods

        GermanStemmer.java:
        -no special handling for uppercase words, this confuses people more than it helps

        WordListLoader:
        -avoid silent failure for null filenames
        -trim() the lines from the stopword file
        -simplify implementation, using HashSet add instead of array copying
        -add a TODO: this isn't specific for German, should be moved

        I hope this can be applied before 1.4 is released.
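
The WordListLoader changes listed above could look roughly like the following sketch. It is not the attached patch: the class name, the method name, and the use of a `Reader` instead of a file name are assumptions made to keep the example self-contained, and skipping blank lines is an extra choice the patch description does not mention.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

final class WordListSketch {

    // Reads one word per line into a HashSet, trimming surrounding whitespace.
    // A null source is rejected immediately instead of failing silently.
    static Set<String> load(Reader source) {
        if (source == null) {
            throw new IllegalArgumentException("word list source must not be null");
        }
        Set<String> words = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) { // assumption: blank lines are ignored
                    words.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> stopWords = load(new StringReader("  der \ndie\n\n das  \n"));
        System.out.println(stopWords.contains("der"));
    }
}
```

Adding directly to a `HashSet` avoids the array copying the patch notes mention, and trimming each line keeps a stray trailing space in the stopword file from producing an entry that never matches a token.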

        Daniel Naber added a comment -

        I added an attachment that does the same as attachment 6543, only that it's a clean
        patch against the latest CVS version.

        Daniel Naber added a comment -

        Created an attachment (id=11027)
        no special uppercase handling

        Daniel Naber added a comment -
        *** Bug 12569 has been marked as a duplicate of this bug. ***
        Philipp Meister added a comment -

        Mirko, the two files I have attached are copies of the original classes, except
        that they ignore the difference between lowercase and uppercase.

        Philipp Meister added a comment -

        Created an attachment (id=6543)
        Stemmer that ignores upper/lowercase

        Philipp Meister added a comment -

        Created an attachment (id=6542)
        Analyzer that ignores upper/lowercase

        Mirko Ebert added a comment -

        I have an additional example:
        the result of the query "Aehnelt" is not equal to the result of the query "aehnelt"


          People

          • Assignee:
            Lucene Developers
            Reporter:
            Philipp Meister
          • Votes:
            0
            Watchers:
            0
