Solr
  1. Solr
  2. SOLR-2792

Allow case insensitive Hunspell stemming

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 3.5, 4.0-ALPHA
    • Fix Version/s: 3.5, 4.0-ALPHA
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      Same as http://code.google.com/p/lucene-hunspell/issues/detail?id=3

      Hunspell dictionaries are by nature case sensitive. The Hunspell stemmer thus needs an option to allow case insensitive matching of the dictionaries.

      Imagine a query for "microsofts". It will never be stemmed to the dictionary word "Microsoft" because of the case difference. This problem cannot be fixed by putting LowercaseFilter before Hunspell.

      1. SOLR-2792.patch
        17 kB
        Jan Høydahl
      2. SOLR-2792_3x.patch
        17 kB
        Jan Høydahl
      3. SOLR-2792.patch
        17 kB
        Jan Høydahl
      4. SOLR-2792.patch
        17 kB
        Robert Muir
      5. SOLR-2792.patch
        16 kB
        Jan Høydahl
      6. SOLR-2792.patch
        17 kB
        Jan Høydahl
      7. SOLR-2792.patch
        17 kB
        Jan Høydahl

        Activity

        Hide
        Jan Høydahl added a comment -

        Propose an option ignoreCase="true" for HunspellStemFilterFactory, which effectively lowercases everything before matching.

        Show
        Jan Høydahl added a comment - Propose an option ignoreCase="true" for HunspellStemFilterFactory, which effectively lowercases everything before matching.
        Hide
        Jan Høydahl added a comment -

        First attempt on patch. It works, and I added an -i option to the command-line version of HunspellStemmer so that we can test ignoreCase easily.

        I also added an optional ignoreCaseLocale to specify the locale for lowercasing, but this only works for String#toLowerCase, while the lowercasing in CharArrayMap#equals uses plain Character#toLowerCase.

        Show
        Jan Høydahl added a comment - First attempt on patch. It works, and I added an -i option to the command-line version of HunspellStemmer so that we can test ignoreCase easily. I also added an optional ignoreCaseLocale to specify the locale for lowercasing, but this only works for String#toLowerCase, while the lowercasing in CharArrayMap#equals uses plain Character#toLowerCase.
        Hide
        Robert Muir added a comment -

        I dont think ignoreCaseLocale is a good idea: it doesn't make sense, for the same reason that String.equalsIgnoreCase does not take locale.

        String.toLowerCase is only for creating a lowercase presentation to the user, not case-insensitive matching.

        Show
        Robert Muir added a comment - I dont think ignoreCaseLocale is a good idea: it doesn't make sense, for the same reason that String.equalsIgnoreCase does not take locale. String.toLowerCase is only for creating a lowercase presentation to the user, not case-insensitive matching.
        Hide
        Jan Høydahl added a comment -

        Ok, makes sense. New patch which removes ignoreCaseLocale, improves javadoc, and fixes a test bug for affix file.

        I wonder why CharArrayMap#equals did only lowercase text1 and not text2 in the compare. Does my fix look right?

        Show
        Jan Høydahl added a comment - Ok, makes sense. New patch which removes ignoreCaseLocale, improves javadoc, and fixes a test bug for affix file. I wonder why CharArrayMap#equals did only lowercase text1 and not text2 in the compare. Does my fix look right?
        Hide
        Robert Muir added a comment -

        Does my fix look right?

        No, the entries in the map are already lowercased (this happens on add)

        Show
        Robert Muir added a comment - Does my fix look right? No, the entries in the map are already lowercased (this happens on add)
        Hide
        Jan Høydahl added a comment -

        New patch which does not modify CharArrayMap. Also, it lowercases the char[] using a for-loop with Character.toLowerCase instead of the round-trip via a String object.

        Show
        Jan Høydahl added a comment - New patch which does not modify CharArrayMap. Also, it lowercases the char[] using a for-loop with Character.toLowerCase instead of the round-trip via a String object.
        Hide
        Robert Muir added a comment -

        looks batter: a couple questions:

        • in the part reading the dictionary, we should avoid the String.toLowerCase() without any locale here, at least use String.toLowerCase(Locale.ENGLISH) for consistency?
        • shouldn't we case fold the affixes too? however, i'm guessing most of these are already in lowercase.

        Also, its been a while since I looked at the code, but are we "merging" dictionary entries here (I think we should in this lower-casing mode?)

        For example:

        foo/A
        Foo/B

        I think we want to combine flags to create foo/AB ? Maybe we are doing this already though

        Show
        Robert Muir added a comment - looks batter: a couple questions: in the part reading the dictionary, we should avoid the String.toLowerCase() without any locale here, at least use String.toLowerCase(Locale.ENGLISH) for consistency? shouldn't we case fold the affixes too? however, i'm guessing most of these are already in lowercase. Also, its been a while since I looked at the code, but are we "merging" dictionary entries here (I think we should in this lower-casing mode?) For example: foo/A Foo/B I think we want to combine flags to create foo/AB ? Maybe we are doing this already though
        Hide
        Jan Høydahl added a comment -

        Robert,

        in the part reading the dictionary, we should avoid the String.toLowerCase() without any locale here, at least use String.toLowerCase(Locale.ENGLISH) for consistency?

        Yep, for consistency it's probably better to lowercase using an explicit locale than system default. I tested with my name, and Locale.ENGLISH converts Ø->ø, so I'm happy

        shouldn't we case fold the affixes too? however, i'm guessing most of these are already in lowercase.

        The way it works now is that we case fold the input word after affixes are applied, before comparing with dictionary words. So if either input word or affixes are not lower-case they will both be. We could add a test for it to make sure..

        are we "merging" dictionary entries here (I think we should in this lower-casing mode?)

        No, we are not, meaning, I guess, that Foo/B would overwrite foo/A in your example? Would you like to take a stab at the merging code?

        Show
        Jan Høydahl added a comment - Robert, in the part reading the dictionary, we should avoid the String.toLowerCase() without any locale here, at least use String.toLowerCase(Locale.ENGLISH) for consistency? Yep, for consistency it's probably better to lowercase using an explicit locale than system default. I tested with my name, and Locale.ENGLISH converts Ø->ø, so I'm happy shouldn't we case fold the affixes too? however, i'm guessing most of these are already in lowercase. The way it works now is that we case fold the input word after affixes are applied, before comparing with dictionary words. So if either input word or affixes are not lower-case they will both be. We could add a test for it to make sure.. are we "merging" dictionary entries here (I think we should in this lower-casing mode?) No, we are not, meaning, I guess, that Foo/B would overwrite foo/A in your example? Would you like to take a stab at the merging code?
        Hide
        Uwe Schindler added a comment -

        Locale.ENGLISH is the only possible solution, because this one is identical (or nearest) to LowerCaseFilter. If you use another locale, the lowercasing would be incompatible to LowerCaseFilter used in the Analyzer, so it will fail for some special locales (like Turkish).

        Show
        Uwe Schindler added a comment - Locale.ENGLISH is the only possible solution, because this one is identical (or nearest) to LowerCaseFilter. If you use another locale, the lowercasing would be incompatible to LowerCaseFilter used in the Analyzer, so it will fail for some special locales (like Turkish).
        Hide
        Robert Muir added a comment -

        So if either input word or affixes are not lower-case they will both be. We could add a test for it to make sure..

        Yeah, I'm just thinking we shouldn't rely upon the dictionary to have lower-case affixes.

        No, we are not, meaning, I guess, that Foo/B would overwrite foo/A in your example? Would you like to take a stab at the merging code?

        Ok, ill give it a try. I think its a good test to have anyway, i at least want to know what happens here

        Show
        Robert Muir added a comment - So if either input word or affixes are not lower-case they will both be. We could add a test for it to make sure.. Yeah, I'm just thinking we shouldn't rely upon the dictionary to have lower-case affixes. No, we are not, meaning, I guess, that Foo/B would overwrite foo/A in your example? Would you like to take a stab at the merging code? Ok, ill give it a try. I think its a good test to have anyway, i at least want to know what happens here
        Hide
        Robert Muir added a comment -

        guess i was crazy about the merging thing: i added tests for this and it works already as we want... oh well at least we have a test for it

        Show
        Robert Muir added a comment - guess i was crazy about the merging thing: i added tests for this and it works already as we want... oh well at least we have a test for it
        Hide
        Jan Høydahl added a comment -

        Fixed toLowerCase(Locale.ENGLISH).

        Getting close to committing?

        Show
        Jan Høydahl added a comment - Fixed toLowerCase(Locale.ENGLISH). Getting close to committing?
        Hide
        Jan Høydahl added a comment -

        Robert, do you have more changes you want in before commit?

        Show
        Jan Høydahl added a comment - Robert, do you have more changes you want in before commit?
        Hide
        Robert Muir added a comment -

        +1 to commit

        Show
        Robert Muir added a comment - +1 to commit
        Hide
        Jan Høydahl added a comment -

        Attaching final patch files

        Show
        Jan Høydahl added a comment - Attaching final patch files
        Hide
        Jan Høydahl added a comment -

        Fixed for 3.x and trunk. Wiki page updated.

        Show
        Jan Høydahl added a comment - Fixed for 3.x and trunk. Wiki page updated.
        Hide
        Uwe Schindler added a comment -

        Bulk close after 3.5 is released

        Show
        Uwe Schindler added a comment - Bulk close after 3.5 is released

          People

          • Assignee:
            Jan Høydahl
            Reporter:
            Jan Høydahl
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development