Lucene - Core / LUCENE-4019

Parsing Hunspell affix rules without regexp condition

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.6
    • Fix Version/s: 4.0-ALPHA, 6.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      We found out that some recent Dutch hunspell dictionaries contain suffix or prefix rules like the following:

       
      SFX Na N 1
      SFX Na 0 ste
      

      The rule on the second line lacks the 5th parameter, which should be the condition (usually a regexp). The condition is often '.', meaning always (for every character). As explained in LUCENE-3976, the readAffix method throws an error on such rules. I wonder if we should treat the missing value as a kind of default value, like '.'. On the other hand, I haven't found any information about this in the spec. Any thoughts?

      1. LUCENE-4019.patch
        10 kB
        Luca Cavanna
      2. LUCENE-4019.patch
        9 kB
        Luca Cavanna
      3. LUCENE-4019.patch
        3 kB
        Luca Cavanna

        Activity

        Robert Muir added a comment -

        I don't know if there is a real spec, more just what hunspell allows.

        Furthermore, I think some of these dictionaries are actually in ispell/myspell format,
        and hunspell is backwards compatible with them?

        As far as a "spec" for all of these, good luck.

        When I was looking at this, I looked at stuff like:

        http://pwet.fr/man/linux/fichiers_speciaux/hunspell
        http://www.openoffice.org/lingucomponent/affix.readme

        Luca Cavanna added a comment -

        Robert, by "spec" I meant exactly your links.
        Actually, it's clear that the affix header has 4 elements while each rule has at least 5 elements. I don't really know what hunspell does with that kind of malformed rule. Lucene just throws an error while loading the dictionary. Looking at the hunspell source code, I might be wrong, but I suspect it just skips that specific rule with some warning. But honestly, it's hard to believe that at least 4 dictionaries I tried contain mistaken rules, isn't it? I'll investigate more, thanks!

        Robert Muir added a comment -

        It's tough to know for sure. In general, a lot of hunspell dictionaries cannot be parsed.
        There are a ton of these, under many strange licenses, and they are very large.

        A "Test scaffolding" of sorts could probably be done to hunt out problems:

        • download all dictionaries you can find
        • for each one, use hunspell command-line tools like munch, unmunch (which applies all the rules), etc
          to generate some sort of expected output in .txt format.
        • for each one, do the same using the hunspell parsing here.
        • compare results: when things differ, try to boil it down to a compact .aff/.dic, with a test case and fix and commit.
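
The comparison step in the list above could be sketched roughly as follows. This is only an illustration: the class `ExpansionDiff` and its method names are hypothetical, and it assumes the output of both tools has already been normalized to one word form per line.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: compares word lists produced by two Hunspell implementations. */
final class ExpansionDiff {

    /** Entries present in expected but absent from actual, in sorted order. */
    static Set<String> missing(List<String> expected, List<String> actual) {
        Set<String> result = new TreeSet<>(expected);
        result.removeAll(actual);
        return result;
    }

    /** Entries present in actual but absent from expected, in sorted order. */
    static Set<String> unexpected(List<String> expected, List<String> actual) {
        return missing(actual, expected);
    }
}
```

With the outputs stored on disk, both lists could be loaded with `java.nio.file.Files.readAllLines`. Any non-empty result marks a dictionary worth boiling down to a compact .aff/.dic test case.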
        Luca Cavanna added a comment -

        Thank you Robert for the explanation!
        In this specific case it's hard to understand the differences between hunspell and Lucene, since Lucene doesn't even parse the affix file.
        I've been in contact with the authors of those Dutch dictionaries, as well as with the hunspell author. It turned out that those affix rules are wrong and hunspell actually ignores them. I think it's better to ignore them in Lucene too, rather than throwing an exception, which makes it impossible to use those dictionaries at all.

        Luca Cavanna added a comment -

        Small patch: affix rules with fewer than 5 elements are now ignored. I added a specific test with a new affix file containing an example of a rule that is shorter than it should be. Let me know if you prefer to add a warning when a rule is skipped. Hunspell does that only with a specific command line option.

        Chris Male added a comment -

        Hi Luca,

        Sorry for taking so long to get to this. Patch looks good and seems to fix the problem. I think we do need some way to force 'strict' parsing of the files. Do you think you can add an option for that? When strict parsing is enabled, lines without the expected number of elements cause an error.

        We can even have this enabled by default so users have to explicitly say that they know the dictionary doesn't conform to our standard and are okay with us silently ignoring bad rules.
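
A minimal sketch of the behavior discussed here, not the code from the attached patch: in strict mode a rule line with fewer than five whitespace-separated elements raises an error, otherwise it is skipped silently. The class `AffixRuleFilter`, the `strict` parameter, and the assumption that header lines have already been filtered out are all made up for illustration.

```java
import java.text.ParseException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of strict vs. lenient handling of affix rule lines. */
final class AffixRuleFilter {

    /** A rule needs: type (PFX/SFX), flag, stripping characters, affix, condition. */
    private static final int MIN_RULE_ELEMENTS = 5;

    /**
     * Splits each rule line (header lines excluded) on whitespace.
     * Strict mode rejects short rules; lenient mode drops them silently.
     */
    static List<String[]> parse(List<String> ruleLines, boolean strict) throws ParseException {
        List<String[]> rules = new ArrayList<>();
        for (int i = 0; i < ruleLines.size(); i++) {
            String line = ruleLines.get(i);
            String[] elements = line.trim().split("\\s+");
            if (elements.length < MIN_RULE_ELEMENTS) {
                if (strict) {
                    // errorOffset carries the index of the offending line in this sketch
                    throw new ParseException("Affix rule has fewer than five elements: " + line, i);
                }
                continue; // lenient: ignore the malformed rule
            }
            rules.add(elements);
        }
        return rules;
    }
}
```

Making `strict` default to true at the call site would match the suggestion that users must opt in to silently ignoring bad rules.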

        Luca Cavanna added a comment -

        Hi Chris,
        thanks for your feedback. Here is a new patch containing a new option to enable or disable strict affix parsing; it is enabled by default. I also updated the HunspellStemFilterFactory to expose the new option to Solr.

        Chris Male added a comment -

        Hi Luca,

        Thanks for taking a shot at this.

        I wonder whether we can improve the ParseException message? At the very least it should include the line that is causing the problem so people can find it. Even better would be to include the line number as well. The latter is probably not so urgent, but it would be handy to have for other parsing errors too.

        Also I think the changes to the Factory are wrong:

        +      if(strictAffixParsing.equalsIgnoreCase(TRUE)) ignoreCase = true;
        +      else if(strictAffixParsing.equalsIgnoreCase(FALSE)) ignoreCase = false;
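
Reading the two lines above, the likely problem is that the parsed value ends up in `ignoreCase` rather than in a strict-parsing flag. A corrected version might look like the sketch below. The class `StrictFlagParser`, the method name, the error message, and the default of true (taken from the earlier comment that strict parsing is enabled by default) are assumptions, not the code that was committed.

```java
/** Hypothetical sketch: resolves the strict-parsing argument without touching ignoreCase. */
final class StrictFlagParser {

    private static final String TRUE = "true";
    private static final String FALSE = "false";

    /** Returns the strict-parsing setting; a missing argument falls back to enabled. */
    static boolean parseStrictAffixParsing(String value) {
        if (value == null) {
            return true; // assumed default: strict parsing on
        }
        if (value.equalsIgnoreCase(TRUE)) {
            return true;
        }
        if (value.equalsIgnoreCase(FALSE)) {
            return false;
        }
        throw new IllegalArgumentException(
            "Unknown value for strictAffixParsing: " + value + ". Must be true or false");
    }
}
```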
        
        Luca Cavanna added a comment -

        Yeah, sorry for my mistakes, I corrected them.
        And I added the line number to the ParseException.
        Let me know if there's something more I can do!
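
One way to report the line number is sketched here with `java.io.LineNumberReader`; whether the attached patch does it this way is not shown in this thread. The class `AffixLineChecker` and the message wording are illustrative assumptions, and the sketch only looks at `PFX`/`SFX` blocks, reading the rule count from the fourth header element.

```java
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.Reader;
import java.text.ParseException;

/** Hypothetical sketch: validates affix rules and reports the offending line and its number. */
final class AffixLineChecker {

    static void check(Reader affixSource) throws IOException, ParseException {
        LineNumberReader reader = new LineNumberReader(affixSource);
        String line;
        while ((line = reader.readLine()) != null) {
            if (!line.startsWith("SFX") && !line.startsWith("PFX")) {
                continue; // not an affix header, nothing to validate in this sketch
            }
            String[] header = line.trim().split("\\s+");
            if (header.length < 4) {
                throw new ParseException("Malformed affix header: " + line, reader.getLineNumber());
            }
            int ruleCount = Integer.parseInt(header[3]);
            for (int i = 0; i < ruleCount; i++) {
                String rule = reader.readLine();
                if (rule == null) {
                    throw new ParseException("Unexpected end of affix file", reader.getLineNumber());
                }
                if (rule.trim().split("\\s+").length < 5) {
                    // getLineNumber() is the 1-based number of the line just read;
                    // it is passed as errorOffset so callers can locate the rule
                    throw new ParseException(
                        "Affix rule has fewer than five elements: " + rule, reader.getLineNumber());
                }
            }
        }
    }
}
```

For the two-line example from the description, the exception would carry the text `SFX Na 0 ste` and the value 2.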

        Chris Male added a comment -

        Thanks Luca!


          People

          • Assignee:
            Chris Male
            Reporter:
            Luca Cavanna
          • Votes:
            0
            Watchers:
            2
