SOLR-2934: Problem with Solr Hunspell with French Dictionary

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.5
    • Fix Version/s: 4.8, Trunk
    • Component/s: Schema and Analysis
    • Labels:
      None
    • Environment:

      Windows 7

      Description

      I'm trying to add the HunspellStemFilterFactory to my Solr project.
      I'm trying this on a fresh new download of Solr 3.5.
      I downloaded a French dictionary from here: http://www.dicollecte.org/download/fr/hunspell-fr-moderne-v4.3.zip

      But when I start Solr and go to the Solr Analysis page, an error occurs.

      Here is the trace:
      java.lang.RuntimeException: Unable to load hunspell data! [dictionary=en_GB.dic,affix=fr-moderne.aff]
      at org.apache.solr.analysis.HunspellStemFilterFactory.inform(HunspellStemFilterFactory.java:82)
      at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:546)
      at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:126)
      at org.apache.solr.core.CoreContainer.create(CoreContainer.java:461)
      at org.apache.solr.core.CoreContainer.load(CoreContainer.java:316)
      at org.apache.solr.core.CoreContainer.load(CoreContainer.java:207)
      at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:130)
      at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:94)
      at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
      at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
      at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
      at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
      at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282)
      at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
      at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
      at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
      at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
      at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
      at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
      at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
      at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
      at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
      at org.mortbay.jetty.Server.doStart(Server.java:224)
      at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
      at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
      at java.lang.reflect.Method.invoke(Unknown Source)
      at org.mortbay.start.Main.invokeMain(Main.java:194)
      at org.mortbay.start.Main.start(Main.java:534)
      at org.mortbay.start.Main.start(Main.java:441)
      at org.mortbay.start.Main.main(Main.java:119)

      Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 3
      at java.lang.String.charAt(Unknown Source)
      at org.apache.lucene.analysis.hunspell.HunspellDictionary$DoubleASCIIFlagParsingStrategy.parseFlags(HunspellDictionary.java:382)
      at org.apache.lucene.analysis.hunspell.HunspellDictionary.parseAffix(HunspellDictionary.java:165)
      at org.apache.lucene.analysis.hunspell.HunspellDictionary.readAffixFile(HunspellDictionary.java:121)
      at org.apache.lucene.analysis.hunspell.HunspellDictionary.<init>(HunspellDictionary.java:64)
      at org.apache.solr.analysis.HunspellStemFilterFactory.inform(HunspellStemFilterFactory.java:46)

      I can't find where the problem is. It seems like my dictionary isn't well formed for hunspell, but I tried with two different dictionaries and had the same problem.
      I also tried with an English dictionary, and it works!
      So I think my French dictionary is wrong for hunspell, but I don't know why.

      Can you help me?

      1. en_GB.aff
        73 kB
        Markus Jelsma
      2. en_GB.dic
        685 kB
        Markus Jelsma


          Activity

          Uwe Schindler added a comment -

          Close issue after release of 4.8.0

          Robert Muir added a comment -

          Stephan Meisinger added a comment - 16/Jul/12 05:05

          Please consider looking at this again:
          I can reproduce the original StringIndexOutOfBoundsException in DoubleASCIIFlagParsingStrategy.

          Just a follow-up on that issue with long flags: I found this in a Thunderbird dictionary. The bug is not the flag parsing (again, it should always be an even number of characters; I added an explicit check for that too!). Instead the bug was that escaping wasn't handled properly, so if the word itself contains a slash, some parts of the word would be bogusly parsed as flags. The escaping was fixed in LUCENE-5497.
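The slash handling described here can be sketched as a stand-alone routine; this is an illustration, not the actual code committed in LUCENE-5497, and the class and method names are invented:

```java
// Illustrative splitter for a *.dic entry line: "\/" is an escaped literal
// slash inside the word, and the first unescaped '/' starts the flag section.
public class DicLineSplitter {
    /** Returns {word, flags}; flags is "" when the line has no flag section. */
    static String[] split(String line) {
        StringBuilder word = new StringBuilder();
        int i = 0;
        while (i < line.length()) {
            char c = line.charAt(i);
            if (c == '\\' && i + 1 < line.length() && line.charAt(i + 1) == '/') {
                word.append('/');  // escaped slash belongs to the word itself
                i += 2;
            } else if (c == '/') {
                return new String[] { word.toString(), line.substring(i + 1) };
            } else {
                word.append(c);
                i++;
            }
        }
        return new String[] { word.toString(), "" };
    }
}
```

With this, an entry like `1\/2/AB` yields the word `1/2` with flags `AB`, instead of part of the word being parsed as flags.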

          Robert Muir added a comment -

          Currently we can load all the openoffice dictionaries (at least from the old link).

          I will test newer dictionaries (thunderbird has a link) later today, especially since it has many that aren't in the openoffice list. This might reveal some issues to fix.

          As for en_GB.aff/.dic, I committed a fix for this (for now we use mark/reset to go back once we find the encoding, and I ensured it has a large enough buffer size).
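The mark/reset approach can be sketched roughly like this; it is a simplified stand-alone illustration, not the committed code, and the buffer size and default charset are assumptions:

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Peek at the affix file to find its "SET <charset>" line, then rewind the
// stream so the real parser can re-read it from the start in that charset.
public class EncodingSniffer {
    static Charset sniff(BufferedInputStream in) throws IOException {
        in.mark(1 << 16);  // must be large enough to cover leading AF/AM lines
        BufferedReader r = new BufferedReader(new InputStreamReader(in, "ISO-8859-1"));
        String line;
        while ((line = r.readLine()) != null) {
            if (line.startsWith("SET ")) {
                in.reset();  // rewind for the real parse
                return Charset.forName(line.substring(4).trim());
            }
        }
        in.reset();
        return Charset.forName("ISO-8859-1");  // hunspell's traditional default
    }
}
```

The failure mode fixed here is a dictionary with many AF/AM lines before the SET line: if the read-ahead exceeds the mark limit, reset() fails, so the buffer must be generously sized.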

          As for the original exception reported by the user (mixing en_GB.dic with a French affix file): this is not supported. Affix files must "go with" the dictionary, as they contain information such as how characters and flags are encoded.
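In other words, the filter should be pointed at the .dic and .aff files from the same package. A sketch for schema.xml, with filenames assumed from the Dicollecte download in the report:

```xml
<!-- Both files must come from the same dictionary package -->
<filter class="solr.HunspellStemFilterFactory"
        dictionary="fr-moderne.dic"
        affix="fr-moderne.aff"
        ignoreCase="true"/>
```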

          As for Stephan's issue with long flags: there should never be an odd number of flag characters, so something is wrong with the dictionary you are using. I haven't seen it yet in the wild with published dictionaries.

          ASF subversion and git services added a comment -

          Commit 1574159 from Robert Muir in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1574159 ]

          SOLR-2934: increase buffer size for recent dictionaries with large amounts of AF/AM lines before charset

          ASF subversion and git services added a comment -

          Commit 1574158 from Robert Muir in branch 'dev/trunk'
          [ https://svn.apache.org/r1574158 ]

          SOLR-2934: increase buffer size for recent dictionaries with large amounts of AF/AM lines before charset

          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Robert Muir added a comment -

          moving all 4.0 issues not touched in a month to 4.1

          Robert Muir added a comment -

          rmuir20120906-bulk-40-change

          Chris Male added a comment -

          I will take a look Stephan.

          Stephan Meisinger added a comment -

          Please consider looking at this again:
          I can reproduce the original StringIndexOutOfBoundsException in DoubleASCIIFlagParsingStrategy.

          I think this is caused by

          for (int i = 0; i < rawFlags.length(); i += 2) {
              char cookedFlag = (char) ((int) rawFlags.charAt(i) + (int) rawFlags.charAt(i + 1)); // <<< i + 1 here!

          We used the dictionary with Solr 3.3 (patched with files from LUCENE-3414/SOLR-2769) and changed the given line to

          for (int i = 0; i < rawFlags.length() - 1; i += 2) { // <<< reduce the bound by 1 because of .charAt(i + 1)
              char cookedFlag = (char) ((int) rawFlags.charAt(i) + (int) rawFlags.charAt(i + 1));

          This worked flawlessly for us.
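A stand-alone version of the patched loop, with the explicit even-length check Robert later mentioned adding, might look like this; the combining rule is taken from the snippet above, not from the Lucene source:

```java
// Double-ASCII flags come in pairs of characters; an odd-length flag string
// indicates a malformed dictionary, so it is rejected instead of letting
// charAt(i + 1) run past the end of the string.
public class DoubleAsciiFlags {
    static char[] parseFlags(String rawFlags) {
        if (rawFlags.length() % 2 != 0) {
            throw new IllegalArgumentException("odd number of flag characters: " + rawFlags);
        }
        char[] flags = new char[rawFlags.length() / 2];
        for (int i = 0; i < rawFlags.length(); i += 2) {
            flags[i / 2] = (char) (rawFlags.charAt(i) + rawFlags.charAt(i + 1));
        }
        return flags;
    }
}
```

Rejecting malformed input outright surfaces the broken dictionary line, whereas silently shortening the loop (as in the workaround) drops the last flag character.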

          Hoss Man added a comment -

          bulk fixing the version info for 4.0-ALPHA and 4.0; all affected issues have "hoss20120711-bulk-40-change" in a comment

          ludovic Boutros added a comment -

          I've attached a little patch to the other issue which allows me to load the latest French dictionaries from OpenOffice.

          ludovic Boutros added a comment -

          done : SOLR-3494.

          ludovic Boutros added a comment -

          And just for information: the Ubuntu French hunspell dictionary is not compressed and works perfectly.

          Chris Male added a comment - edited

          Correct me if I'm wrong but it seems not possible to load compressed affix dictionaries currently.

          That is correct. Our Java code isn't a direct port from the C++ code, rather it's an implementation designed to suit our needs. It could definitely do with some love in regards to compressed dictionaries. Would you like to open an issue and throw together a patch?

          ludovic Boutros added a comment -

          For the French dictionary for instance, if I understand the mechanism well,
          it seems that there are some aliases, i.e. "AF ...", "AM ...".
          These dictionaries are somehow compressed.

          And in the C++ code there is this piece of code:

              dash = strchr(piece, '/');
              if (dash) {
                  ...
                  if (pHMgr->is_aliasf()) {
                      int index = atoi(dash + 1);
                      nptr->contclasslen = pHMgr->get_aliasf(index, &(nptr->contclass));
                  } else {
                      nptr->contclasslen = pHMgr->decode_flags(&(nptr->contclass), dash + 1);
                      flag_qsort(nptr->contclass, 0, nptr->contclasslen);
                  }
              }

          But I did not find anything similar in the Java class; I think the aliases are not loaded.
          Correct me if I'm wrong, but it seems it is not possible to load compressed affix dictionaries currently.

          Hope this can help.
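A minimal Java counterpart of that aliasf lookup could look like the sketch below; the class and method names are invented, and it only illustrates the indirection, assuming AF indices are 1-based as in hunspell:

```java
import java.util.ArrayList;
import java.util.List;

// "AF n" declares n alias lines; each subsequent "AF <flags>" line numbers a
// flag string, and a purely numeric value after '/' in an entry is then an
// index into that table rather than a literal flag string.
public class FlagAliases {
    private final List<String> table = new ArrayList<>();

    void addAlias(String flags) {  // called once per "AF <flags>" line
        table.add(flags);
    }

    /** Resolve the text after '/' to a flag string, via the alias table if numeric. */
    String resolve(String afterSlash) {
        if (!table.isEmpty() && !afterSlash.isEmpty()
                && afterSlash.chars().allMatch(Character::isDigit)) {
            return table.get(Integer.parseInt(afterSlash) - 1);  // 1-based index
        }
        return afterSlash;  // no aliases in use: literal flags
    }
}
```

Without this indirection, a parser reads "123" after the slash as three literal flag characters, which is exactly how the compressed French dictionaries break.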

          Bráulio Bhavamitra added a comment -

          The same is happening with the pt_BR dict: http://artfiles.org/openoffice.org/contrib/dictionaries/pt_BR.zip

          Robert Muir added a comment -

          I see, indeed this has no CHARSET line.

          I think the only solution is to allow the user to manually provide this as a parameter in such cases.

          Markus Jelsma added a comment -

          en_GB.aff and en_GB.dic files from openoffice.org.

          Robert Muir added a comment -

          When I click that page it just links to http://extensions.services.openoffice.org/en/project/dict-en-fixed and gives the same error.

          Can you upload your copy?

          Markus Jelsma added a comment -

          Indeed! Strange. If you go there via:
          http://extensions.services.openoffice.org/en/dictionaries

          and this anchor:
          English dictionaries with fixed dash handling and new ligature and phonetic suggestion support

          you'll end up on the same page without error.

          Robert Muir added a comment -

          That page doesn't work for me.

          So if you want bugs fixed with dictionaries, which anyone can make and can be buggy, you must upload
          them to this issue (don't check the box), because otherwise we have nothing to work with.

          Markus Jelsma added a comment -
          Caused by: java.text.ParseException: The first non-comment line in the affix file must be a 'SET charset', was: 'FLAG num'
                  at org.apache.lucene.analysis.hunspell.HunspellDictionary.getDictionaryEncoding(HunspellDictionary.java:262)
                  at org.apache.lucene.analysis.hunspell.HunspellDictionary.<init>(HunspellDictionary.java:101)
                  at org.apache.solr.analysis.HunspellStemFilterFactory.inform(HunspellStemFilterFactory.java:80)
                  ... 31 more
          

          This is thrown by the en_GB available at OpenOffice.
          http://extensions.services.openoffice.org/en/project/dict-en-fixed

          Markus Jelsma added a comment -

          I'm sorry, I meant the packages available at OpenOffice.

          Robert Muir added a comment -

          Where did you get your en_GB dictionary? The one from openoffice has as first line 'SET ISO8859-1'.
          So if you want bugs fixed with dictionaries, which anyone can make and can be buggy, you must upload
          them to this issue (don't check the box), because otherwise we have nothing to work with.

          There is no point in worrying about adding aliases/matching up charset naming for Thai.
          The thai spell dictionary is just a list of words (nothing in the .aff except 4 replacement rules
          for spellchecking), so this whole filter will be a no-op with that dictionary.

          SET TIS620-2533
          
          REP 4
          REP ทร ซ
          REP ซ ทร
          REP ส ซ
          REP ซ ส
          
          Robert Muir added a comment -

          It seems there are many different issues with the provided dic and aff files and some seem to work and some don't seem to work at all.

          What does this mean? We don't provide any dic and aff files!

          Markus Jelsma added a comment -

          We've seen issues with quite a few dic files as well, but the stack trace makes it difficult to find the error. NumberFormatExceptions (da_DK) are easy, as they print the bad number, but ArrayIndexOutOfBoundsExceptions (nl_NL) are almost impossible to debug. We also see ParseExceptions such as (en_GB):

          Caused by: java.text.ParseException: The first non-comment line in the affix file must be a 'SET charset', was: 'FLAG num'
          at org.apache.lucene.analysis.hunspell.HunspellDictionary.getDictionaryEncoding(HunspellDictionary.java:262)
          at org.apache.lucene.analysis.hunspell.HunspellDictionary.<init>(HunspellDictionary.java:101)
          at org.apache.solr.analysis.HunspellStemFilterFactory.inform(HunspellStemFilterFactory.java:80)
          ... 31 more

          and UnsupportedCharsetException (th_TH):

          Caused by: java.nio.charset.UnsupportedCharsetException: TIS620-2533
          at java.nio.charset.Charset.forName(Charset.java:505)
          at org.apache.lucene.analysis.hunspell.HunspellDictionary.getJavaEncoding(HunspellDictionary.java:275)
          at org.apache.lucene.analysis.hunspell.HunspellDictionary.<init>(HunspellDictionary.java:102)
          at org.apache.solr.analysis.HunspellStemFilterFactory.inform(HunspellStemFilterFactory.java:80)
          ... 31 more

          It seems there are many different issues with the provided dic and aff files and some seem to work and some don't seem to work at all.
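One way to cope with charset names Java does not recognize, like the TIS620-2533 above, is an alias table consulted before Charset.forName; the mappings below are illustrative examples, not an exhaustive or authoritative list:

```java
import java.nio.charset.Charset;
import java.util.Map;

// Bridge hunspell-style charset names to names the JDK understands.
public class HunspellCharsets {
    private static final Map<String, String> ALIASES = Map.of(
        "ISO8859-1", "ISO-8859-1",
        "TIS620-2533", "TIS-620",
        "microsoft-cp1251", "windows-1251"
    );

    static Charset forName(String hunspellName) {
        return Charset.forName(ALIASES.getOrDefault(hunspellName, hunspellName));
    }
}
```

Whether a given alias resolves still depends on the charsets the running JRE ships, so unknown names should still surface a clear error rather than be guessed at.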

          Chris Male added a comment -

          Great idea

          Robert Muir added a comment -

          Might also be a good idea to document somewhere that not all languages' encodings work correctly at the moment.

          Some of these are crazy-complicated (e.g. Hungarian).

          Chris Male added a comment -

          No worries

          Erick Erickson added a comment -

          Yep, I just saw that and tried to re-open the issue but you beat me to it! I should probably read the user's list before the dev list each morning!

          My mistake.

          Chris Male added a comment -

          Hey Erick,

          I asked Nathan to open this issue as he reported it on the mailing list. I want to evaluate whether the problem is in our codebase.

          Erick Erickson added a comment -

          Please raise issues like this on the Solr users' list (solr-user@lucene.apache.org) first rather than opening a JIRA; JIRAs are for bugs/improvements rather than this kind of issue. If the users-list discussion indicates a problem (rather than a user/data error) exists, then feel free to raise a JIRA.


            People

            • Assignee: Chris Male
            • Reporter: Nathan Castelein
            • Votes: 4
            • Watchers: 8
