Lucene - Core
  1. Lucene - Core
  2. LUCENE-4542

Make RECURSION_CAP in HunspellStemmer configurable

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0
    • Fix Version/s: 4.4, 5.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Currently there is
      private static final int RECURSION_CAP = 2;
      in the code of the class HunspellStemmer. It makes using hunspell with several dictionaries almost unusable, due to bad performance (f.ex. it costs 36ms to stem long sentence in latvian for recursion_cap=2 and 5 ms for recursion_cap=1). It would be nice to be able to tune this number as needed.
      AFAIK this number (2) was chosen arbitrary.

      (it's a first issue in my life, so please forgive me any mistakes done).

      1. Lucene-4542-javadoc.patch
        8 kB
        Rafał Kuć
      2. LUCENE-4542-with-solr.patch
        7 kB
        Rafał Kuć
      3. LUCENE-4542.patch
        2 kB
        Rafał Kuć

        Issue Links

          Activity

          Hide
          Chris Male added a comment -

          The recursion cap shouldn't actually exist, it isn't found in the formal description of the Hunspell algorithm. Instead theoretically the dictionaries themselves should ensure that correct suggestions are generated. However while testing our implementation we discovered that a number of dictionaries get into infinite loops of suffix removals and additions, consequently we added the cap to prevent this.

          I think it would be risky to tamper heavily with the algorithm as it is, since any improvements will help some dictionaries while degrading others. Instead the purpose of Hunspell is to decouple the stemming algorithm for the languages themselves, so the focus should by and large be on improving the dictionaries.

          Show
          Chris Male added a comment - The recursion cap shouldn't actually exist, it isn't found in the formal description of the Hunspell algorithm. Instead theoretically the dictionaries themselves should ensure that correct suggestions are generated. However while testing our implementation we discovered that a number of dictionaries get into infinite loops of suffix removals and additions, consequently we added the cap to prevent this. I think it would be risky to tamper heavily with the algorithm as it is, since any improvements will help some dictionaries while degrading others. Instead the purpose of Hunspell is to decouple the stemming algorithm for the languages themselves, so the focus should by and large be on improving the dictionaries.
          Hide
          Lukas Vlcek added a comment -

          IIRC the hunspell stemmer works basically the following way:

          1. Assuming input token is not a root form of the word it scans affix rules (.aff file) and try to identify possible rules that could have been used to produce the input token.
          2. Apply each found rule to the input token to get one or more output tokens. The output tokens can be considered candidates for the word in root form.
          3. If any of the candidates is found in the dictionary (.dic file) and application of particular rule is allowed (see the regexp pattern in .aff file) then bingo! If not goto #1 until RECURSION_CAP level is reached.

          This way you can have `nongoodnesses` stemmed to `good` (providing RECURSION_CAP=2). Depending on the dictionary and affix rules you may need one pass to get from `nongoodnesses` to `goodnesses` and then two other passes to get from `goodnesses` to `goodness` and then from `goodness` to `good`. (Probably not the best example)

          However, this is all very depending on particular dictionary and affix rules.

          For example I realized that czech (ispell) or slovak (hunspell) dictionaries are constructed in a different way (though still a way that feels natural to the language itself) and only a single pass works best for them (although single pass does not allow for handling both prefix AND suffix at the same time).

          In my opinion there is a lot that could be improved in the hunspell token filter, but it is more linguistic matter then algorithmic.

          Show
          Lukas Vlcek added a comment - IIRC the hunspell stemmer works basically the following way: 1. Assuming input token is not a root form of the word it scans affix rules (.aff file) and try to identify possible rules that could have been used to produce the input token. 2. Apply each found rule to the input token to get one or more output tokens. The output tokens can be considered candidates for the word in root form. 3. If any of the candidates is found in the dictionary (.dic file) and application of particular rule is allowed (see the regexp pattern in .aff file) then bingo! If not goto #1 until RECURSION_CAP level is reached. This way you can have `nongoodnesses` stemmed to `good` (providing RECURSION_CAP=2). Depending on the dictionary and affix rules you may need one pass to get from `nongoodnesses` to `goodnesses` and then two other passes to get from `goodnesses` to `goodness` and then from `goodness` to `good`. (Probably not the best example) However, this is all very depending on particular dictionary and affix rules. For example I realized that czech (ispell) or slovak (hunspell) dictionaries are constructed in a different way (though still a way that feels natural to the language itself) and only a single pass works best for them (although single pass does not allow for handling both prefix AND suffix at the same time). In my opinion there is a lot that could be improved in the hunspell token filter, but it is more linguistic matter then algorithmic.
          Hide
          Erlend Garåsen added a comment -

          Can someone please explain the purpose of recursionCap? So far I understand that it makes it possible to adjust a so-called recursion depth, but I'm not sure I understand what it really does. I had some problems with odd suggestions for the Norwegian language, but changing the recursionCap to 0 did not help.

          Show
          Erlend Garåsen added a comment - Can someone please explain the purpose of recursionCap? So far I understand that it makes it possible to adjust a so-called recursion depth, but I'm not sure I understand what it really does. I had some problems with odd suggestions for the Norwegian language, but changing the recursionCap to 0 did not help.
          Jan Høydahl made changes -
          Status Reopened [ 4 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Hide
          Jan Høydahl added a comment -

          Closing for 4.4. Please reopen if this is wrong, but the code and changes entry were committed for 4.4 so everything should be fine.

          Show
          Jan Høydahl added a comment - Closing for 4.4. Please reopen if this is wrong, but the code and changes entry were committed for 4.4 so everything should be fine.
          Jan Høydahl made changes -
          Fix Version/s 4.4 [ 12324323 ]
          Fix Version/s 4.5 [ 12324742 ]
          Jan Høydahl made changes -
          Link This issue is duplicated by LUCENE-4311 [ LUCENE-4311 ]
          Hide
          Jan Høydahl added a comment -

          I believe this should be closed and set fix-version to 4.4 again?

          Show
          Jan Høydahl added a comment - I believe this should be closed and set fix-version to 4.4 again?
          Steve Rowe made changes -
          Fix Version/s 4.5 [ 12324742 ]
          Fix Version/s 4.4 [ 12324323 ]
          Hide
          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Show
          Steve Rowe added a comment - Bulk move 4.4 issues to 4.5 and 5.0
          Hide
          ASF subversion and git services added a comment -

          Commit 1504571 from Steve Rowe in branch 'dev/branches/lucene_solr_4_4'
          [ https://svn.apache.org/r1504571 ]

          LUCENE-4542: Make HunspellStemFilter's maximum recursion level configurable & move CHANGES.txt entry to the 4.4 section (merged trunk r1504529 & r1504561)

          Show
          ASF subversion and git services added a comment - Commit 1504571 from Steve Rowe in branch 'dev/branches/lucene_solr_4_4' [ https://svn.apache.org/r1504571 ] LUCENE-4542 : Make HunspellStemFilter's maximum recursion level configurable & move CHANGES.txt entry to the 4.4 section (merged trunk r1504529 & r1504561)
          Hide
          ASF subversion and git services added a comment -

          Commit 1504563 from Steve Rowe in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1504563 ]

          LUCENE-4542: move CHANGES.txt entry to the 4.4 section (merged trunk r1504561)

          Show
          ASF subversion and git services added a comment - Commit 1504563 from Steve Rowe in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1504563 ] LUCENE-4542 : move CHANGES.txt entry to the 4.4 section (merged trunk r1504561)
          Hide
          ASF subversion and git services added a comment -

          Commit 1504561 from Steve Rowe in branch 'dev/trunk'
          [ https://svn.apache.org/r1504561 ]

          LUCENE-4542: move CHANGES.txt entry to the 4.4 section

          Show
          ASF subversion and git services added a comment - Commit 1504561 from Steve Rowe in branch 'dev/trunk' [ https://svn.apache.org/r1504561 ] LUCENE-4542 : move CHANGES.txt entry to the 4.4 section
          Steve Rowe made changes -
          Fix Version/s 5.0 [ 12321663 ]
          Fix Version/s 4.4 [ 12324323 ]
          Fix Version/s 4.5 [ 12324742 ]
          Steve Rowe made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Assignee Adrien Grand [ jpountz ] Steve Rowe [ steve_rowe ]
          Hide
          Steve Rowe added a comment -

          Reopening to backport to 4.4, based on conversation with Adrien on #lucene-dev IRC.

          Show
          Steve Rowe added a comment - Reopening to backport to 4.4, based on conversation with Adrien on #lucene-dev IRC.
          Adrien Grand made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Fix Version/s 4.5 [ 12324742 ]
          Resolution Fixed [ 1 ]
          Hide
          Adrien Grand added a comment -

          Committed, thanks!

          Show
          Adrien Grand added a comment - Committed, thanks!
          Hide
          ASF subversion and git services added a comment -

          Commit 1504531 from Adrien Grand in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1504531 ]

          LUCENE-4542: Make HunspellStemFilter's maximum recursion level configurable.

          Show
          ASF subversion and git services added a comment - Commit 1504531 from Adrien Grand in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1504531 ] LUCENE-4542 : Make HunspellStemFilter's maximum recursion level configurable.
          Hide
          ASF subversion and git services added a comment -

          Commit 1504529 from Adrien Grand in branch 'dev/trunk'
          [ https://svn.apache.org/r1504529 ]

          LUCENE-4542: Make HunspellStemFilter's maximum recursion level configurable.

          Show
          ASF subversion and git services added a comment - Commit 1504529 from Adrien Grand in branch 'dev/trunk' [ https://svn.apache.org/r1504529 ] LUCENE-4542 : Make HunspellStemFilter's maximum recursion level configurable.
          Adrien Grand made changes -
          Assignee Chris Male [ cmale ] Adrien Grand [ jpountz ]
          Hide
          Lukas Vlcek added a comment -

          I think it would be nice to have this merged so we can try and fix https://issues.apache.org/jira/browse/LUCENE-4311 as well (this ticket seems to be related).

          Show
          Lukas Vlcek added a comment - I think it would be nice to have this merged so we can try and fix https://issues.apache.org/jira/browse/LUCENE-4311 as well (this ticket seems to be related).
          Hide
          Rafał Kuć added a comment -

          +1 from me also. Chris Male do we need updates on the patches or they are OK?

          Show
          Rafał Kuć added a comment - +1 from me also. Chris Male do we need updates on the patches or they are OK?
          Hide
          Lukas Vlcek added a comment -

          +1 on having this merged into Lucene

          Show
          Lukas Vlcek added a comment - +1 on having this merged into Lucene
          Hide
          Chris Male added a comment -

          It looks great, thanks! I'll take care of it soon.

          Show
          Chris Male added a comment - It looks great, thanks! I'll take care of it soon.
          Hide
          Rafał Kuć added a comment -

          Chris anything else should be done here in your opinion or is it ready to be committed ?

          Show
          Rafał Kuć added a comment - Chris anything else should be done here in your opinion or is it ready to be committed ?
          Rafał Kuć made changes -
          Attachment Lucene-4542-javadoc.patch [ 12552642 ]
          Hide
          Rafał Kuć added a comment -

          Attached the patch with improved javadocs and recursionCap marked as final in the HunspellStemmer.

          Show
          Rafał Kuć added a comment - Attached the patch with improved javadocs and recursionCap marked as final in the HunspellStemmer.
          Hide
          Chris Male added a comment -

          Rafał,

          Thanks for creating the patches, they are looking great. Couple of very small improvements:

          • Can we mark recursionCap as final?
          • Can we improve the javadoc for the recursionCap parameter so it's clear what purpose it serves?
          • Maybe also drop in a comment at the field about how the recursion cap of 2 is the default value based on documentation about Hunspell (as opposed to something we arbitrarily chose).
          Show
          Chris Male added a comment - Rafał, Thanks for creating the patches, they are looking great. Couple of very small improvements: Can we mark recursionCap as final? Can we improve the javadoc for the recursionCap parameter so it's clear what purpose it serves? Maybe also drop in a comment at the field about how the recursion cap of 2 is the default value based on documentation about Hunspell (as opposed to something we arbitrarily chose).
          Rafał Kuć made changes -
          Attachment LUCENE-4542-with-solr.patch [ 12552464 ]
          Hide
          Rafał Kuć added a comment -

          Chris I've attached a second patch which includes changes to Solr HunspellFilter and its factory. Please review it and say if you want any changes to be made to it. I'll be glad to do it.

          Show
          Rafał Kuć added a comment - Chris I've attached a second patch which includes changes to Solr HunspellFilter and its factory. Please review it and say if you want any changes to be made to it. I'll be glad to do it.
          Rafał Kuć made changes -
          Attachment LUCENE-4542.patch [ 12552461 ]
          Hide
          Rafał Kuć added a comment -

          As Piotr doesn't want to provide the patch, I'll do it for him Simple patch adding a new constructor that allows to pass additional parameter - the recursion cap. The old constructor is there and the default value for recursion cap is 2.

          Show
          Rafał Kuć added a comment - As Piotr doesn't want to provide the patch, I'll do it for him Simple patch adding a new constructor that allows to pass additional parameter - the recursion cap. The old constructor is there and the default value for recursion cap is 2.
          Hide
          Piotr added a comment - - edited

          I'd prefer not to create a patch myself, I don't feel so comfortable with lucene code.

          Show
          Piotr added a comment - - edited I'd prefer not to create a patch myself, I don't feel so comfortable with lucene code.
          Chris Male made changes -
          Assignee Chris Male [ cmale ]
          Hide
          Chris Male added a comment -

          +1 I absolutely agree we need to make this change. There is another issue (I can't remember what just yet and I'm using a bad connection) where the recursion cap was causing analysis loops.

          Do you want to create a patch? We need to maintain backwards compatibility so the default experience should be using RECURSION_CAP as it is today. However users should be able to pass in a value as well (that includes the HunspellStemFilterFactory).

          Show
          Chris Male added a comment - +1 I absolutely agree we need to make this change. There is another issue (I can't remember what just yet and I'm using a bad connection) where the recursion cap was causing analysis loops. Do you want to create a patch? We need to maintain backwards compatibility so the default experience should be using RECURSION_CAP as it is today. However users should be able to pass in a value as well (that includes the HunspellStemFilterFactory).
          Piotr made changes -
          Description Currently there is
          private static final int RECURSION_CAP = 2;
          in the code of the class HunspellStemmer. It makes using hunspell with several dictionaries almost unusable, due to bad performance. It would be nice to be able to tune this number as needed.

          (it's a first issue in my life, so please forgive me any mistakes done).
          Currently there is
          private static final int RECURSION_CAP = 2;
          in the code of the class HunspellStemmer. It makes using hunspell with several dictionaries almost unusable, due to bad performance (f.ex. it costs 36ms to stem long sentence in latvian for recursion_cap=2 and 5 ms for recursion_cap=1). It would be nice to be able to tune this number as needed.
          AFAIK this number (2) was chosen arbitrary.

          (it's a first issue in my life, so please forgive me any mistakes done).
          Priority Minor [ 4 ] Major [ 3 ]
          Piotr made changes -
          Field Original Value New Value
          Description Currently there is
          private static final int RECURSION_CAP = 2;
          in the code. It makes using hunspell with several dictionaries almost unusable, due to bad performance. It would be nice to be able to tune this number as needed.

          (it's a first issue in my life, so please forgive me any mistakes done).
          Currently there is
          private static final int RECURSION_CAP = 2;
          in the code of the class HunspellStemmer. It makes using hunspell with several dictionaries almost unusable, due to bad performance. It would be nice to be able to tune this number as needed.

          (it's a first issue in my life, so please forgive me any mistakes done).
          Piotr created issue -

            People

            • Assignee:
              Steve Rowe
              Reporter:
              Piotr
            • Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development