Lucene - Core
LUCENE-3414

Bring Hunspell for Lucene into analysis module

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.5, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

      Description

      Some time ago I, along with Robert and Uwe, wrote a stemmer that uses the Hunspell algorithm. It has the benefit of supporting dictionaries for a wide array of languages.

      It still seems to be in use but has fallen out of date. I think it would benefit from being inside the analysis module, where additional features such as decompounding support could be added.

      1. LUCENE-3414.patch
        47 kB
        Chris Male
      2. LUCENE-3414.patch
        46 kB
        Chris Male

          Activity

          Uwe Schindler added a comment -

          Bulk close after release of 3.5

          Jan Høydahl added a comment -

          SOLR-2769

          Chris Male added a comment -

          Nope, it's on my mental TODO, but go for it.

          Jan Høydahl added a comment -

          Is there a JIRA for adding HunspellStemFilterFactory to Solr?

          Chris Male added a comment -

          3x backport:

          Committed revision 1167505.

          Chris Male added a comment -

          Reopening for 3x backport.

          Chris Male added a comment -

          Committed revision 1167467.

          Chris Male added a comment -

          The patch now includes a package.html that links to a PDF about Hunspell and suggests sourcing dictionaries from the OpenOffice wiki.

          Committing tomorrow.

          Robert Muir added a comment -

          I don't think we should ever do anything with the dictionaries. It's much better to make small "test" dictionaries that are really more like unit tests and exercise specific things, like what you did in the patch.

          Chris Male added a comment -

          Okay, good spotting. So how do we want to proceed? Do we want to bring some of the dictionaries in? Should we address that in a later issue, once it's become clearer what OO is doing?

          Robert Muir added a comment -

          "Bizarrely, from what I can see in the OpenOffice SVN, they are still under their original license."

          I don't think we should read too much into that text file: it's not even obvious which of the many dictionaries in that folder it applies to!

          I know for a fact that some of the files in there are NOT GPL, for example the en_US dictionary: http://svn.apache.org/viewvc/incubator/ooo/trunk/main/dictionaries/en/README_en_US.txt?revision=1162288&view=markup

          Chris Male added a comment -

          "How is OpenOffice dealing with those dictionaries since they are now an ASF incubation project? Maybe the dictionaries are under ASL eventually?"

          Bizarrely, from what I can see in the OpenOffice SVN, they are still under their original license. I guess that's something they will have to sort out during incubation.

          I don't see the licenses changing since the dictionaries tend to be developed by national language organisations, but maybe the ASF will negotiate.

          Simon Willnauer added a comment -

          "...so it should really be in Lucene, except the dictionaries."

          How is OpenOffice dealing with those dictionaries since they are now an ASF incubation project? Maybe the dictionaries are under ASL eventually?

          Uwe Schindler added a comment -

          Thanks, Chris, for adding this to the Lucene analysis module. We did lots of work on Google Code, so it should really be in Lucene, except the dictionaries. For those, we should only add links to the web pages where they can be obtained.

          Chris Male added a comment -

          Patch with a port of the code.

          Because most of the dictionaries are L/GPL, I've written my own dumb stupid dictionary for test purposes.

          During testing I discovered a long-standing bug to do with recursive application of rules. This has now been fixed.

          The code is now also version-aware, as required by the CharArray* data structures.
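
          Editorial illustration (not part of the patch): for readers unfamiliar with the format, the following is a minimal sketch of what a tiny hand-written test dictionary and recursive suffix stripping might look like. It is a toy under stated assumptions, not the Lucene code: the names SuffixRule, parse_dic, parse_aff, strip_suffixes and stem are hypothetical, rule conditions are ignored, only suffixes are handled, and intermediate forms are not checked against continuation flags as real Hunspell does. The bug mentioned above is not described in this issue, so the sketch does not attempt to reproduce it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuffixRule:
    flag: str   # flag a dictionary entry must carry for the rule to apply
    strip: str  # characters the rule removed from the stem ("" if none)
    add: str    # characters the rule appended to form the surface word


def parse_dic(lines):
    """Parse 'word/FLAGS' entries into {word: set of single-character flags}."""
    entries = {}
    for line in lines:
        line = line.strip()
        if not line or line.isdigit():  # skip blanks and the leading entry count
            continue
        word, _, flags = line.partition("/")
        entries.setdefault(word, set()).update(flags)
    return entries


def parse_aff(lines):
    """Parse 'SFX flag strip add condition' lines. Conditions are ignored (toy)."""
    rules = []
    for line in lines:
        parts = line.split()
        # Rule lines have 5 fields; 4-field header lines such as 'SFX S Y 2' are skipped.
        if len(parts) == 5 and parts[0] == "SFX":
            _, flag, strip, add, _condition = parts
            rules.append(SuffixRule(flag,
                                    "" if strip == "0" else strip,
                                    "" if add == "0" else add))
    return rules


def strip_suffixes(word, entries, rules, depth):
    """Undo up to `depth` suffix rules and collect stems found in the dictionary."""
    stems = set()
    if depth == 0:
        return stems
    for rule in rules:
        if not rule.add or not word.endswith(rule.add) or len(word) <= len(rule.add):
            continue
        candidate = word[: len(word) - len(rule.add)] + rule.strip
        # Accept the candidate only if the dictionary entry allows this rule.
        if rule.flag in entries.get(candidate, ()):
            stems.add(candidate)
        # Recursive step: the candidate may itself be a suffixed form.
        stems |= strip_suffixes(candidate, entries, rules, depth - 1)
    return stems


def stem(word, entries, rules, depth=2):
    """Return all dictionary stems for `word`, including the word itself if listed."""
    stems = {word} if word in entries else set()
    return stems | strip_suffixes(word, entries, rules, depth)


# A hand-written test dictionary, small enough to reason about in a unit test.
TEST_DIC = ["2", "walk/SG", "try/S"]
TEST_AFF = [
    "SFX S Y 2",
    "SFX S 0 s [^y]",
    "SFX S y ies y",
    "SFX G Y 1",
    "SFX G 0 ing .",
]
```

          With this dictionary, stem("walkings", parse_dic(TEST_DIC), parse_aff(TEST_AFF)) needs two rule applications to reach "walk", while "tryings" yields no stem because "try" does not carry the G flag. Keeping the dictionary this small makes each expected stem easy to check by hand.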

          Jan Høydahl added a comment -

          +1

          We now use Lucene Hunspell for a few customer deployments, and it would be great to have it in the analysis module, since it supports some 70-80 languages out of the box and gives great flexibility: you can edit or augment the dictionaries to change behaviour and fix stemming bugs.

          As a side benefit, I also expect that when the OOo dictionaries get more use in Lucene, users will over time be able to extend and improve the dictionaries and contribute their changes back, also benefiting OOo users.


            People

            • Assignee: Chris Male
            • Reporter: Chris Male
            • Votes: 0
            • Watchers: 4
