Details

      Description

      Now that Hunspell stemmer is added to Lucene (LUCENE-3414), let's make a Factory for it in Solr

      1. SOLR-2769.patch
        8 kB
        Jan Høydahl
      2. SOLR-2769.patch
        7 kB
        Jan Høydahl
      3. SOLR-2769.patch
        7 kB
        Jan Høydahl
      4. SOLR-2769.patch
        4 kB
        Jan Høydahl
      5. SOLR-2769-branch_3x.patch
        8 kB
        Jan Høydahl
      6. SOLR-2769-branch_3x.patch
        8 kB
        Jan Høydahl

        Issue Links

          Activity

          Hide
          Jan Høydahl added a comment -

          Took the Java class from google code, and added the required version argument to Dictionary constructor (Version.LUCENE_40).

          No JUnit test yet. Guess we can refer to the dic and aff files in analysis module when testing the factory as well.

          Show
          Jan Høydahl added a comment - Took the Java class from google code, and added the required version argument to Dictionary constructor (Version.LUCENE_40). No JUnit test yet. Guess we can refer to the dic and aff files in analysis module when testing the factory as well.
          Hide
          Chris Male added a comment -

          Hi Jan,

          Thanks for tackling this. Only 1 comment, rather than hardcoding the Lucene Version, I think its better to use that configured in the SolrConfig. Take a look at StandardFilterFactory to see how its done.

          Can we add some {} to the for-loop as well? just so its clearer.

          Show
          Chris Male added a comment - Hi Jan, Thanks for tackling this. Only 1 comment, rather than hardcoding the Lucene Version, I think its better to use that configured in the SolrConfig. Take a look at StandardFilterFactory to see how its done. Can we add some {} to the for-loop as well? just so its clearer.
          Hide
          Jan Høydahl added a comment -

          Fixed version arg, and added test. Not sure if calling inform() explicitly from the test code is the easiest way to initialize the factory, but it works.

          Show
          Jan Høydahl added a comment - Fixed version arg, and added test. Not sure if calling inform() explicitly from the test code is the easiest way to initialize the factory, but it works.
          Hide
          Chris Male added a comment -

          Can we call assertMatchVersion() (from BaseTokenStreamFactory) somewhere? either in inform or in create.

          Show
          Chris Male added a comment - Can we call assertMatchVersion() (from BaseTokenStreamFactory) somewhere? either in inform or in create.
          Hide
          Jan Høydahl added a comment -

          Since inform() requires it and is called first, I put assureMatchVersion() in inform()

          Anything else needed to finalize this?

          Show
          Jan Høydahl added a comment - Since inform() requires it and is called first, I put assureMatchVersion() in inform() Anything else needed to finalize this?
          Hide
          Chris Male added a comment -

          Looks fantastic. +1 to committing immediately.

          Show
          Chris Male added a comment - Looks fantastic. +1 to committing immediately.
          Hide
          Jan Høydahl added a comment -

          Attaching branch_3x patch, identical except from package name for WhitespaceTokenizer

          Show
          Jan Høydahl added a comment - Attaching branch_3x patch, identical except from package name for WhitespaceTokenizer
          Hide
          Jan Høydahl added a comment -

          Better Javadoc with example XML and link to dictionaries

          Show
          Jan Høydahl added a comment - Better Javadoc with example XML and link to dictionaries
          Hide
          Jan Høydahl added a comment -

          Same for branch

          Show
          Jan Høydahl added a comment - Same for branch
          Hide
          Jan Høydahl added a comment -

          Checked in for trunk and 3x

          Show
          Jan Høydahl added a comment - Checked in for trunk and 3x
          Show
          Jan Høydahl added a comment - Updated documentation: http://wiki.apache.org/solr/Hunspell http://wiki.apache.org/solr/HunspellStemFilterFactory http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters http://wiki.apache.org/solr/LanguageAnalysis#Stemming
          Hide
          Robert Muir added a comment -

          I think we should be more cautious on recommending Hunspell on the wiki here, for these reasons:

          • The algorithm relies entirely on the quality of the dictionary, for many of these languages the dictionary is not good for this purpose: no affix rules, just a list of words, etc
          • Even in the case where a particular dictionary is pretty good, there are a number of problems: the primary use case of these dictionaries is spellchecking and that doesn't necessarily imply that the rules+affix combinations yield good results here.
          • Finally, the usual problems of having a dictionary-based technique, languages are not static and there absolutely no handling for OOV words.
          Show
          Robert Muir added a comment - I think we should be more cautious on recommending Hunspell on the wiki here, for these reasons: The algorithm relies entirely on the quality of the dictionary, for many of these languages the dictionary is not good for this purpose: no affix rules, just a list of words, etc Even in the case where a particular dictionary is pretty good, there are a number of problems: the primary use case of these dictionaries is spellchecking and that doesn't necessarily imply that the rules+affix combinations yield good results here. Finally, the usual problems of having a dictionary-based technique, languages are not static and there absolutely no handling for OOV words.
          Hide
          Chris Male added a comment -

          Hey Jan,

          I fixed the @see javadoc in the factory. Both IntelliJ and ant javadoc reported that you can't do @see like that (with URLs).

          Show
          Chris Male added a comment - Hey Jan, I fixed the @see javadoc in the factory. Both IntelliJ and ant javadoc reported that you can't do @see like that (with URLs).
          Hide
          Jan Høydahl added a comment -

          Thanks for the @see fix, Chris.

          Robert, you're probably right. I've rephrased some of the statements and added reservations about dictionary quality.

          Show
          Jan Høydahl added a comment - Thanks for the @see fix, Chris. Robert, you're probably right. I've rephrased some of the statements and added reservations about dictionary quality.
          Hide
          Uwe Schindler added a comment -

          Bulk close after 3.5 is released

          Show
          Uwe Schindler added a comment - Bulk close after 3.5 is released

            People

            • Assignee:
              Unassigned
              Reporter:
              Jan Høydahl
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development