Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Dawid Weiss mentioned on LUCENE-2298 that there is another Polish stemmer available:
      http://sourceforge.net/projects/morfologik/

      This works differently than LUCENE-2298, and ideally would be another option for users.

      1. LUCENE-2341.patch
        44 kB
        Dawid Weiss
      2. LUCENE-2341.diff
        42 kB
        Michał Dybizbański
      3. LUCENE-2341.patch
        35 kB
        Dawid Weiss
      4. morfologik-polish-1.5.2.jar
        3.42 MB
        Michał Dybizbański
      5. morfologik-stemming-1.5.2.jar
        17 kB
        Michał Dybizbański
      6. morfologik-fsa-1.5.2.jar
        53 kB
        Michał Dybizbański
      7. LUCENE-2341.diff
        35 kB
        Michał Dybizbański
      8. LUCENE-2341.diff
        27 kB
        Michał Dybizbański
      9. LUCENE-2341.diff
        22 kB
        Michał Dybizbański

        Issue Links

          Activity

          Hide
          Michał Dybizbański added a comment -

          Thanks

          Show
          Michał Dybizbański added a comment - Thanks
          Hide
          Dawid Weiss added a comment -

          In trunk. Long live 1.6 support.

          Show
          Dawid Weiss added a comment - In trunk. Long live 1.6 support.
          Hide
          Dawid Weiss added a comment -

          Final patch for this issue. Needs to wait for 1.6 support.

          Show
          Dawid Weiss added a comment - Final patch for this issue. Needs to wait for 1.6 support.
          Hide
          Dawid Weiss added a comment -

          Needs to wait for official support for 1.6 on the trunk.

          Show
          Dawid Weiss added a comment - Needs to wait for official support for 1.6 on the trunk.
          Hide
          Dawid Weiss added a comment -

          You do like those pesky .toString() calls, don't you? I replaced the code slightly to keep char. sequences only; no need to create new objects. I also changed the impl. a bit to go from the start of the returned list -> theoretically, lemmas should be ordered by probability (in practice it's not the case, but may be in the future).

          All looks good, committed in. Thanks!

          Show
          Dawid Weiss added a comment - You do like those pesky .toString() calls, don't you? I replaced the code slightly to keep char. sequences only; no need to create new objects. I also changed the impl. a bit to go from the start of the returned list -> theoretically, lemmas should be ordered by probability (in practice it's not the case, but may be in the future). All looks good, committed in. Thanks!
          Hide
          Dawid Weiss added a comment -

          Thanks Michał. I'll review it later today and commit in if there are no objections. As for the deleted line – yes, it was intentional; we'll piggyback in this patch unless somebody fixes it earlier, no problem.

          Dawid

          Show
          Dawid Weiss added a comment - Thanks Michał. I'll review it later today and commit in if there are no objections. As for the deleted line – yes, it was intentional; we'll piggyback in this patch unless somebody fixes it earlier, no problem. Dawid
          Hide
          Michał Dybizbański added a comment -

          Dawid, I'm attaching a patch with the suggested changes:

          1. MorfologikAnalyzer now doesn't use a LowerCaseFilter. When IStemmer.lookup returns an empty list for an originally cased token, another lookup is made for lowercased one. I hope the test case reflects your intentions.

          2. I've added MorfologikPOSAttributeImpl class that provides information about morphosyntactic annotations for each lemma, obtained with WordData.getTag(). A test provides a short insight for potential users. Two notes here:
          a) Since MorfologikPOSAttribute might be unused, I've implemented it in terms of CharSequence (and not String), to not convert prematurely each POS tag to String.
          b) Currently a default POS (for a nonlemmatized token) is an empty String, however null value might be more distinctive if empty POS tags for lemma were allowed.

          BTW, the patch deletes one line from dev-tools/eclipse/dot.classpath (<classpathentry kind="src" path="modules/queries/src/test"/>) - was that intentional ?

          Show
          Michał Dybizbański added a comment - Dawid, I'm attaching a patch with the suggested changes: 1. MorfologikAnalyzer now doesn't use a LowerCaseFilter. When IStemmer.lookup returns an empty list for an originally cased token, another lookup is made for lowercased one. I hope the test case reflects your intentions. 2. I've added MorfologikPOSAttributeImpl class that provides information about morphosyntactic annotations for each lemma, obtained with WordData.getTag(). A test provides a short insight for potential users. Two notes here: a) Since MorfologikPOSAttribute might be unused, I've implemented it in terms of CharSequence (and not String), to not convert prematurely each POS tag to String. b) Currently a default POS (for a nonlemmatized token) is an empty String, however null value might be more distinctive if empty POS tags for lemma were allowed. BTW, the patch deletes one line from dev-tools/eclipse/dot.classpath (<classpathentry kind="src" path="modules/queries/src/test"/>) - was that intentional ?
          Hide
          Dawid Weiss added a comment -

          I've cleaned up the patch, but I'd still address the two TODOs that I left in the code:

          • lowercasing should be done not at the external filter level, but inside the filter as a fallback IF AND ONLY IF the original sequence is not found in the dictionary. Morfeusz and Morfologik do have uppercase surface forms and do treat them differently (returning uppercase lemmas, for example). A test for this would be nice as well. An example of an uppercase/mixed surface form: AGD, Aaron, Poznania.
          • I'd expose another attribute with morphosyntactic annotations – this is something that is there anyway, so why not expose it.

          I attached a git diff, but it should apply with patch -p1 < ... too. Michał, will you have the time to polish this off?

          Show
          Dawid Weiss added a comment - I've cleaned up the patch, but I'd still address the two TODOs that I left in the code: lowercasing should be done not at the external filter level, but inside the filter as a fallback IF AND ONLY IF the original sequence is not found in the dictionary. Morfeusz and Morfologik do have uppercase surface forms and do treat them differently (returning uppercase lemmas, for example). A test for this would be nice as well. An example of an uppercase/mixed surface form: AGD, Aaron, Poznania. I'd expose another attribute with morphosyntactic annotations – this is something that is there anyway, so why not expose it. I attached a git diff, but it should apply with patch -p1 < ... too. Michał, will you have the time to polish this off?
          Hide
          Dawid Weiss added a comment -

          Updated patch. A few minor corrections and performance considerations (no intermediate strings). Updated eclipse classpath template.

          Show
          Dawid Weiss added a comment - Updated patch. A few minor corrections and performance considerations (no intermediate strings). Updated eclipse classpath template.
          Hide
          Dawid Weiss added a comment -

          Working on the integration, will provide a final patch before committing. Thanks Michał.

          Show
          Dawid Weiss added a comment - Working on the integration, will provide a final patch before committing. Thanks Michał.
          Hide
          Robert Muir added a comment -

          provided each thread obtains its own TokenStreamComponents through ReusableAnalyzerBase.createComponents (is this always the case ? looking at other filters, thay don't look thread-safe neither ..)

          yes, its the case that Analyzer/ReusableAnalyzerBase take care of this with a threadlocal, as long as each thread only needs to use one tokenstream at a time (which is true for all lucene consumers), see:
          http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/analysis/Analyzer.java

          Show
          Robert Muir added a comment - provided each thread obtains its own TokenStreamComponents through ReusableAnalyzerBase.createComponents (is this always the case ? looking at other filters, thay don't look thread-safe neither ..) yes, its the case that Analyzer/ReusableAnalyzerBase take care of this with a threadlocal, as long as each thread only needs to use one tokenstream at a time (which is true for all lucene consumers), see: http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/analysis/Analyzer.java
          Hide
          Michał Dybizbański added a comment - - edited

          Dawid, as you suggested, I've changed the interface to MorfologikAnalyzer and MorfologikFilter to account for the changes in Morfologik 1.5.2, namely the multiple dictionaries.
          Both those classes' constructors now accept a PolishStemmer.DICTIONARY (instead of languageCode String as in previous patch). A PolishStemmer object is instantiated by MorfologikFilter, so each invocation of MorfologikAnalyzer.createComponents (which instantiates MorfologikFilter) is coupled with an individual instance of PolishStemmer.
          This way, sharing a MorfologikAnalyzer by separate threads is safe (even though MorfologikFilter itself isn't thread-safe) provided each thread obtains its own TokenStreamComponents through ReusableAnalyzerBase.createComponents (is this always the case ? looking at other filters, thay don't look thread-safe neither ..)

          Show
          Michał Dybizbański added a comment - - edited Dawid, as you suggested, I've changed the interface to MorfologikAnalyzer and MorfologikFilter to account for the changes in Morfologik 1.5.2, namely the multiple dictionaries. Both those classes' constructors now accept a PolishStemmer.DICTIONARY (instead of languageCode String as in previous patch). A PolishStemmer object is instantiated by MorfologikFilter, so each invocation of MorfologikAnalyzer.createComponents (which instantiates MorfologikFilter) is coupled with an individual instance of PolishStemmer. This way, sharing a MorfologikAnalyzer by separate threads is safe (even though MorfologikFilter itself isn't thread-safe) provided each thread obtains its own TokenStreamComponents through ReusableAnalyzerBase.createComponents (is this always the case ? looking at other filters, thay don't look thread-safe neither ..)
          Hide
          Dawid Weiss added a comment -

          Dawid, do you think it's reasonable to optimize further and use directly a list returned by IStemmer.lookup (instead of copying with addAll) ? My concern is that (at least in current DictionaryLookup implementation) that list seems to be shared by distinct invocations of the lookup method, which would make the use of a specific IStemmer not applicable in thread-safe code.

          IStemmer implementations are not thread safe anyway, so there is no problem in reusing that list. In fact, the returned WordData objects are reused internally as well, so you can't store them either (this is done to avoid GC overhead).

          So yes: I missed that, but you'll need to ensure IStemmer instances are not shared. This can be done in various ways (thread local, etc), but I think the simplest way to do it would be to instantiate PolishStemmer at the MorfologikFilter level. This is cheap (the dictionary is loaded once anyway).

          You can then create two constructors in the analyzer – one with PolishStemmer.DICTIONARY and one with the default (I'd suggest MORFOLOGIK). Exposing IStemmer constructor will do more harm than good – thinking ahead is good, but in this case I don't think there'll be this many people interested in subclassing IStemmer (if anything, they'll plug into Lucene's infrastructure directly).

          A simple test case spawning 5 or 10 threads in a parallel executor and crunching stems on the same analyzer would also be nice to ensure we have everything correct wrt multithreading, but it's not that crucial if you don't have the time to write it.

          Thanks!

          Show
          Dawid Weiss added a comment - Dawid, do you think it's reasonable to optimize further and use directly a list returned by IStemmer.lookup (instead of copying with addAll) ? My concern is that (at least in current DictionaryLookup implementation) that list seems to be shared by distinct invocations of the lookup method, which would make the use of a specific IStemmer not applicable in thread-safe code. IStemmer implementations are not thread safe anyway, so there is no problem in reusing that list. In fact, the returned WordData objects are reused internally as well, so you can't store them either (this is done to avoid GC overhead). So yes: I missed that, but you'll need to ensure IStemmer instances are not shared. This can be done in various ways (thread local, etc), but I think the simplest way to do it would be to instantiate PolishStemmer at the MorfologikFilter level. This is cheap (the dictionary is loaded once anyway). You can then create two constructors in the analyzer – one with PolishStemmer.DICTIONARY and one with the default (I'd suggest MORFOLOGIK). Exposing IStemmer constructor will do more harm than good – thinking ahead is good, but in this case I don't think there'll be this many people interested in subclassing IStemmer (if anything, they'll plug into Lucene's infrastructure directly). A simple test case spawning 5 or 10 threads in a parallel executor and crunching stems on the same analyzer would also be nice to ensure we have everything correct wrt multithreading, but it's not that crucial if you don't have the time to write it. Thanks!
          Hide
          Michał Dybizbański added a comment -

          Thank you guys for suggestions

          I've changed the diff to include them:

          1. Implemented MorfologikFilter.reset that resets the stemsAcc, and added a test case that would fail in the absence of that implementation, which exhibits the behaviour mentioned by Robert.

          2. Updated modules/analysis/NOTICE.txt and modules/analysis/LICENSE.txt - Robert, is that what you meant, or do they need to include more information ?

          3. MorfologikFilter now uses an explicit pointer to not modify the stemsAcc on each pass - Dawid, do you think it's reasonable to optimize further and use directly a list returned by IStemmer.lookup (instead of copying with addAll) ? My concern is that (at least in current DictionaryLookup implementation) that list seems to be shared by distinct invocations of the lookup method, which would make the use of a specific IStemmer not applicable in thread-safe code.

          4. Removed explicit call to getStem().toString().

          As for the new Morfologik version, I've been thinking it would be better to alter the constructors of MorfologikAnalyzer and MorfologikFilter to accept concrete IStemmer implementations, instead of a languageCode String as they do now. This way, org.apache.lucene.analysis.morfologik package wouldn't depend on current implementations of IStemmer (only on the interface), and also allowed future ones to be used without changing the package. What do you think ?

          That could also solve the case of a custom attribute for POS tags (MorfologikPOSAttribute ?) : since a client would instantiate their IStemmer explicitly, they would know the meaning of the attribute's value. That doesn't take into account the DICTIONARY.COMBINED stemmer, but the same seems to apply to the Morfologik library itself (I mean, for a specific WordData from IStemmer.lookup there is no information on which of the internal concrete DictionaryLookup it comes from). Dawid - what do you think of that issue ?

          Show
          Michał Dybizbański added a comment - Thank you guys for suggestions I've changed the diff to include them: 1. Implemented MorfologikFilter.reset that resets the stemsAcc, and added a test case that would fail in the absence of that implementation, which exhibits the behaviour mentioned by Robert. 2. Updated modules/analysis/NOTICE.txt and modules/analysis/LICENSE.txt - Robert, is that what you meant, or do they need to include more information ? 3. MorfologikFilter now uses an explicit pointer to not modify the stemsAcc on each pass - Dawid, do you think it's reasonable to optimize further and use directly a list returned by IStemmer.lookup (instead of copying with addAll) ? My concern is that (at least in current DictionaryLookup implementation) that list seems to be shared by distinct invocations of the lookup method, which would make the use of a specific IStemmer not applicable in thread-safe code. 4. Removed explicit call to getStem().toString(). As for the new Morfologik version, I've been thinking it would be better to alter the constructors of MorfologikAnalyzer and MorfologikFilter to accept concrete IStemmer implementations, instead of a languageCode String as they do now. This way, org.apache.lucene.analysis.morfologik package wouldn't depend on current implementations of IStemmer (only on the interface), and also allowed future ones to be used without changing the package. What do you think ? That could also solve the case of a custom attribute for POS tags (MorfologikPOSAttribute ?) : since a client would instantiate their IStemmer explicitly, they would know the meaning of the attribute's value. That doesn't take into account the DICTIONARY.COMBINED stemmer, but the same seems to apply to the Morfologik library itself (I mean, for a specific WordData from IStemmer.lookup there is no information on which of the internal concrete DictionaryLookup it comes from). Dawid - what do you think of that issue ?
          Hide
          Dawid Weiss added a comment -

          I've just published morfologik 1.5.2, Michał. This comes with two dictionaries (morfologik and morfeusz) that can be used as one (fallback for missing words) or separately, but I would stick to using morfologik as the default dictionary (possibly with an option of using morfeusz?). POS tags have a different notation in these two resources, so mixing both is probably not a good idea.

          Will you update the patch? Thanks.

          Show
          Dawid Weiss added a comment - I've just published morfologik 1.5.2, Michał. This comes with two dictionaries (morfologik and morfeusz) that can be used as one (fallback for missing words) or separately, but I would stick to using morfologik as the default dictionary (possibly with an option of using morfeusz?). POS tags have a different notation in these two resources, so mixing both is probably not a good idea. Will you update the patch? Thanks.
          Hide
          Robert Muir added a comment -

          Eventually it would be probably sensible to limit the automaton for use in Lucene to store surface forms and lemmas only (no POS tags) and merge both dictionaries into a single automaton... but this can be a future improvement.

          or alternatively, you can expose the POS tags for each stem to lucene right, easiest way would be to put it into TypeAttribute (a string), but you could make your own strongly-typed one if thats a better fit.

          this could be useful for downstream processing.

          Show
          Robert Muir added a comment - Eventually it would be probably sensible to limit the automaton for use in Lucene to store surface forms and lemmas only (no POS tags) and merge both dictionaries into a single automaton... but this can be a future improvement. or alternatively, you can expose the POS tags for each stem to lucene right, easiest way would be to put it into TypeAttribute (a string), but you could make your own strongly-typed one if thats a better fit. this could be useful for downstream processing.
          Hide
          Dawid Weiss added a comment -

          One note wrt patch: I would use an explicit pointer over a list of returned WordData entries instead of adding them to a local list:

          private List<WordData> stemsAcc = new ArrayList<WordData>();

          Right now you're shifting the internal array on each call unnecessarily (just increase an int ptr instead):

          + termAtt.setEmpty().append(stemsAcc.remove(0).getStem().toString());

          getStem() should also be enough since it's a CharSequence, right? No need for an intermediate String.

          Show
          Dawid Weiss added a comment - One note wrt patch: I would use an explicit pointer over a list of returned WordData entries instead of adding them to a local list: private List<WordData> stemsAcc = new ArrayList<WordData>(); Right now you're shifting the internal array on each call unnecessarily (just increase an int ptr instead): + termAtt.setEmpty().append(stemsAcc.remove(0).getStem().toString()); getStem() should also be enough since it's a CharSequence, right? No need for an intermediate String.
          Hide
          Dawid Weiss added a comment -

          I did some analyses on both dictionaries.

          Number of lines (distict surface forms):
          
            3.662.366 morfologik.utf8
            5.086.141 sgjp.utf8
          
          Distinct words (not in both):
          
            2.729.334 unique.utf8
          
            - upper/lower case (morfologik has upper case forms, morfeusz only lower case surface forms)
              
              acerze
              Acerze
          
            - very rare or jargon;
          
              abszminka
              abszytowałem
              acetobakteria
              acetarsolowi
              niebombiasto
              hakatystce
              hakatystycznościach
              warzże
          
            - differences in spelling;
          
              abelard
              abélard
          
            - acronyms and super-short stuff
          
              aap
              aar
          
          Dictinct normalized (lowercase):
          
            2.564.366 lowered.utf8
          
            Most of these are very infrequent words or inflection forms. There are minor differences or
            missing surface forms in both dictionaries, as in here (mz - morfeusz, mk - morfologik):
          
          mz> hakersko
          mz> hakerskość
          mz> hakerskości
          mz> hakerskością
          mz> hakerskościach
          mz> hakerskościami
          mz> hakerskościom
          mk> hakerstw
          mk> hakerstwa
          ...
          mk> hakowałyśmy
          mk> hakowań
          mk> hakowaniach
          mk> hakowaniami
          mk> hakowaniom
          mz> hakowatość
          mz> hakowatości
          mz> hakowatością
          mz> hakowatościach
          mz> hakowatościami
          mz> hakowatościom
          

          So... the conclusion is pretty consistent with Zipf's law: both dictionaries have a fairly different coverage, even if they're quite large. We don't have a frequency dictionary for Polish, but I assume most of these surface forms are purely theoretical and occur super-rarely in practice. This said, I think we should use BOTH dictionaries – after all there's no harm done if we overdo the lemmatization process a little bit, is there?

          So... my proposal would be this: I'll integrate Morfeusz's dictionary in Morfologik (as an alternative dictionary one can load and use).

          Eventually it would be probably sensible to limit the automaton for use in Lucene to store surface forms and lemmas only (no POS tags) and merge both dictionaries into a single automaton... but this can be a future improvement.

          Show
          Dawid Weiss added a comment - I did some analyses on both dictionaries. Number of lines (distict surface forms): 3.662.366 morfologik.utf8 5.086.141 sgjp.utf8 Distinct words (not in both): 2.729.334 unique.utf8 - upper/lower case (morfologik has upper case forms, morfeusz only lower case surface forms) acerze Acerze - very rare or jargon; abszminka abszytowałem acetobakteria acetarsolowi niebombiasto hakatystce hakatystycznościach warzże - differences in spelling; abelard abélard - acronyms and super-short stuff aap aar Dictinct normalized (lowercase): 2.564.366 lowered.utf8 Most of these are very infrequent words or inflection forms. There are minor differences or missing surface forms in both dictionaries, as in here (mz - morfeusz, mk - morfologik): mz> hakersko mz> hakerskość mz> hakerskości mz> hakerskością mz> hakerskościach mz> hakerskościami mz> hakerskościom mk> hakerstw mk> hakerstwa ... mk> hakowałyśmy mk> hakowań mk> hakowaniach mk> hakowaniami mk> hakowaniom mz> hakowatość mz> hakowatości mz> hakowatością mz> hakowatościach mz> hakowatościami mz> hakowatościom So... the conclusion is pretty consistent with Zipf's law: both dictionaries have a fairly different coverage, even if they're quite large. We don't have a frequency dictionary for Polish, but I assume most of these surface forms are purely theoretical and occur super-rarely in practice. This said, I think we should use BOTH dictionaries – after all there's no harm done if we overdo the lemmatization process a little bit, is there? So... my proposal would be this: I'll integrate Morfeusz's dictionary in Morfologik (as an alternative dictionary one can load and use). Eventually it would be probably sensible to limit the automaton for use in Lucene to store surface forms and lemmas only (no POS tags) and merge both dictionaries into a single automaton... but this can be a future improvement.
          Hide
          Dawid Weiss added a comment -

          I'll take a look at the differences between Morfologik and Morfeusz right now, actually. I'll post the results once I have something.

          Show
          Dawid Weiss added a comment - I'll take a look at the differences between Morfologik and Morfeusz right now, actually. I'll post the results once I have something.
          Hide
          Dawid Weiss added a comment -

          Thanks for the contribution, Michał.

          Robert: the dictionary is licensed under MPL or CC-SA (to be selected by the user depending on one's needs). Do you know which one is preferable over another?

          Michał: there is also another (much larger) dictionary that has been released recently and comes from the Morfeusz project. http://sgjp.pl/morfeusz/dopobrania.html This dictionary is actually licensed under BSD license, so no legal worries at all. Both dictionaries are nearly identical (they differ slightly in their convention of morphosyntactic annotations) and Morfeusz's dictionary could be compiled into an automaton for use with Morfologik.

          Which way should we go? What do you think?

          Show
          Dawid Weiss added a comment - Thanks for the contribution, Michał. Robert: the dictionary is licensed under MPL or CC-SA (to be selected by the user depending on one's needs). Do you know which one is preferable over another? Michał: there is also another (much larger) dictionary that has been released recently and comes from the Morfeusz project. http://sgjp.pl/morfeusz/dopobrania.html This dictionary is actually licensed under BSD license, so no legal worries at all. Both dictionaries are nearly identical (they differ slightly in their convention of morphosyntactic annotations) and Morfeusz's dictionary could be compiled into an automaton for use with Morfologik. Which way should we go? What do you think?
          Hide
          Robert Muir added a comment -

          Sorry, about my second comment i was confusing this with the stuff you have for the morfologik jar itself, which is correct

          What i should have said was, I think we should include this information in the top-level modules/analysis/LICENSE.txt and modules/analysis/NOTICE.txt

          Show
          Robert Muir added a comment - Sorry, about my second comment i was confusing this with the stuff you have for the morfologik jar itself, which is correct What i should have said was, I think we should include this information in the top-level modules/analysis/LICENSE.txt and modules/analysis/NOTICE.txt
          Hide
          Robert Muir added a comment -

          Hi Michał,

          This patch looks great!

          I took a quick glance, here are a couple suggestions:

          • In the MorfologikFilter, I think we should implement reset(), first calling the superclass reset(), then clearing the stemsAcc list. This ensures that all of the filter's state is cleared before it is reused. Under normal operations, this should not be necessary, but some consumers in Lucene (e.g. LimitTokenCountFilter, and some similar code in the Highlighter), will only partially consume up to some point, then suddenly stop. By clearing this list in reset() we ensure that there is no chance any leftover stems will appear in the next stream.
          • because the data is licensed under MPL, I think we should explicitly list a hyperlink if possible to the source code used in the NOTICE.txt. I saw you included some wordage in LICENSE.txt but I think this should only say 'XYZ data is under this license, with the actual MPL license text. In the NOTICE.txt we should link to the source code I think... there is some more information on this under the section Category B: Reciprocal Licenses at http://www.apache.org/legal/3party.html
          Show
          Robert Muir added a comment - Hi Michał, This patch looks great! I took a quick glance, here are a couple suggestions: In the MorfologikFilter, I think we should implement reset(), first calling the superclass reset(), then clearing the stemsAcc list. This ensures that all of the filter's state is cleared before it is reused. Under normal operations, this should not be necessary, but some consumers in Lucene (e.g. LimitTokenCountFilter, and some similar code in the Highlighter), will only partially consume up to some point, then suddenly stop. By clearing this list in reset() we ensure that there is no chance any leftover stems will appear in the next stream. because the data is licensed under MPL, I think we should explicitly list a hyperlink if possible to the source code used in the NOTICE.txt. I saw you included some wordage in LICENSE.txt but I think this should only say 'XYZ data is under this license, with the actual MPL license text. In the NOTICE.txt we should link to the source code I think... there is some more information on this under the section Category B: Reciprocal Licenses at http://www.apache.org/legal/3party.html
          Hide
          Michał Dybizbański added a comment -

          Hi

          This patch introduces stemming filter and analyzer, that use Morfologik library, developed by Dawid Weiss and Marcin Miłkowski.
          Tokens are stemmed by Morfologik with a dictionary, and current distribution provides a dictionary for polish language.

          The MorfologikFilter yields one or more terms for each token. Each of those terms is given the same position in the index.

          I'm attaching a binary distribution of the library (morfologik-stemming-1.5.0.jar), that needs to be placed in modules/analysis/morfologik/lib/ subdirectory.
          It is also available as a Maven artifact.

          The library is BSD-licensed and a dictionary uses data from Polish dictionary for aspell/ispell/myspell (SJP.PL), which is licensed under GPL, LGPL, MPL and CC SA licenses.

          This is my first contribution to the Lucene project, so please be forgiving
          Thanks to Dawid for help.

          Regards,
          Michał

          Show
          Michał Dybizbański added a comment - Hi This patch introduces stemming filter and analyzer, that use Morfologik library , developed by Dawid Weiss and Marcin Miłkowski. Tokens are stemmed by Morfologik with a dictionary, and current distribution provides a dictionary for polish language. The MorfologikFilter yields one or more terms for each token. Each of those terms is given the same position in the index. I'm attaching a binary distribution of the library (morfologik-stemming-1.5.0.jar), that needs to be placed in modules/analysis/morfologik/lib/ subdirectory. It is also available as a Maven artifact . The library is BSD-licensed and a dictionary uses data from Polish dictionary for aspell/ispell/myspell (SJP.PL) , which is licensed under GPL, LGPL, MPL and CC SA licenses. This is my first contribution to the Lucene project, so please be forgiving Thanks to Dawid for help. Regards, Michał
          Hide
          Robert Muir added a comment -

          Andrzej,
          By the way, this is now on by default on trunk.

          Show
          Robert Muir added a comment - Andrzej, By the way, this is now on by default on trunk.
          Hide
          Andrzej Bialecki added a comment -

          When indexing multiple stems/lemmas (or synonyms in general) at the same positions, in most cases users should also set the Similarity.discoutOverlaps to avoid skewing the lengthNorm-s. I think that it should be mentioned somewhere in the javadocs.

          Show
          Andrzej Bialecki added a comment - When indexing multiple stems/lemmas (or synonyms in general) at the same positions, in most cases users should also set the Similarity.discoutOverlaps to avoid skewing the lengthNorm-s. I think that it should be mentioned somewhere in the javadocs.
          Hide
          Robert Muir added a comment -

          I think overlapping positions makes sense.
          But this can cause funky scoring for homographs,
          Even if the analysis is legitimately different, tf is wrong.

          Maybe solr's removeduplicatesfilter should be recommended
          down the chain...

          Show
          Robert Muir added a comment - I think overlapping positions makes sense. But this can cause funky scoring for homographs, Even if the analysis is legitimately different, tf is wrong. Maybe solr's removeduplicatesfilter should be recommended down the chain...
          Hide
          Dawid Weiss added a comment -

          Oh, I forgot about this – yes, you're right, it can emit multiple lemmas (including their morphology). An example from the test case:

          		final IStemmer s = new PolishStemmer();
          
          		final String word = "liga";
          		final List<WordData> response = s.lookup(word);
          
          		final HashSet<String> stems = new HashSet<String>();
          		final HashSet<String> tags = new HashSet<String>();
          
          		for (WordData wd : response) {
          			stems.add(wd.getStem().toString());
          			tags.add(wd.getTag().toString());
          			assertSame(word, wd.getWord());
          		}
          
          		assertTrue(stems.contains("ligać"));
          		assertTrue(stems.contains("liga"));
          		assertTrue(tags.contains("subst:sg:nom:f"));
          		assertTrue(tags.contains("verb:fin:sg:ter:imperf"));
          

          This raises the question how we want to handle multiple stems... should they be indexed on overlapping positions?

          Show
          Dawid Weiss added a comment - Oh, I forgot about this – yes, you're right, it can emit multiple lemmas (including their morphology). An example from the test case: final IStemmer s = new PolishStemmer(); final String word = "liga"; final List<WordData> response = s.lookup(word); final HashSet<String> stems = new HashSet<String>(); final HashSet<String> tags = new HashSet<String>(); for (WordData wd : response) { stems.add(wd.getStem().toString()); tags.add(wd.getTag().toString()); assertSame(word, wd.getWord()); } assertTrue(stems.contains("ligać")); assertTrue(stems.contains("liga")); assertTrue(tags.contains("subst:sg:nom:f")); assertTrue(tags.contains("verb:fin:sg:ter:imperf")); This raises the question how we want to handle multiple stems... should they be indexed on overlapping positions?
          Hide
          Robert Muir added a comment -

          oh yeah, and also, what about cases with multiple solutions? does it emit multiple stems?

          Show
          Robert Muir added a comment - oh yeah, and also, what about cases with multiple solutions? does it emit multiple stems?
          Hide
          Robert Muir added a comment -

          Dawid, sounds good. I had a few questions, admittedly not having time to look at the code much:

          • Does it support generation as well?
          • Does it expose any morph attributes (e.g. POS) ?
          Show
          Robert Muir added a comment - Dawid, sounds good. I had a few questions, admittedly not having time to look at the code much: Does it support generation as well? Does it expose any morph attributes (e.g. POS) ?
          Hide
          Dawid Weiss added a comment -

          Robert, should I wait for Stempel patch first and then model this one after you? I'm thinking we can reuse most of the code; these stemmers have nearly identical APIs anyway.

          Show
          Dawid Weiss added a comment - Robert, should I wait for Stempel patch first and then model this one after you? I'm thinking we can reuse most of the code; these stemmers have nearly identical APIs anyway.

            People

            • Assignee:
              Dawid Weiss
              Reporter:
              Robert Muir
            • Votes:
              3 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development