Lucene - Core
  1. Lucene - Core
  2. LUCENE-5356

Morfologik filter can accept custom dictionary resources

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 4.6
    • Fix Version/s: 4.8, 6.0
    • Component/s: modules/analysis
    • Labels:
    • Lucene Fields:
      New, Patch Available

      Description

      I have little proposal for morfologik lucene module. Current module is tightly coupled with polish DICTIONARY enumeration.
      But other people (like me) can build own dictionaries to FSA and use it with lucene.
      You can find proposal in attachment and also example usage in analyzer (SlovakLemmaAnalyzer).
      It uses dictionary property as String resource from classpath, not enumeration.
      One change is, that dictionary variable must be set in MofologikFilterFactory (no default value).

      1. LUCENE-5356.patch
        9 kB
        Dawid Weiss
      2. LUCENE-5356.patch
        10 kB
        Michal Hlavac
      3. LUCENE-5356.patch
        11 kB
        Michal Hlavac
      4. LUCENE-5356.patch
        7 kB
        Michal Hlavac

        Activity

        Hide
        Dawid Weiss added a comment -

        A quick look at the patch:

           /** Schema attribute. */
        -  @Deprecated
           public static final String DICTIONARY_SCHEMA_ATTRIBUTE = "dictionary";
        

        We should not un-deprecate this property, especially that its new meaning is different to what it was before. The custom dictionary should be a separate property, with a new semantics.

        All the logic in MorfologikLemmatizer seems awkward to me:

        +    @Override
        +    public Iterator<WordData> iterator() {
        +        if (delegate.size() == 1) {
        +            return delegate.get(0).iterator();
        +        } else {
        +            throw new RuntimeException("No iteration over compound stemmer forms: "
        +                    + Arrays.toString(delegate.toArray()));
        +        }
        +    }
        

        How can this ever be != 1 if the only place you add a delegate in is in the constructor?

        Show
        Dawid Weiss added a comment - A quick look at the patch: /** Schema attribute. */ - @Deprecated public static final String DICTIONARY_SCHEMA_ATTRIBUTE = "dictionary" ; We should not un-deprecate this property, especially that its new meaning is different to what it was before. The custom dictionary should be a separate property, with a new semantics. All the logic in MorfologikLemmatizer seems awkward to me: + @Override + public Iterator<WordData> iterator() { + if (delegate.size() == 1) { + return delegate.get(0).iterator(); + } else { + throw new RuntimeException( "No iteration over compound stemmer forms: " + + Arrays.toString(delegate.toArray())); + } + } How can this ever be != 1 if the only place you add a delegate in is in the constructor?
        Show
        Michal Hlavac added a comment - It's similar code to: https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-polish/src/main/java/morfologik/stemming/PolishStemmer.java
        Hide
        Michal Hlavac added a comment -

        Another point is that lucene-morfologic doesn't need dependency to morfologic-polish library anymore. It's not included in patch.

        Show
        Michal Hlavac added a comment - Another point is that lucene-morfologic doesn't need dependency to morfologic-polish library anymore. It's not included in patch.
        Hide
        Dawid Weiss added a comment -

        I know it's similar but in PolishStemmer the reason for having multiple delegates was that there actually were multiple delegates – the code now doesn't make much sense and should be fixed there too.

        Show
        Dawid Weiss added a comment - I know it's similar but in PolishStemmer the reason for having multiple delegates was that there actually were multiple delegates – the code now doesn't make much sense and should be fixed there too.
        Hide
        Michal Hlavac added a comment -

        used only one delegate in MorfologicLemmatizer

        Show
        Michal Hlavac added a comment - used only one delegate in MorfologicLemmatizer
        Hide
        Dawid Weiss added a comment -

        I looked at the patch and wanted to apply it but there are still some showstoppers to me.

        • property deprecation was not handled the way I mentioned in my previous comment
        • the default mode should be backwards compatible (no custom dictionary => Polish dictionary), so the test should pass without passing 'pl' as the dictionary too. a custom-dictionary test should be added.
        • javadocs and comments need to be updated to reflect this change
        • MorfologikLemmatizer is not needed at all, an IStemmer is enough (this class is a dummy delegate now)
        • this is not the same:
          -      me.setContextClassLoader(PolishStemmer.class.getClassLoader());
          -      this.stemmer = new PolishStemmer();
          +      me.setContextClassLoader(MorfologikLemmatizer.class.getClassLoader());
          +      this.stemmer = new MorfologikLemmatizer(dict);
          

          the context class loader should be left as it was (pointing to PolishStemmer); if the custom dictionary is within that classloader's scope (it should be) it'll be loaded.

        Show
        Dawid Weiss added a comment - I looked at the patch and wanted to apply it but there are still some showstoppers to me. property deprecation was not handled the way I mentioned in my previous comment the default mode should be backwards compatible (no custom dictionary => Polish dictionary), so the test should pass without passing 'pl' as the dictionary too. a custom-dictionary test should be added. javadocs and comments need to be updated to reflect this change MorfologikLemmatizer is not needed at all, an IStemmer is enough (this class is a dummy delegate now) this is not the same: - me.setContextClassLoader(PolishStemmer.class.getClassLoader()); - this .stemmer = new PolishStemmer(); + me.setContextClassLoader(MorfologikLemmatizer.class.getClassLoader()); + this .stemmer = new MorfologikLemmatizer(dict); the context class loader should be left as it was (pointing to PolishStemmer); if the custom dictionary is within that classloader's scope (it should be) it'll be loaded.
        Hide
        Michal Hlavac added a comment -

        Ok, I'll try to change what you say.
        One of base motivation was to remove morfologik-polish from dependecies. It's not backwards compatible but it's more generic. I don't need polish dictionary when I am using e.g. english dictionary.

        Show
        Michal Hlavac added a comment - Ok, I'll try to change what you say. One of base motivation was to remove morfologik-polish from dependecies. It's not backwards compatible but it's more generic. I don't need polish dictionary when I am using e.g. english dictionary.
        Hide
        Dawid Weiss added a comment -

        You don't need it but it has to be backwards compatible because others may rely on it. So we can't just change how it currently works. Alternatively, you can provide an entirely different filter factory class.

        Show
        Dawid Weiss added a comment - You don't need it but it has to be backwards compatible because others may rely on it. So we can't just change how it currently works. Alternatively, you can provide an entirely different filter factory class.
        Hide
        Michal Hlavac added a comment -

        Can't we change it even in major version release?

        Show
        Michal Hlavac added a comment - Can't we change it even in major version release?
        Hide
        Dawid Weiss added a comment -

        We could, but it seems like something that could be implemented and backported to the branch as well. I would do it myself, but I don't want to steal your thunder

        Show
        Dawid Weiss added a comment - We could, but it seems like something that could be implemented and backported to the branch as well. I would do it myself, but I don't want to steal your thunder
        Hide
        Michal Hlavac added a comment -

        This patch contains proposals from previous issue comments.

        Show
        Michal Hlavac added a comment - This patch contains proposals from previous issue comments.
        Hide
        Michal Hlavac added a comment -

        Dawin, is it possible to move on with this issue? thanks

        Show
        Michal Hlavac added a comment - Dawin, is it possible to move on with this issue? thanks
        Hide
        Dawid Weiss added a comment -

        Hi Michal. Sorry, it slipped my mind somehow. I'll look at it over the weekend. Thanks for reminding me.

        Show
        Dawid Weiss added a comment - Hi Michal. Sorry, it slipped my mind somehow. I'll look at it over the weekend. Thanks for reminding me.
        Hide
        Ahmet Arslan added a comment -

        Hi Michal Hlavac , is it possible to use morfologik to convert https://github.com/coltekin/TRmorph to java and create a stem filter?

        Show
        Ahmet Arslan added a comment - Hi Michal Hlavac , is it possible to use morfologik to convert https://github.com/coltekin/TRmorph to java and create a stem filter?
        Hide
        Michal Hlavac added a comment - - edited

        Hi Ahmet,
        I think this is not right way how to ask quetions like this. Please use lucene's user mailing list. Thanks

        Show
        Michal Hlavac added a comment - - edited Hi Ahmet, I think this is not right way how to ask quetions like this. Please use lucene's user mailing list. Thanks
        Hide
        Dawid Weiss added a comment -

        Polished a few minor things wrt the documentation, added CHANGES entry.

        Show
        Dawid Weiss added a comment - Polished a few minor things wrt the documentation, added CHANGES entry.
        Hide
        ASF subversion and git services added a comment -

        Commit 1580853 from Dawid Weiss in branch 'dev/trunk'
        [ https://svn.apache.org/r1580853 ]

        LUCENE-5356: Morfologik filter can accept custom dictionary resources.

        Show
        ASF subversion and git services added a comment - Commit 1580853 from Dawid Weiss in branch 'dev/trunk' [ https://svn.apache.org/r1580853 ] LUCENE-5356 : Morfologik filter can accept custom dictionary resources.
        Hide
        ASF subversion and git services added a comment -

        Commit 1580858 from Dawid Weiss in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1580858 ]

        LUCENE-5356: Morfologik filter can accept custom dictionary resources.

        Show
        ASF subversion and git services added a comment - Commit 1580858 from Dawid Weiss in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1580858 ] LUCENE-5356 : Morfologik filter can accept custom dictionary resources.
        Hide
        Dawid Weiss added a comment -

        Thanks Michal.

        Show
        Dawid Weiss added a comment - Thanks Michal.
        Hide
        Michal Hlavac added a comment -

        One thing I forget. Is it possible to change scope of morfologik-polish dependency to test?

        Show
        Michal Hlavac added a comment - One thing I forget. Is it possible to change scope of morfologik-polish dependency to test?
        Hide
        Dawid Weiss added a comment -

        This plugin basically provides Polish stemming as its primary goal (and it's the only dictionary available in morfologik-stemming out of the box) so excluding Polish dictionaries would be odd and counterintuitive. If you have your own dictionaries and you're using maven, exclude unnecessary dependencies in your POM.

        Show
        Dawid Weiss added a comment - This plugin basically provides Polish stemming as its primary goal (and it's the only dictionary available in morfologik-stemming out of the box) so excluding Polish dictionaries would be odd and counterintuitive. If you have your own dictionaries and you're using maven, exclude unnecessary dependencies in your POM.
        Hide
        Uwe Schindler added a comment -

        Close issue after release of 4.8.0

        Show
        Uwe Schindler added a comment - Close issue after release of 4.8.0

          People

          • Assignee:
            Dawid Weiss
            Reporter:
            Michal Hlavac
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development