Lucene - Core
  1. Lucene - Core
  2. LUCENE-6742

Fix SnowballFilter for Lovins & Finnish (and others that use reflection)

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4, 6.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      While reviewing the warnings (LUCENE-6740) I noticed the following:
      The LovinsStemmer and the FinnishStemmer from the snowball package cannot work at all. Indeed Robert Muir added a comment to the test referring to: http://article.gmane.org/gmane.comp.search.snowball/1139

      The bug is the following: The Among class looks up the method to call via reflection. But the stemmer compiler creates all these methods as private. As AccessibleObject#setAccessible() is called nowhere (we also don't want this), the SnowballProgram main class cannot call the method. Unfortunately it completely supresses all reflection errors and returns false. SnowballProgram then thinks, the called private boolean r_A() method returned false and so the whole snowball algorithm breaks.

      Also the snowball stemmer classes had a bug: methodObject is a static instance of the Stemmer, and Among calls the private method then on wrong object. So when fixing the above, this would fail again (because the stemmer changes a static singleton object, not itsself!). We also need to change the object in SnowballProgram to use "this" when calling the method.

      There are several possibilities to solve the private methods problem:

      • call AccessibleObject.setAccessible(true) -> we don't want this! Never Ever!
      • patch the whole Snowball-generated classes to make those methods "public". This is a huge patch and shows many internals => no way
      • Since Java 7 we can use a cool trick, my favourite: MethodHandles

      The MethodHandles trick can be done the following way:

      • Change the Among class to get a MethodHandles.Lookup in its ctor (instead of the broken static SnowballProgram instance methodObject). This lookup is used to lookup the method by name. This Lookup Object must be private to our implementing class (the one that defines the private methods). If this is the case, we can get a method handle and call the method (from anywhere, because we "own" it).
      • replace the (already broken) methodObject in every stemmer to instead be the correct MethodHandles.Lookup of the implementing class containing the private method. This has 2 effects: The class to get the lookup gets all access rights to the class calling it (so we can access private methods) and also sees all methods.

      The second part is done by a new Ant task: ant patch-snowball. Whenever you add a new Snowball stemmer, copy the java file generated by the snowball compiler to the ext directory and call the ant task. It will patch the methodObject declaration (and also add a @SuppressWarnings("unused") for convenience).

      I reenabled the dictionary tests and the whole stemmer works. There are also no issues with security managers, because we do nothing security sensitive (all is our own stuff, no setAccessible).

      And finally: LovinsStemmer and FinnishStemmer got insanely fast (and correct, of course).

      1. LUCENE-6742.patch
        35 kB
        Uwe Schindler
      2. LUCENE-6742.patch
        32 kB
        Uwe Schindler
      3. LUCENE-6742.patch
        32 kB
        Uwe Schindler
      4. LUCENE-6742-Java8.patch
        521 kB
        Uwe Schindler

        Activity

        Hide
        Uwe Schindler added a comment -

        Here is the patch. To test the ANT task, revert all changes in org/tartarus/snowball/ext and run ant patch-snowball from the analysis/common module.

        Show
        Uwe Schindler added a comment - Here is the patch. To test the ANT task, revert all changes in org/tartarus/snowball/ext and run ant patch-snowball from the analysis/common module.
        Hide
        Uwe Schindler added a comment -

        Small fixes to patcher (charset).

        Show
        Uwe Schindler added a comment - Small fixes to patcher (charset).
        Hide
        Robert Muir added a comment -

        Big +1, I love that it fixes the correctness issues here.

        As a followup, i would like to sync us up to snowball on github, since right now its difficult to add new stemmers (see the Lithuanian issue). It would be good to commit this one first and then we can iterate on that.

        Show
        Robert Muir added a comment - Big +1, I love that it fixes the correctness issues here. As a followup, i would like to sync us up to snowball on github, since right now its difficult to add new stemmers (see the Lithuanian issue). It would be good to commit this one first and then we can iterate on that.
        Hide
        Robert Muir added a comment -

        We should also send this fix as a PR to snowball, I think its important since it fixes problems with correctness. If you don't have time, let me know and I can try to do it.

        Show
        Robert Muir added a comment - We should also send this fix as a PR to snowball, I think its important since it fixes problems with correctness. If you don't have time, let me know and I can try to do it.
        Hide
        Uwe Schindler added a comment -

        Yes, we should do this! Interestingly, the source code of their java impl is still on revision 503 (one after ours, but no changes at all). So their generator code does not fit their java impl...: https://github.com/snowballstem/snowball/tree/master/java/org/tartarus/snowball

        The code their SnowballCompiler generated was in fact "correct", but the SnowballProgram algorithm was broken.

        Show
        Uwe Schindler added a comment - Yes, we should do this! Interestingly, the source code of their java impl is still on revision 503 (one after ours, but no changes at all). So their generator code does not fit their java impl...: https://github.com/snowballstem/snowball/tree/master/java/org/tartarus/snowball The code their SnowballCompiler generated was in fact "correct", but the SnowballProgram algorithm was broken.
        Hide
        Uwe Schindler added a comment -

        New patch. I also added some documentation to Javadocs and README.txt how to get the "revision 502" from Github (I added the commit hash link).

        Robert Muir: Can you check if the version 502 as linked works for regenerating? In the comments you mentioned that the stemmers were also patched. As we already have the "snowball-patch" command, we may also add more regexes to fix the compiled classes and adapt API (if thats easy).

        Show
        Uwe Schindler added a comment - New patch. I also added some documentation to Javadocs and README.txt how to get the "revision 502" from Github (I added the commit hash link). Robert Muir : Can you check if the version 502 as linked works for regenerating? In the comments you mentioned that the stemmers were also patched. As we already have the "snowball-patch" command, we may also add more regexes to fix the compiled classes and adapt API (if thats easy).
        Hide
        Uwe Schindler added a comment -

        For completeness I also played around with rewriting more of the Among call for Java 8. This may be a the most elegant solution, as it has no reflection at all, but requires heavy pathcing (and only works with Java 8):

        • Among now takes a functional interface as last parameter to the ctor (or null, if no function, previously empty string)
        • The compiled classes use XxxStemmer::r_XX method references (Java 8). This ensures correctness already at compile time! This required most patching!
        • SnowballProgramm now just uses the functional interface to call the method (if not null).

        Actually, its much less code, speed should be the same like the Java 7 MethodHandles solution, but more elegant and safe at compile time.

        I would like to commit the Java 7 version to trunk and 5.x, but leave the Java 8 version here as "example", how to implement the C function pointers in Java (Java 8's method references are nothing more than "safe" C function pointers!).

        Show
        Uwe Schindler added a comment - For completeness I also played around with rewriting more of the Among call for Java 8. This may be a the most elegant solution, as it has no reflection at all, but requires heavy pathcing (and only works with Java 8): Among now takes a functional interface as last parameter to the ctor (or null, if no function, previously empty string) The compiled classes use XxxStemmer::r_XX method references (Java 8). This ensures correctness already at compile time! This required most patching! SnowballProgramm now just uses the functional interface to call the method (if not null). Actually, its much less code, speed should be the same like the Java 7 MethodHandles solution, but more elegant and safe at compile time. I would like to commit the Java 7 version to trunk and 5.x, but leave the Java 8 version here as "example", how to implement the C function pointers in Java (Java 8's method references are nothing more than "safe" C function pointers!).
        Hide
        Robert Muir added a comment -

        Sorry, there is no revision 502 anymore. Now there is only shitty git. So its difficult to regenerate the stemmers. That needs to be another issue.

        Show
        Robert Muir added a comment - Sorry, there is no revision 502 anymore. Now there is only shitty git. So its difficult to regenerate the stemmers. That needs to be another issue.
        Hide
        Uwe Schindler added a comment -

        The revision 502 is still there. See the github link in my patch. This is exactly rev 502 from the old svn.

        I added it to rwadme and comments.

        you should be able to download it and use it. Just don't use HEAD.

        Show
        Uwe Schindler added a comment - The revision 502 is still there. See the github link in my patch. This is exactly rev 502 from the old svn. I added it to rwadme and comments. you should be able to download it and use it. Just don't use HEAD.
        Hide
        Robert Muir added a comment -

        But as i already said, regeneration is broken today in trunk if you do that. it should really be a separate issue.

        Show
        Robert Muir added a comment - But as i already said, regeneration is broken today in trunk if you do that. it should really be a separate issue.
        Hide
        Uwe Schindler added a comment -

        I agree. I just pointed the documentation in my latest patch to the revision 502 which has now a differenbt GIT ID: https://github.com/snowballstem/snowball/tree/e103b5c257383ee94a96e7fc58cab3c567bf079b
        This lists the following as commit message:
        git-svn-id: svn+userv://snowball.tartarus.org/snowball/trunk/snowball@502 633ccae0-01f4-0310-8c99-d3591da6f01f
        Clearly referring to rev 502. Theoretically it should be possible to use this exact github variant (click on "Download ZIP" on the right-bottom) for regenerate: https://github.com/snowballstem/snowball/archive/e103b5c257383ee94a96e7fc58cab3c567bf079b.zip

        I will open a separate issue to donate our changes back to Snowball (so we have access to inner char[]),... and also fix the C code to generate the MethodHandles.lookup() that I patched in. Maybe the snowball people also want the much more safe Java 8 method reference implementation, but this is all separate.

        We should also investigate later if the linear scan through all amongs is really the best idea, maybe we can improve that, too - but also at snowball's repo.

        Finally: The Java 8 patch is just for reference how it could look like - but there is too many patching of generated files in it. So I would stay with the MethodHandles variant thats bugfree and fast, it just uses still the "string names" of the private methods, but works perfectly. I just attached the patch here to "save my work" and show you, Robert Muir, how the compiled result could look like.

        Show
        Uwe Schindler added a comment - I agree. I just pointed the documentation in my latest patch to the revision 502 which has now a differenbt GIT ID: https://github.com/snowballstem/snowball/tree/e103b5c257383ee94a96e7fc58cab3c567bf079b This lists the following as commit message: git-svn-id: svn+userv://snowball.tartarus.org/snowball/trunk/snowball@502 633ccae0-01f4-0310-8c99-d3591da6f01f Clearly referring to rev 502. Theoretically it should be possible to use this exact github variant (click on "Download ZIP" on the right-bottom) for regenerate: https://github.com/snowballstem/snowball/archive/e103b5c257383ee94a96e7fc58cab3c567bf079b.zip I will open a separate issue to donate our changes back to Snowball (so we have access to inner char[]),... and also fix the C code to generate the MethodHandles.lookup() that I patched in. Maybe the snowball people also want the much more safe Java 8 method reference implementation, but this is all separate. We should also investigate later if the linear scan through all amongs is really the best idea, maybe we can improve that, too - but also at snowball's repo. Finally: The Java 8 patch is just for reference how it could look like - but there is too many patching of generated files in it. So I would stay with the MethodHandles variant thats bugfree and fast, it just uses still the "string names" of the private methods, but works perfectly. I just attached the patch here to "save my work" and show you, Robert Muir , how the compiled result could look like.
        Hide
        ASF subversion and git services added a comment -

        Commit 1696147 from Uwe Schindler in branch 'dev/trunk'
        [ https://svn.apache.org/r1696147 ]

        LUCENE-6742: Lovins & Finnish implementation of SnowballFilter was fixed to behave exactly as specified. A bug in the snowball compiler caused differences in output of the filter in comparison to the original test data. In addition, the performance of those filters was improved significantly

        Show
        ASF subversion and git services added a comment - Commit 1696147 from Uwe Schindler in branch 'dev/trunk' [ https://svn.apache.org/r1696147 ] LUCENE-6742 : Lovins & Finnish implementation of SnowballFilter was fixed to behave exactly as specified. A bug in the snowball compiler caused differences in output of the filter in comparison to the original test data. In addition, the performance of those filters was improved significantly
        Hide
        ASF subversion and git services added a comment -

        Commit 1696148 from Uwe Schindler in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1696148 ]

        Merged revision(s) 1696147 from lucene/dev/trunk:
        LUCENE-6742: Lovins & Finnish implementation of SnowballFilter was fixed to behave exactly as specified. A bug in the snowball compiler caused differences in output of the filter in comparison to the original test data. In addition, the performance of those filters was improved significantly

        Show
        ASF subversion and git services added a comment - Commit 1696148 from Uwe Schindler in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1696148 ] Merged revision(s) 1696147 from lucene/dev/trunk: LUCENE-6742 : Lovins & Finnish implementation of SnowballFilter was fixed to behave exactly as specified. A bug in the snowball compiler caused differences in output of the filter in comparison to the original test data. In addition, the performance of those filters was improved significantly
        Hide
        Uwe Schindler added a comment -

        Thanks Robert for review!

        Show
        Uwe Schindler added a comment - Thanks Robert for review!

          People

          • Assignee:
            Uwe Schindler
            Reporter:
            Uwe Schindler
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development