PyLucene / PYLUCENE-12

Add PythonReusableAnalyzerBase, so we can create analyzers in Python

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Labels:
      None

      Description

      Lucene now has a useful helper class, ReusableAnalyzerBase: you subclass it and override one method, createComponents, to create an analyzer that provides a reusableTokenStream implementation.

      I think we should expose it in Python... patch is simple.
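
The reuse pattern described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the PyLucene API: `ReusableAnalyzerSketch`, `TokenStreamComponents`, `create_components` and `reusable_token_stream` are hypothetical stand-ins, and the per-thread caching is an assumption about how such a base class might reuse its components.

```python
import threading


class TokenStreamComponents:
    """Hypothetical stand-in for the (source, sink) pair an analyzer builds."""

    def __init__(self, source, sink):
        self.source = source
        self.sink = sink

    def reset(self, reader):
        # Point the existing chain at new input instead of rebuilding it.
        self.source["reader"] = reader


class ReusableAnalyzerSketch:
    """Hypothetical base class: subclasses override only create_components()."""

    def __init__(self):
        self._local = threading.local()  # one cached chain per thread (assumption)

    def create_components(self, field_name, reader):
        raise NotImplementedError

    def reusable_token_stream(self, field_name, reader):
        components = getattr(self._local, "components", None)
        if components is None:
            # First call on this thread: build the chain once and cache it.
            components = self.create_components(field_name, reader)
            self._local.components = components
        else:
            # Later calls: reuse the cached chain with the new reader.
            components.reset(reader)
        return components.sink


class MyAnalyzer(ReusableAnalyzerSketch):
    def __init__(self):
        super().__init__()
        self.builds = 0  # counts how often the chain is actually built

    def create_components(self, field_name, reader):
        self.builds += 1
        source = {"reader": reader}
        return TokenStreamComponents(source, sink=source)


analyzer = MyAnalyzer()
first = analyzer.reusable_token_stream("test", "reader one")
second = analyzer.reusable_token_stream("test", "reader two")
```

In this sketch the subclass supplies a single method, and the base class decides when to build and when to reuse.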

      1. PYLUCENE-12.patch
        4 kB
        Michael McCandless
      2. PYLUCENE-12.patch
        4 kB
        Michael McCandless

        Activity

        Michael McCandless added a comment -

        RE the exception inside createComponents... strange! Your exception indeed has all the details (ie, shows the original traceback, from the createComponents method).

        Yet, when I make exactly that change (stick the x in, then run the test case directly), I get this:

        ======================================================================
        ERROR: testReusable (__main__.ReusableAnalyzerBaseTestCase)
        ----------------------------------------------------------------------
        Traceback (most recent call last):
          File "test/test_ReusableAnalyzerBase.py", line 36, in testReusable
            stream = method("test", reader)
        JavaError: java.lang.RuntimeException: NameError
            Java stacktrace:
        java.lang.RuntimeException: NameError
        	at org.apache.pylucene.analysis.PythonReusableAnalyzerBase.createComponents(Native Method)
        	at org.apache.lucene.analysis.ReusableAnalyzerBase.reusableTokenStream(ReusableAnalyzerBase.java:73)

        I.e., for some reason, I don't get the traceback from the createComponents method; all I see is that a NameError happened, not which name was undefined or which lines of Python source raised it.

        I'm on Linux, Python 64 bit, Java 1.6.0_21... I wonder if I somehow compiled things incorrectly? Odd.

        Michael McCandless added a comment -

        Re not SEGVing if you fail to call super ... OK, if we can't find a non-costly way to do it, let's not!

        Michael McCandless added a comment -

        Sorry, could you also add this method to PythonReusableAnalyzerBase.java (I missed it in my first patch):

        @Override
        public native Reader initReader(Reader reader);

        Separately: how do we turn on Jira's markup and comment previews here?

        Andi Vajda added a comment -

        rev 1209356, thanks Mike !

        Andi Vajda added a comment -

        About the lack of information in the stacktrace, I added a random x into the createComponents method and I'm getting this:

        Traceback (most recent call last):
          File "test/test_ReusableAnalyzerBase.py", line 36, in testReusable
            stream = method("test", reader)
        JavaError: org.apache.jcc.PythonException: global name 'xfirst' is not defined
        Traceback (most recent call last):
          File "test/test_ReusableAnalyzerBase.py", line 24, in createComponents
            last = StopFilter(Version.LUCENE_CURRENT, xfirst, StopAnalyzer.ENGLISH_STOP_WORDS_SET)
        NameError: global name 'xfirst' is not defined
        
            Java stacktrace:
        org.apache.jcc.PythonException: global name 'xfirst' is not defined
        Traceback (most recent call last):
          File "test/test_ReusableAnalyzerBase.py", line 24, in createComponents
            last = StopFilter(Version.LUCENE_CURRENT, xfirst, StopAnalyzer.ENGLISH_STOP_WORDS_SET)
        NameError: global name 'xfirst' is not defined
        
        	at org.apache.pylucene.analysis.PythonReusableAnalyzerBase.createComponents(Native Method)
        	at org.apache.lucene.analysis.ReusableAnalyzerBase.reusableTokenStream(ReusableAnalyzerBase.java:73)
        

        Seems like plenty of detail to me. What do you think is missing?
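
The difference between the two reports in this thread (a bare "NameError" versus the full Python traceback) can be reproduced in plain Python. This is a minimal sketch and does not show how JCC builds its exceptions: `WrappedError`, `call_type_only` and `call_with_traceback` are hypothetical names, and only the standard `traceback` module is used.

```python
import traceback


class WrappedError(Exception):
    """Hypothetical stand-in for the exception surfaced to the caller."""


def create_components():
    return undefined_name  # deliberately undefined: raises NameError


def call_type_only(callback):
    try:
        return callback()
    except Exception as exc:
        # Only the exception type survives; message and traceback are lost.
        raise WrappedError(type(exc).__name__) from None


def call_with_traceback(callback):
    try:
        return callback()
    except Exception as exc:
        # Keep the message and the formatted Python traceback.
        raise WrappedError("%s\n%s" % (exc, traceback.format_exc())) from None


def message_of(wrapper):
    try:
        wrapper(create_components)
    except WrappedError as err:
        return str(err)


terse = message_of(call_type_only)
detailed = message_of(call_with_traceback)
```

Here `terse` carries only the exception type, while `detailed` also names the undefined variable and the function that raised it.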

        Andi Vajda added a comment -

        You say: "I know we document that you must call super (http://lucene.apache.org/pylucene/jcc/documentation/readme.html#extensions), but, can we make this throw an exception instead of SEGV, to be more friendly? Or is that hard...?"

        It's not hard, just costly. Everywhere the wrapped pointer is used, it must be checked. It's like checking for a missing call to initVM() or attachCurrentThread(). It took a while to find the right way to do this that didn't involve checking these all the time.
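
To illustrate where the cost comes from, here is a minimal plain-Python sketch of such a guard. It is an analogy only, not JCC's generated code: `WrapperSketch`, `_checked_pointer` and the two example methods are hypothetical, and `_pointer` stands in for the wrapped Java object.

```python
class WrapperSketch:
    """Hypothetical stand-in for a wrapper holding a pointer to a Java object."""

    def __init__(self):
        self._pointer = object()  # set only by the base constructor

    def _checked_pointer(self):
        # The guard: it has to run on every access to the wrapped pointer.
        pointer = self.__dict__.get("_pointer")
        if pointer is None:
            raise RuntimeError("base __init__ was never called")
        return pointer

    def token_stream(self):
        return ("stream", self._checked_pointer())

    def close(self):
        return ("closed", self._checked_pointer())


class Good(WrapperSketch):
    def __init__(self):
        super().__init__()


class Forgetful(WrapperSketch):
    def __init__(self):
        pass  # fails to call the base constructor


def try_call(obj):
    try:
        obj.token_stream()
        return "ok"
    except RuntimeError as err:
        return str(err)


good_result = try_call(Good())
bad_result = try_call(Forgetful())
```

Every method that touches the pointer pays for the check, which is the overhead being weighed against a friendlier error.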

        Michael McCandless added a comment -

        One small fix to the patch: we also must add this:

        @Override
        public native Reader initReader(Reader reader);

        This lets a Python-defined analyzer provide a CharReader/Filter as well.
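
As a rough plain-Python analogy of what such a hook allows, the sketch below wraps the incoming reader before tokenizing. It is not the Lucene API: `AnalyzerSketch`, `init_reader` and `tokens` are hypothetical names, and the tag-stripping filter is invented purely for illustration.

```python
import io


class AnalyzerSketch:
    """Hypothetical analyzer whose init_reader hook defaults to a no-op."""

    def init_reader(self, reader):
        return reader

    def tokens(self, reader):
        # The hook runs first, so subclasses can filter the raw characters.
        reader = self.init_reader(reader)
        return reader.read().split()


class StripTagsAnalyzer(AnalyzerSketch):
    def init_reader(self, reader):
        # Illustrative character filter: drop bold tags before tokenizing.
        text = reader.read().replace("<b>", "").replace("</b>", "")
        return io.StringIO(text)


plain = AnalyzerSketch().tokens(io.StringIO("<b>hello</b> world"))
filtered = StripTagsAnalyzer().tokens(io.StringIO("<b>hello</b> world"))
```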

        Michael McCandless added a comment -

        Hmm, one more unfriendliness: if the createComponents method throws an exception (e.g., put xxx in there so you hit a NameError), you get back an exception like this:

        ======================================================================
        ERROR: testReusable (__main__.ReusableAnalyzerBaseTestCase)
        ----------------------------------------------------------------------
        Traceback (most recent call last):
          File "test/test_ReusableAnalyzerBase.py", line 37, in testReusable
            stream = method("test", reader)
        JavaError: java.lang.RuntimeException: NameError
            Java stacktrace:
        java.lang.RuntimeException: NameError
        	at org.apache.pylucene.analysis.PythonReusableAnalyzerBase.createComponents(Native Method)
        	at org.apache.lucene.analysis.ReusableAnalyzerBase.reusableTokenStream(ReusableAnalyzerBase.java:73)
        

        Somehow this is missing the details (exception cause and traceback) of the Python source that caused the exception. Can we fix this? If it's tricky, I can open a new issue.

        Michael McCandless added a comment -

        I noticed one unfriendliness here: if I modify the MyAnalyzer class (in test_ReusableAnalyzerBase.py), adding an empty ctor (def __init__) that fails to call super's __init__, then I get a SEGV.

        I know we document that you must call super (http://lucene.apache.org/pylucene/jcc/documentation/readme.html#extensions), but, can we make this throw an exception instead of SEGV, to be more friendly? Or is that hard...?

        Michael McCandless added a comment -

        New patch (just fixes my indentation screwup from last one).

        Michael McCandless added a comment -

        Patch w/ basic test.


          People

          • Assignee:
            Unassigned
            Reporter:
            Michael McCandless