Lucene - Core
LUCENE-3731

Create an analysis/uima module for UIMA-based tokenizers/analyzers

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      As discussed in SOLR-3013 the UIMA Tokenizers/Analyzer should be refactored out in a separate module (modules/analysis/uima) as they can be used in plain Lucene. Then the solr/contrib/uima will contain only the related factories.

      Attachments

      1. LUCENE-3731.patch
        114 kB
        Tommaso Teofili
      2. LUCENE-3731_speed.patch
        5 kB
        Robert Muir
      3. LUCENE-3731_speed.patch
        13 kB
        Robert Muir
      4. LUCENE-3731_speed.patch
        13 kB
        Robert Muir
      5. LUCENE-3731_rsrel.patch
        4 kB
        Tommaso Teofili
      6. LUCENE-3731_4.patch
        109 kB
        Tommaso Teofili
      7. LUCENE-3731_3.patch
        109 kB
        Tommaso Teofili
      8. LUCENE-3731_2.patch
        114 kB
        Tommaso Teofili

          Activity

          Tommaso Teofili added a comment -

          This patch adds a modules/analysis/uima module.

          Uwe Schindler added a comment -

          Hi,

          +      clearAttributes();
          +      AnnotationFS next = iterator.next();
          +      termAttr.setEmpty();
          +      termAttr.append(next.getCoveredText());
          +      termAttr.setLength(next.getCoveredText().length());
          

          Since you already call clearAttributes(), the length of termAttr is 0, so setEmpty() is not needed. termAttr.setLength() is also unnecessary, as append() already updates the length. All you need is termAttr.append(next.getCoveredText());
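
          So the loop body reduces to:

            clearAttributes();
            AnnotationFS next = iterator.next();
            termAttr.append(next.getCoveredText());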

          Uwe

          Tommaso Teofili added a comment -

          right Uwe, thanks so much for the quick review

          Robert Muir added a comment -

          Thanks for starting this Tommaso:

          I was unable to apply the patch (were there some svn-copies?)

          But I suggest in general using the BaseTokenStreamTestCase.assertTokenStreamContents/assertAnalyzesTo:
          e.g. instead of:

          // check that 'the big brown fox jumped on the wood' tokens have the expected PoS types
             String[] expectedPos = new String[]{"at", "jj", "jj", "nn", "vbd", "in", "at", "nn"};
             int i = 0;
             while (ts.incrementToken()) {
               assertNotNull(offsetAtt);
               assertNotNull(termAtt);
               assertNotNull(typeAttr);
               assertEquals(typeAttr.type(), expectedPos[i]);
               i++;
             }
          

          you could use:

             assertTokenStreamContents(ts, 
               new String[] { "the", "big", "brown", ... }, /* expected terms */
               new String[] { "at", "jj", "jj", ... }, /* expected types */
          

          There are also variants that let you supply expected start/end offsets, I think that would be good.
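
          For instance, a call checking offsets too could look roughly like this (a sketch; the exact argument order is assumed from the other assertTokenStreamContents variants):

             assertTokenStreamContents(ts,
               new String[] { "the", "big", "brown", ... },  /* expected terms */
               new int[]    {  0,  4,  8, ... },             /* expected start offsets */
               new int[]    {  3,  7, 13, ... },             /* expected end offsets */
               new String[] { "at", "jj", "jj", ... });      /* expected types */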

          Finally, to check for lots of other bugs (including thread-safety, compatibility with charfilters, etc),
          I would recommend:

            /** blast some random strings through the analyzer */
            public void testRandomStrings() throws Exception {
              Analyzer a = new Analyzer() {
          
                @Override
                protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
                  Tokenizer tokenizer = new MyTokenizer(reader);
                  return new TokenStreamComponents(tokenizer, tokenizer);
                } 
              };
              checkRandomData(random, a, 10000*RANDOM_MULTIPLIER);
            }
          

          If you look at BaseTokenStreamTestCase you will see all of these methods are insanely nitpicky
          and find all kinds of bugs in analysis components, so it will really help test coverage I think.

          Tommaso Teofili added a comment -

          Hey Robert, that's super, thanks! I'm going to collect your suggestions in a new patch shortly.

          Tommaso Teofili added a comment -

          Updated patch which incorporates Robert's suggestions.
          The random strings testing highlights some corner cases where the endOffset is not set correctly, probably due to the explicit Reader-to-String conversion in BaseUIMATokenizer, which needs to get rid of the line.separator property.

          New patch to fix the above will follow.

          Tommaso Teofili added a comment -

          Updated patch which fixes corner cases with wrong endOffsets.

          Robert Muir added a comment -

          Hi Tommaso, I think it would be cleaner to set the final offset in end() instead?

          Tommaso Teofili added a comment -

          Hi Tommaso, I think it would be cleaner to set the final offset in end() instead?

          ok, +1.

          Tommaso Teofili added a comment -

          patch with finalOffset setting in the end() method.

          Tommaso Teofili added a comment -

          I'm going to commit this one shortly if no one objects.

          Robert Muir added a comment -

          Thanks for factoring this out Tommaso.

          Tommaso Teofili added a comment -

          committed on trunk in r1244236

          Steve Rowe added a comment -

          Hi Tommaso,

          I just committed modifications to the IntelliJ IDEA and Maven configurations.

          Something strange is happening, though: one test method consistently fails under both IntelliJ and Maven: UIMABaseAnalyzerTest.testRandomStrings(). However, under Ant, this always succeeds, including with the seeds that fail under either IntelliJ or Maven. Also, under both IntelliJ and Maven, the following sequence is printed out literally thousands of times to STDERR (with increasing time stamps) - however, I don't see this at all under Ant:

          Feb 14, 2012 6:34:18 PM WhitespaceTokenizer initialize
          INFO: "Whitespace tokenizer successfully initialized"
          Feb 14, 2012 6:34:18 PM WhitespaceTokenizer typeSystemInit
          INFO: "Whitespace tokenizer typesystem initialized"
          Feb 14, 2012 6:34:18 PM WhitespaceTokenizer process
          INFO: "Whitespace tokenizer starts processing"
          Feb 14, 2012 6:34:18 PM WhitespaceTokenizer process
          INFO: "Whitespace tokenizer finished processing"
          

          Here are two different example failures, from Maven - they seem to have different causes, which is baffling:

          The following exceptions were thrown by threads:
          *** Thread: Thread-1 ***
          java.lang.RuntimeException: java.io.IOException: org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator processing failed.    
                  at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:289)
          Caused by: java.io.IOException: org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator processing failed.    
                  at org.apache.lucene.analysis.uima.UIMAAnnotationsTokenizer.incrementToken(UIMAAnnotationsTokenizer.java:73)
                  at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:333)
                  at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:295)
                  at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:287)
          Caused by: org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator processing failed.    
                  at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:391)
                  at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:295)
                  at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:567)
                  at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.<init>(ASB_impl.java:409)
                  at org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:342)
                  at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:267)
                  at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:267)
                  at org.apache.lucene.analysis.uima.BaseUIMATokenizer.analyzeInput(BaseUIMATokenizer.java:57)
                  at org.apache.lucene.analysis.uima.UIMAAnnotationsTokenizer.analyzeText(UIMAAnnotationsTokenizer.java:61)
                  at org.apache.lucene.analysis.uima.UIMAAnnotationsTokenizer.incrementToken(UIMAAnnotationsTokenizer.java:71)
                  ... 3 more
          Caused by: java.lang.NullPointerException
                  at org.apache.uima.impl.UimaContext_ImplBase$ComponentInfoImpl.mapToSofaID(UimaContext_ImplBase.java:655)
                  at org.apache.uima.cas.impl.CASImpl.getView(CASImpl.java:2646)
                  at org.apache.uima.jcas.impl.JCasImpl.getView(JCasImpl.java:1415)
                  at org.apache.uima.examples.tagger.HMMTagger.process(HMMTagger.java:250)
                  at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
                  at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:377)
                  ... 12 more
          *** Thread: Thread-2 ***
          java.lang.AssertionError: token 0 does not exist
                  at org.junit.Assert.fail(Assert.java:93)
                  at org.junit.Assert.assertTrue(Assert.java:43)
                  at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:121)
                  at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:371)
                  at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:295)
                  at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:287)
          NOTE: reproduce with: ant test -Dtestcase=UIMABaseAnalyzerTest -Dtestmethod=testRandomStrings(org.apache.lucene.analysis.uima.UIMABaseAnalyzerTest) -Dtests.seed=-1dad4a7ede576939:-f9f5c77dffb3eb0:607bf59bf7da50eb -Dargs="-Dfile.encoding=Cp1252"
          
          The following exceptions were thrown by threads:
          *** Thread: Thread-5 ***
          java.lang.RuntimeException: java.io.IOException: org.apache.uima.analysis_engine.AnalysisEngineProcessException
                  at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:289)
          Caused by: java.io.IOException: org.apache.uima.analysis_engine.AnalysisEngineProcessException
                  at org.apache.lucene.analysis.uima.UIMAAnnotationsTokenizer.incrementToken(UIMAAnnotationsTokenizer.java:73)
                  at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:121)
                  at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:371)
                  at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:295)
                  at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:287)
          Caused by: org.apache.uima.analysis_engine.AnalysisEngineProcessException
                  at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:701)
                  at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.<init>(ASB_impl.java:409)
                  at org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:342)
                  at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:267)
                  at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:267)
                  at org.apache.lucene.analysis.uima.BaseUIMATokenizer.analyzeInput(BaseUIMATokenizer.java:57)
                  at org.apache.lucene.analysis.uima.UIMAAnnotationsTokenizer.analyzeText(UIMAAnnotationsTokenizer.java:61)
                  at org.apache.lucene.analysis.uima.UIMAAnnotationsTokenizer.incrementToken(UIMAAnnotationsTokenizer.java:71)
                  ... 4 more
          Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 2
                  at java.util.ArrayList.RangeCheck(ArrayList.java:547)
                  at java.util.ArrayList.get(ArrayList.java:322)
                  at org.apache.uima.flow.impl.FixedFlowController$FixedFlowObject.next(FixedFlowController.java:222)
                  at org.apache.uima.analysis_engine.asb.impl.FlowContainer.next(FlowContainer.java:100)
                  at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:546)
                  ... 11 more
          *** Thread: Thread-7 ***
          java.lang.RuntimeException: java.io.IOException: org.apache.uima.analysis_engine.AnalysisEngineProcessException
                  at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:289)
          Caused by: java.io.IOException: org.apache.uima.analysis_engine.AnalysisEngineProcessException
                  at org.apache.lucene.analysis.uima.UIMAAnnotationsTokenizer.incrementToken(UIMAAnnotationsTokenizer.java:73)
                  at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:121)
                  at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:371)
                  at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:295)
                  at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:287)
          Caused by: org.apache.uima.analysis_engine.AnalysisEngineProcessException
                  at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:701)
                  at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.<init>(ASB_impl.java:409)
                  at org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:342)
                  at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:267)
                  at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:267)
                  at org.apache.lucene.analysis.uima.BaseUIMATokenizer.analyzeInput(BaseUIMATokenizer.java:57)
                  at org.apache.lucene.analysis.uima.UIMAAnnotationsTokenizer.analyzeText(UIMAAnnotationsTokenizer.java:61)
                  at org.apache.lucene.analysis.uima.UIMAAnnotationsTokenizer.incrementToken(UIMAAnnotationsTokenizer.java:71)
                  ... 4 more
          Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 2
                  at java.util.ArrayList.RangeCheck(ArrayList.java:547)
                  at java.util.ArrayList.get(ArrayList.java:322)
                  at org.apache.uima.flow.impl.FixedFlowController$FixedFlowObject.next(FixedFlowController.java:222)
                  at org.apache.uima.analysis_engine.asb.impl.FlowContainer.next(FlowContainer.java:100)
                  at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:546)
                  ... 11 more
          *** Thread: Thread-6 ***
          java.lang.AssertionError: end of stream
                  at org.junit.Assert.fail(Assert.java:93)
                  at org.junit.Assert.assertTrue(Assert.java:43)
                  at org.junit.Assert.assertFalse(Assert.java:68)
                  at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:148)
                  at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:371)
                  at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:295)
                  at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:287)
          *** Thread: Thread-4 ***
          org.junit.ComparisonFailure: term 8 expected:<-[]> but was:<-[- f(]>
                  at org.junit.Assert.assertEquals(Assert.java:125)
                  at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:124)
                  at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:371)
                  at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:295)
                  at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:287)
          NOTE: reproduce with: ant test -Dtestcase=UIMABaseAnalyzerTest -Dtestmethod=testRandomStrings(org.apache.lucene.analysis.uima.UIMABaseAnalyzerTest) -Dtests.seed=2be0c24a1df9b25e:-42f203968285c6ed:5f8c85cdbae32724 -Dargs="-Dfile.encoding=Cp1252"
          
          Tommaso Teofili added a comment - edited

          Thank you very much Steven for reporting.

          The

          Feb 14, 2012 6:34:18 PM WhitespaceTokenizer initialize
          INFO: "Whitespace tokenizer successfully initialized"
          Feb 14, 2012 6:34:18 PM WhitespaceTokenizer typeSystemInit
          INFO: "Whitespace tokenizer typesystem initialized"
          

          messages are due to the UIMA WhitespaceTokenizer annotator, which logs its initialization/processing/etc. calls.
          They are printed out many times because the testRandomStrings test method runs lots of tricky tests on the UIMABaseAnalyzer, which require the above calls to be executed repeatedly.

          I'll take a look at the other failures, which didn't show up in the tests I had run until now.

          Tommaso Teofili added a comment -

          OK, I noticed this was due to an issue on the UIMA side.
          I think the best option (as those are used just for testing) is to use dummy implementations of both the UIMA-based whitespace tokenizer and the PoS tagger, thus also avoiding the log lines when executing tests with Maven.

          Tommaso Teofili added a comment -

          fix for the issues reported by Steven committed in r1244474

          Robert Muir added a comment -

          Hi Tommaso, I noticed some of the tests here are really slow... digging in I think we have some perf issues:

          junit-sequential:
             [junit] Testsuite: org.apache.lucene.analysis.uima.UIMABaseAnalyzerTest
             [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 52.302 sec
             [junit]
             [junit] Testsuite: org.apache.lucene.analysis.uima.UIMATypeAwareAnalyzerTest
             [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 38.506 sec
             [junit]
             [junit] Testsuite: org.apache.lucene.analysis.uima.ae.BasicAEProviderTest
             [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.02 sec
             [junit]
             [junit] Testsuite: org.apache.lucene.analysis.uima.ae.OverridingParamsAEProviderTest
             [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 1.042 sec
             [junit]
          

          See the patch below (which cannot be committed yet)... currently I had to add synchronization around the analysis engine, but it avoids recreating the CAS for each document, which seems to be the cause of the perf problems:

          junit-sequential:
              [junit] Testsuite: org.apache.lucene.analysis.uima.UIMABaseAnalyzerTest
              [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 3.858 sec
              [junit] 
              [junit] Testsuite: org.apache.lucene.analysis.uima.UIMATypeAwareAnalyzerTest
              [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.731 sec
              [junit] 
              [junit] Testsuite: org.apache.lucene.analysis.uima.ae.BasicAEProviderTest
              [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.009 sec
              [junit] 
              [junit] Testsuite: org.apache.lucene.analysis.uima.ae.OverridingParamsAEProviderTest
              [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 1.058 sec
              [junit] 
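
          For reference, the reuse pattern is roughly this (a sketch of the idea, not the patch itself; field names are assumed):

             // created once per tokenizer, instead of once per document
             AnalysisEngine ae = aeProvider.getAE();
             CAS cas = ae.newCAS();

             // per document: clear and refill the same CAS
             cas.reset();
             cas.setDocumentText(toString(input));
             synchronized (ae) {  // the shared AE is not thread-safe, hence the sync
               ae.process(cas);
             }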
          
          Tommaso Teofili added a comment -

          Hi Robert,
          reusing the CAS is good. As you note in the patch, we need to take care of how to let each tokenizer instance get its own AE; in the previous Solr version, core names were used to cache and fetch AEs.
          As said on dev@, we may start with letting each tokenizer have its own AE and then improve the design once concurrency is fixed.
          I'm doing tests with other types of UIMA flow controllers; right now the WhiteboardFlowController seems to behave slightly better.

          Robert Muir added a comment -

          Tommaso, I will make another prototype patch trying this approach.

          In my opinion the caching done in BasicAEProvider/OverridingParamsAEProvider would still be useful even with
          allowing each tokenstream to have a new AE, because we would just cache the description itself
          (so we e.g. only parse the XML a single time), but return a new AE each time... then we could remove the
          synchronized and still avoid a 'heavy' construction for first-time initialization of a new thread
          (after that, the tokenstream is reused, so there is no issue).

          I'll see how it goes and upload a patch if I can make it look nice.

          Robert Muir added a comment -

          updated patch:

          • each tokenstream gets its own AnalysisEngine instance
          • providers only cache the description itself, but return new instances.
          • reuse logic factored into BaseUIMATokenizer class
          • release CAS/AE resources in BaseUIMATokenizer.close()

          Now there are only a few nocommits, in the providers, but these are mostly nitpicks and fairly minor.
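
          For illustration, the description-caching provider approach could look roughly like this (a sketch, not the committed code; the configureDescription() hook is hypothetical):

            import org.apache.uima.UIMAFramework;
            import org.apache.uima.analysis_engine.AnalysisEngine;
            import org.apache.uima.analysis_engine.AnalysisEngineDescription;
            import org.apache.uima.util.XMLInputSource;

            class BasicAEProvider {  // sketch only, not the committed class
              private final String aePath;
              private AnalysisEngineDescription cachedDescription;

              BasicAEProvider(String aePath) {
                this.aePath = aePath;
              }

              public synchronized AnalysisEngine getAE() throws Exception {
                if (cachedDescription == null) {
                  // parse the XML descriptor only once
                  XMLInputSource in = new XMLInputSource(getClass().getResource(aePath));
                  try {
                    cachedDescription = UIMAFramework.getXMLParser()
                        .parseAnalysisEngineDescription(in);
                  } finally {
                    in.close();
                  }
                  configureDescription(cachedDescription);
                }
                // a fresh, unshared engine for each token stream
                return UIMAFramework.produceAnalysisEngine(cachedDescription);
              }

              // hypothetical hook for subclasses to customize the cached description
              protected void configureDescription(AnalysisEngineDescription description) {
              }
            }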

          Robert Muir added a comment -

          OK, new patch with more refactoring... no more nocommits.

          Now the OverridingParams provider just extends the base one, but sets configuration (see the sketch below).

          Also, the XML InputStream is closed.
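
          Building on the provider sketch above, the OverridingParams variant would then reduce to something like this (again a sketch with hypothetical names; ConfigurationParameterSettings.setParameterValue() is the UIMA API for overriding declared parameter values):

            import java.util.Map;

            class OverridingParamsAEProvider extends BasicAEProvider {  // sketch only
              private final Map<String, Object> runtimeParameters;

              OverridingParamsAEProvider(String aePath, Map<String, Object> params) {
                super(aePath);
                this.runtimeParameters = params;
              }

              @Override
              protected void configureDescription(AnalysisEngineDescription description) {
                // override each declared parameter value on the cached description
                for (Map.Entry<String, Object> entry : runtimeParameters.entrySet()) {
                  description.getAnalysisEngineMetaData()
                      .getConfigurationParameterSettings()
                      .setParameterValue(entry.getKey(), entry.getValue());
                }
              }
            }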

          Tommaso Teofili added a comment -

          Thanks Robert for taking care of this, nice improvement!
          I agree on the OverridingParams provider extending the base one; it was also my intent to do that.

          Robert Muir added a comment -

          OK, if there is no objection I will commit this one.

          I think it will fix the Jenkins failures... of course sometimes it takes
          a few days of Jenkins chewing on it to be sure.

          Tommaso Teofili added a comment -

          OK, if there is no objection I will commit this one.

          +1. I'll post my progress later on other possible performance improvements I'm testing.

          Robert Muir added a comment -

          Thanks Tommaso: I committed this.

          Also a tiny change to the end() methods:

             public void end() throws IOException {
          -    if (offsetAttr.endOffset() < finalOffset)
          -      offsetAttr.setOffset(finalOffset, finalOffset);
          +    offsetAttr.setOffset(finalOffset, finalOffset);
               super.end();
             }
          

          Unless there is a bug, we should not need the if...
          I'm not sure if we should be reading the attribute values at this
          stage, or if that's even defined; and if endOffset is somehow
          past the reader's final offset, well, we are already in trouble.

          I ran the tests many times and with -Dtests.multiplier=100 and there
          were no issues.

          Tommaso Teofili added a comment -

          Right, everything seems OK now.
          I also tried commenting out the

          <property name="tests.threadspercpu" value="0" />
          

          line in build.xml in order to execute tests in parallel.
          Multiple parallel test executions, also with -Dtests.multiplier=100, passed flawlessly on Java 6; I'll see if that is the case for Java 7 too.

          Tommaso Teofili added a comment -

          Some performance improvement came from releasing the CAS and AE in the close() call:

            @Override
            public void close() throws IOException {
              super.close();
              // release UIMA resources
              cas.release();
              ae.destroy();
            }
          

          Now investigating the use of CasPool for improving throughput in high-usage scenarios.

          Robert Muir added a comment -

          Is that safe to do in Tokenizer.close()?

          Because Tokenizer.close() is misleading/confusing, the instance is still reused after
          this for subsequent documents... in other words Tokenizer.close() closes resources like
          the Reader itself... it just happens to be that CAS/AE don't complain about you
          continuing to use them after they are release()'ed/destroy()'ed

          Tommaso Teofili added a comment - edited

          Because Tokenizer.close() is misleading/confusing, the instance is still reused after this for subsequent documents.

          When I call close(), it looks like the correct way to reuse that Tokenizer instance is by calling reset(someOtherInput) before doing anything else. So, after adding

          assert reader != null : "input has been closed, please reset it";
          

          as the first line inside the toString(Reader reader) method in BaseUIMATokenizer, I tried this test:

          
            @Test
            public void testSetReaderAndClose() throws Exception {
              StringReader input = new StringReader("the big brown fox jumped on the wood");
              Tokenizer t = new UIMAAnnotationsTokenizer("/uima/AggregateSentenceAE.xml", "org.apache.uima.TokenAnnotation", input);
              assertTokenStreamContents(t, new String[]{"the", "big", "brown", "fox", "jumped", "on", "the", "wood"});
              t.close();
              try {
                t.incrementToken();
                fail("should've been failing as reader is not set");
              } catch (AssertionError error) {
                // ok
              }
              input = new StringReader("hi oh my");
              t = new UIMAAnnotationsTokenizer("/uima/TestAggregateSentenceAE.xml", "org.apache.lucene.uima.ts.TokenAnnotation", input);
              assertTrue("should've been incremented ", t.incrementToken());
              t.close();
              try {
                t.incrementToken();
                fail("should've been failing as reader is not set");
              } catch (AssertionError error) {
                // ok
              }
              t.reset(new StringReader("hey what do you say"));
              assertTrue("should've been incremented ", t.incrementToken());
            }
          
          

          and it looks to me like it's behaving correctly.
          Still working on improving it and trying to catch possible corner cases.

          Robert Muir added a comment -

          Right, after you reset(Reader) you set a new reader.

          But the question is: is it safe to use CAS/AE after you call release()/destroy() on them?

          Because close() is called on tokenstreams after each invocation, in other words:

          Tokenizer t = new Tokenizer(reader);
          ... stuff ...
          t.close();
          t.reset(someOtherReader);
          .. stuff ...
          t.close();
          

          So what does CAS.release() really mean? If it means you should not use the CAS again afterwards,
          then we cannot have it in TokenStream.close(), and same with AE.destroy()

          Tommaso Teofili added a comment -

          But the question is: is it safe to use CAS/AE after you call release()/destroy() on them?

          No, it isn't; so you're right: those methods should not be inside the close() method.

          Tommaso Teofili added a comment -

          After some more testing, I think the CasPool is good just for scenarios where the pool serves different CASes to different clients (the tokenizers), so it's not really helpful in the current implementation; however, it may be useful if we abstract the operation of obtaining and releasing a CAS outside the BaseTokenizer.

          In the meantime I noticed the AEProviderFactory getAEProvider() methods have a keyPrefix parameter that came from the Solr implementation and was intended to hold the core name, so at the moment I think it'd be better to (also) have methods which don't need that parameter for the Lucene use cases.
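
          For reference, a CasPool-based flow would look roughly like this (a sketch; pool size and timeout are illustrative):

            CasPool casPool = new CasPool(4, ae);  // 4 pooled CAS instances
            CAS cas = casPool.getCas(0);           // 0 = wait until a CAS is free
            try {
              cas.setDocumentText(text);
              ae.process(cas);
              // ... iterate over the resulting annotations ...
            } finally {
              casPool.releaseCas(cas);  // return it to the pool instead of destroying it
            }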

          Tommaso Teofili added a comment -

          The two methods analyzeText() and analyzeInput() are confusing, so the first one should just be renamed to initializeIterator(), as its main purpose is to prepare the FSIterator which holds the annotations that will be used inside the incrementToken() method.

          Tommaso Teofili added a comment -

          I think we can mark this one as resolved; I'd just keep it only on trunk and backport the whole thing to 3.x once SOLR-3013 is resolved and committed to trunk too.


  People

  • Assignee: Tommaso Teofili
  • Reporter: Tommaso Teofili