SOLR-3013: Add UIMA based tokenizers / filters that can be used in the schema.xml

    Details

      Description

      Add UIMA-based tokenizers / filters that can be declared and used directly inside schema.xml.
      Thus, instead of using the UIMA UpdateRequestProcessor, one could directly define per-field NLP-capable tokenizers / filters.
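
      A minimal sketch of what such a per-field declaration might look like in schema.xml. The factory class name is the one that appears in the build log further down this issue; the field type name, the field name, the descriptor path, the annotation type, and the attribute names descriptorPath and tokenType are assumptions for illustration and should be checked against the committed factories.

      ```xml
      <!-- Hypothetical field type: name, paths, and attribute names are assumptions -->
      <fieldType name="uima_sentences" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
          <!-- Tokens come from UIMA annotations rather than from a character-based tokenizer -->
          <tokenizer class="org.apache.solr.uima.analysis.UIMAAnnotationsTokenizerFactory"
                     descriptorPath="/uima/AggregateSentenceAE.xml"
                     tokenType="org.apache.uima.SentenceAnnotation"/>
        </analyzer>
      </fieldType>

      <!-- Hypothetical field using the type above -->
      <field name="sentences" type="uima_sentences" indexed="true" stored="true"/>
      ```

      A field declared with such a type would be analyzed by the UIMA pipeline at index time, with no UpdateRequestProcessor configuration involved.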

      1. SOLR-3013.patch
        127 kB
        Tommaso Teofili

        Issue Links

          Activity

          Tommaso Teofili created issue -
          Tommaso Teofili added a comment -

          patch overview:

          • moved the 'ae' package out of 'processor' package since it's to be used by tokenizers too
          • created an 'analysis' package which contains tokenizers/analyzers/tokenizerfactories
          • updated the 'Introduction' section inside CHANGES.txt

          The UIMAAnnotationsTokenizer creates tokens from the UIMA annotations produced over the input Reader.
          The UIMATypeAwareAnnotationsTokenizer does the same and also sets the TypeAttribute of each token according to the specified UIMA FeaturePath.
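
          A sketch of how the type-aware variant might be wired up in schema.xml. The factory class name is taken from the build log further down this issue; the attribute names (descriptorPath, tokenType, featurePath), the posTag feature, the descriptor path, and the stoptypes.txt file are assumptions chosen for illustration. Because each token carries a TypeAttribute, a type-based filter such as Solr's TypeTokenFilterFactory could then drop or keep tokens by type.

          ```xml
          <!-- Hypothetical field type: name, paths, feature, and attribute names are assumptions -->
          <fieldType name="uima_typed_tokens" class="solr.TextField" positionIncrementGap="100">
            <analyzer>
              <!-- Each token's type is read from the given feature of the UIMA annotation -->
              <tokenizer class="org.apache.solr.uima.analysis.UIMATypeAwareAnnotationsTokenizerFactory"
                         descriptorPath="/uima/AggregateSentenceAE.xml"
                         tokenType="org.apache.uima.TokenAnnotation"
                         featurePath="posTag"/>
              <!-- Filters tokens whose type is listed in the (hypothetical) stoptypes.txt -->
              <filter class="solr.TypeTokenFilterFactory" types="stoptypes.txt"/>
            </analyzer>
          </fieldType>
          ```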

          Tommaso Teofili made changes -
          Field Original Value New Value
          Attachment SOLR-3013.patch [ 12511488 ]
          Tommaso Teofili made changes -
          Assignee Tommaso Teofili [ teofili ]
          Tommaso Teofili added a comment -

          If no one objects I'll commit this shortly.

          Chris Male added a comment -

          Hey Tommaso,

          Did a quick glance over the patch. Couple of things:

          • Could UIMATypeAwareAnalyzerTest (and any other Analyzer/Tokenizer tests) use BaseTokenStreamTestCase? It has some useful utility methods to verify that your Analyzer works as expected
          • UIMABaseAnalyzerTest could do the same, and could probably make use of newDirectory() etc to handle some of the boilerplate
          Robert Muir added a comment -

          in addition to what Chris said:

          • it looks like some correctOffset() calls etc. are missing (BaseTokenStreamTestCase.checkRandomData would likely detect these)
          • the analysis components look as if they might be able to work with lucene too... maybe we could refactor the
            Tokenizer/Analyzer/etc in a new modules/analysis/uima that depends on uima? And Solr uima module would
            provide the factories to integrate
          Chris Male added a comment -

          > the analysis components look as if they might be able to work with lucene too... maybe we could refactor the
          > Tokenizer/Analyzer/etc in a new modules/analysis/uima that depends on uima? And Solr uima module would
          > provide the factories to integrate

          I absolutely agree.

          Tommaso Teofili added a comment -

          Chris, Robert, thanks for your comments; I'll integrate your suggestions in a new patch.
          I agree with the module proposal, as it was part of a follow-up issue/discussion I was going to raise.
          Maybe I can create a new issue in Lucene for a new module under modules/analysis/uima containing just the Lucene UIMA tokenizers, and then create a new patch for this issue which contains only the factories.

          Chris Male added a comment -

          +1, Go for it.

          Tommaso Teofili made changes -
          Link This issue depends on LUCENE-3731 [ LUCENE-3731 ]
          Tommaso Teofili added a comment -

          Considering the refactoring needed to put the tokenizers/analyzers in a dedicated Lucene analysis module, I think the 'ae' package for creating AnalysisEngines should be moved to that module as well, so that there is a common mechanism for instantiating AnalysisEngines in both Lucene and Solr.

          Tommaso Teofili added a comment -

          Now that LUCENE-3731 has been resolved I'll proceed with adding the needed factories for the Tokenizers in Solr.

          tommaso committed 1295330 (16 files)

          [SOLR-3013] - removing the ae package from Solr as it's now under analysis/uima module, adding the Solr factories for UIMA based tokenizers

          Tommaso Teofili added a comment -

          Solr factories committed in r1295330

          Steve Rowe added a comment -

          Javadocs errors found on Jenkins, I think related to your commit, Tommaso? - from https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/12565/consoleText:

            [javadoc] Constructing Javadoc information...
            [javadoc] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/contrib/uima/src/java/org/apache/solr/uima/analysis/UIMATypeAwareAnnotationsTokenizerFactory.java:21: package org.apache.lucene.analysis.uima does not exist
            [javadoc] import org.apache.lucene.analysis.uima.UIMATypeAwareAnnotationsTokenizer;
            [javadoc]                                       ^
            [javadoc] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/contrib/uima/src/java/org/apache/solr/uima/analysis/UIMAAnnotationsTokenizerFactory.java:21: package org.apache.lucene.analysis.uima does not exist
            [javadoc] import org.apache.lucene.analysis.uima.UIMAAnnotationsTokenizer;
            [javadoc]                                       ^
            [javadoc] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/contrib/uima/src/java/org/apache/solr/uima/processor/UIMAUpdateRequestProcessor.java:26: package org.apache.lucene.analysis.uima.ae does not exist
            [javadoc] import org.apache.lucene.analysis.uima.ae.AEProvider;
            [javadoc]                                          ^
            [javadoc] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/contrib/uima/src/java/org/apache/solr/uima/processor/UIMAUpdateRequestProcessor.java:27: package org.apache.lucene.analysis.uima.ae does not exist
            [javadoc] import org.apache.lucene.analysis.uima.ae.AEProviderFactory;
            [javadoc]                                          ^
            [javadoc] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/contrib/uima/src/java/org/apache/solr/uima/processor/UIMAUpdateRequestProcessor.java:51: cannot find symbol
            [javadoc] symbol  : class AEProvider
            [javadoc] location: class org.apache.solr.uima.processor.UIMAUpdateRequestProcessor
            [javadoc]   private AEProvider aeProvider;
            [javadoc]           ^
            [javadoc] Standard Doclet version 1.6.0
            [javadoc] Building tree for all the packages and classes...
            [javadoc] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/contrib/uima/src/java/org/apache/solr/uima/analysis/UIMAAnnotationsTokenizerFactory.java:30: warning - Tag @link: reference not found: UIMAAnnotationsTokenizer
            [javadoc] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/contrib/uima/src/java/org/apache/solr/uima/analysis/UIMATypeAwareAnnotationsTokenizerFactory.java:30: warning - Tag @link: reference not found: UIMATypeAwareAnnotationsTokenizer
            [javadoc] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/contrib/uima/src/java/org/apache/solr/uima/analysis/UIMAAnnotationsTokenizerFactory.java:30: warning - Tag @link: reference not found: UIMAAnnotationsTokenizer
            [javadoc] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/contrib/uima/src/java/org/apache/solr/uima/analysis/UIMATypeAwareAnnotationsTokenizerFactory.java:30: warning - Tag @link: reference not found: UIMATypeAwareAnnotationsTokenizer
            [javadoc] Generating /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/build/docs/api/org/apache/solr/util/package-summary.html...
            [javadoc] Copying file /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/core/src/java/org/apache/solr/util/doc-files/min-should-match.html to directory /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/build/docs/api/org/apache/solr/util/doc-files...
            [javadoc] Generating /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/build/docs/api/serialized-form.html...
            [javadoc] Copying file /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/lucene/tools/prettify/stylesheet+prettify.css to file /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/build/docs/api/stylesheet+prettify.css...
            [javadoc] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/contrib/uima/src/java/org/apache/solr/uima/analysis/UIMAAnnotationsTokenizerFactory.java:30: warning - Tag @link: reference not found: UIMAAnnotationsTokenizer
            [javadoc] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/contrib/uima/src/java/org/apache/solr/uima/analysis/UIMATypeAwareAnnotationsTokenizerFactory.java:30: warning - Tag @link: reference not found: UIMATypeAwareAnnotationsTokenizer
            [javadoc] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/contrib/uima/src/java/org/apache/solr/uima/analysis/UIMAAnnotationsTokenizerFactory.java:30: warning - Tag @link: reference not found: UIMAAnnotationsTokenizer
            [javadoc] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/contrib/uima/src/java/org/apache/solr/uima/analysis/UIMATypeAwareAnnotationsTokenizerFactory.java:30: warning - Tag @link: reference not found: UIMATypeAwareAnnotationsTokenizer
            [javadoc] Building index for all the packages and classes...
            [javadoc] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/contrib/uima/src/java/org/apache/solr/uima/analysis/UIMAAnnotationsTokenizerFactory.java:30: warning - Tag @link: reference not found: UIMAAnnotationsTokenizer
            [javadoc] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/contrib/uima/src/java/org/apache/solr/uima/analysis/UIMATypeAwareAnnotationsTokenizerFactory.java:30: warning - Tag @link: reference not found: UIMATypeAwareAnnotationsTokenizer
            [javadoc] Building index for all classes...
            [javadoc] Generating /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/solr/build/docs/api/help-doc.html...
            [javadoc] 15 warnings
          Tommaso Teofili added a comment -

          thanks Steven, now fixing

          tommaso committed 1295501 (1 file)

          [SOLR-3013] - adding analyzers-uima pathelement to fix error in javadocs generation

          tommaso committed 1295508 (1 file)

          [SOLR-3013] - removed unneeded custom ant configuration for solr-uima

          Tommaso Teofili added a comment -

          it should be ok now.

          tommaso committed 1295594 (1 file)

          [SOLR-3013] - fix the build of solr-uima

          tommaso committed 1295765 (1 file)

          [SOLR-3013] - added lucene-analyzers-uima dependency to solr-uima pom
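
          For Maven builds, the dependency named in this commit message would be declared roughly as follows. The artifactId comes from the commit message; the groupId is the usual Lucene one, and the version expression is an assumption about how the solr-uima pom tracks the Lucene version.

          ```xml
          <!-- Sketch of the solr-uima pom entry; the version expression is an assumption -->
          <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-uima</artifactId>
            <version>${project.version}</version>
          </dependency>
          ```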

          Lance Norskog added a comment -

          Is this committed?

          Erick Erickson added a comment -

          Well, it's still marked Resolution: "unresolved" so I assume not.

          Yonik Seeley added a comment -

          > Well, it's still marked Resolution: "unresolved" so I assume not.

          As long as commit messages have the JIRA issue in there, you can just click on "All" to see all commit related activity for the issue.

          Tommaso Teofili added a comment -

          yes, this is committed but it's not resolved yet as it needs to be adapted to 3.x as well.

          Robert Muir made changes -
          Fix Version/s 3.6 [ 12319065 ]
          Tommaso Teofili added a comment -

          due to the refactoring needed I think it makes sense to have this just in 4.0

          Tommaso Teofili made changes -
          Affects Version/s 3.5 [ 12317876 ]
          Tommaso Teofili made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Kai Gülzau added a comment -

          http://wiki.apache.org/solr/SolrUIMA does not mention these analyzers/tokenizers.
          Is there any documentation on how to use them?

          Gavin made changes -
          Link This issue depends on LUCENE-3731 [ LUCENE-3731 ]
          Gavin made changes -
          Link This issue depends upon LUCENE-3731 [ LUCENE-3731 ]
          Uwe Schindler made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee: Tommaso Teofili
            • Reporter: Tommaso Teofili