SOLR-2129

Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: None
    • Labels: None

      Description

      Provide components to enable Apache UIMA automatic metadata extraction to be exploited when indexing documents.
      The purpose of this is to get unstructured information "inside" a document and create structured metadata (as fields) to enrich each document.

      Basically this can be done with a custom UpdateRequestProcessor which triggers UIMA while indexing documents.
      The basic UIMA implementation of UpdateRequestProcessor extracts sentences (with a tokenizer and a hidden Markov model tagger), named entities, language, suggested category, keywords and concepts (using external services from OpenCalais and AlchemyAPI). Such an implementation can easily be extended by adding or selecting different UIMA analysis engines, either from UIMA repositories on the web or by creating new ones from scratch.

      More information can be found on the dedicated wiki page: http://wiki.apache.org/solr/SolrUIMA
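
      As a rough illustration of the approach described above, an update chain that runs a UIMA-backed processor before a document is indexed might be declared in solrconfig.xml along these lines. This is only a sketch: the chain name "uima" is arbitrary, and the factory's fully qualified class name is assumed from the package and class names that appear in the comments on this issue.

      ```xml
      <!-- Sketch only: the chain name and the factory class name are assumptions -->
      <updateRequestProcessorChain name="uima">
        <!-- runs UIMA analysis on each added document and adds the extracted metadata as fields -->
        <processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory"/>
        <processor class="solr.LogUpdateProcessorFactory"/>
        <processor class="solr.RunUpdateProcessorFactory"/>
      </updateRequestProcessorChain>
      ```

      The update request handler then has to be pointed at this chain for the processor to run.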

      1. SOLR-2129-version-6.patch
        211 kB
        Tommaso Teofili
      2. SOLR-2129-version-5.patch
        211 kB
        Tommaso Teofili
      3. SOLR-2129-version3.patch
        208 kB
        Tommaso Teofili
      4. SOLR-2129-version2.patch
        212 kB
        Tommaso Teofili
      5. SOLR-2129-asf-headers.patch
        225 kB
        Tommaso Teofili
      6. SOLR-2129.patch
        209 kB
        Tommaso Teofili
      7. SOLR-2129.patch
        200 kB
        Robert Muir
      8. lib-jars.zip
        6.80 MB
        Tommaso Teofili

        Activity

        Tommaso Teofili added a comment -

        Patch to port solr-uima GC project as a solr/contrib module

        Tommaso Teofili added a comment -

        Same patch plus required ASF headers on code and xml

        Robert Muir added a comment -

        Hello, could you upload the jar files this patch depends on to this issue?

        I tried to get them to test the patch, but I think there are problems in maven-land with Alchemy:

        http://repository.apache.org/snapshots/org/apache/uima/alchemy-annotator/2.3.1-SNAPSHOT/

        As you can see, the jar file is very out of date.

        Tommaso Teofili added a comment -

        Hello Robert, attached you can find an archive containing all lib/*.jar files.

        Robert Muir added a comment -

        Thanks Tommaso!

        I applied the patch: the build and tests work correctly, there aren't any intl/localization issues, and the code looks clean.

        Would another committer more familiar with these parts of Solr take a look? It looks like a good feature.

        Tommaso Teofili added a comment -

        Hello Robert,
        as it seems this patch hasn't been committed yet, I wonder if there is anything I should do or could help with.
        If so, please let me know.

        Robert Muir added a comment -

        Hi Tommaso: I was hoping to get another person to look at it, since it is not my area of expertise.

        But no one is stepping up, so I will take it. It will take me longer to review it though (sorry)!

        Robert Muir added a comment -

        Tommaso: I noticed the following in the maven configuration:

        <source>1.6</source>
        <target>1.6</target>
        

        But I took the patch and applied it to branch_3x (Java 5 only), removed just 3 interface @Overrides, and everything worked with Java 5.
        Can you confirm this is correct (that UIMA does not require Java 6)?

        If the patch only needs Java 5, it can be applied to our 3.x branch as well.

        Joern Kottmann added a comment -

        I am also interested in using this patch. Is it possible to run custom UIMA analysis, or only the pre-defined AlchemyAPI analysis?

        Tommaso Teofili added a comment -

        Robert, thanks for that. I confirm that UIMA doesn't require Java 6; Java 5 is fine, so this is fine for branch_3x too.

        Jörn, good to see you here too. Yes, you can also run custom UIMA analysis.
        The default AEs are WhitespaceTokenizer, Tagger, AlchemyAPIAnnotator and OpenCalaisAnnotator.

        To customize the default behavior you should either:
        a) change the OverridingParamsExtServicesAEDescriptor and, if needed, extend BaseUIMAUpdateRequestProcessor and its SolrUIMAConsumers

        or

        b) define a new AE descriptor and create for it a new class extending UIMAUpdateRequestProcessor (or extend BaseUIMAUpdateRequestProcessor) then modify the UIMAUpdateRequestProcessorFactory to initialize that class instead of the base one.

        If you need any parameters to be set at runtime for a delegate AE, you must set, inside the aggregate AE, an overriding parameter that overrides some parameter in the delegate AE and then define its runtime value in solrconfig with:

        <uimaConfig>
          <runtimeParameters>
            <overriding_param_name>RUNTIMEVALUE</overriding_param_name>
          </runtimeParameters>
        </uimaConfig>
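
        To make the first half of that recipe concrete, the fragment below sketches how an overriding parameter might be declared in the aggregate AE descriptor. It assumes the standard UIMA descriptor layout; the delegate key SomeDelegateAE and the delegate parameter someParam are hypothetical placeholders, and overriding_param_name is the same placeholder used in the solrconfig snippet above.

        ```xml
        <!-- Fragment of an aggregate analysis engine descriptor (sketch) -->
        <configurationParameters>
          <configurationParameter>
            <!-- the name referenced under <runtimeParameters> in solrconfig -->
            <name>overriding_param_name</name>
            <type>String</type>
            <multiValued>false</multiValued>
            <mandatory>false</mandatory>
            <overrides>
              <!-- hypothetical: delegate key / parameter name in the delegate AE -->
              <parameter>SomeDelegateAE/someParam</parameter>
            </overrides>
          </configurationParameter>
        </configurationParameters>
        ```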

        Mark Miller added a comment -

        I'm going to take a look at this when I get a chance as well. This looks like solid stuff.

        Grant Ingersoll added a comment -

        Cool stuff, Tommaso. I'm starting to look at adding classifiers into Solr via Mahout, so thought I would look at this too.

        A couple of early things, based on looking at the getting started instructions.

        1. I think we should do as we do with Tika and provide a way for users to map UIMA output to Solr fields, as opposed to having to hardcode specific fields.
        2. For the jars, have a look at how clustering is set up. We should be able to just point at the UIMA libs under contrib/uima/lib in solrconfig.xml instead of having to copy them around.
        Tommaso Teofili added a comment -

        Hi Grant, I think it would be great to have Mahout classifiers inside Solr.

        I like your suggestion at point 1.
        I can replace the current hardcoded mapping mechanism with a simple mapping between UIMA extracted types/features and field names, defined inside solrconfig.xml.

        A different option could be to develop a SolrCASConsumer component in UIMA (similar to Lucas [1], the Lucene CAS Consumer), providing full control over how UIMA annotations and features are mapped to Solr fields, but on the UIMA side.

        Regarding point 2, the jars are already under contrib/uima/lib, so I can modify the sample solrconfig.xml by adding the proper <lib> tag.
        Thanks for your comments and suggestions.

        [1] : https://svn.apache.org/repos/asf/uima/sandbox/trunk/Lucas
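
        The <lib> directive mentioned above could look roughly like this in the sample solrconfig.xml. The relative path is an assumption; it depends on where the Solr home directory sits relative to the contrib directory.

        ```xml
        <!-- Sketch: load the UIMA dependency jars directly from the contrib directory (path is an assumption) -->
        <lib dir="../../contrib/uima/lib" />
        ```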

        Grant Ingersoll added a comment -

        > I can change the current hardcoded mapping mechanism using instead a simple mapping between UIMA extracted types/features and field names defined inside solrconfig.xml

        Try to reuse the same syntax as the mapping in the ExtractingRequestHandler.

        > A different option could be to develop a SolrCASConsumer component in UIMA (similar to Lucas [1], Lucene CAS Consumer) providing full control on how UIMA annotations and features can be mapped to Solr fields, but on UIMA side

        I've been struggling with these kinds of questions a lot lately. That is, the marriage of two projects. Where should the code go? Setting up another ASF project is a pain in the amount of hoops to jump through. Apache Labs doesn't cut it for a number of reasons. Hosting on Github or Google Code is OK, but loses the ASF community aspect. Sigh.

        > Regarding point 2 the jars are already under contrib/uima/lib so I can modify the sample solrconfig.xml adding the proper <lib> tag.

        Yep, exactly what I had in mind.

        Tommaso Teofili added a comment -

        > Try to reuse the same syntax as the mapping in the ExtractingRequestHandler.

        OK. I added the <lib> tag and will submit a new patch when I'm finished with these changes.

        > I've been struggling with these kinds of questions a lot lately. That is, the marriage of two projects. Where should the code go? Setting up another ASF project is a pain in the amount of hoops to jump through. Apache Labs doesn't cut it for a number of reasons. Hosting on Github or Google Code is OK, but loses the ASF community aspect. Sigh.

        I agree with your point; I don't think it's easy to come up with a good, general answer for such situations.

        One general solution that comes to mind is establishing a single wide-purpose ASF project containing integrations between many different ASF projects. This could prepare the ground for two projects that want to "marry", but it could be too general and hard to maintain from a community point of view (e.g. should all the Lucene committers commit to the "integrations" project too, only because someone integrated Lucene with UIMA?). Another option could be to force two marrying projects to respect a standard (e.g. CMIS), so that developing a specialized "connector" wouldn't be needed anymore, but I don't think that's always possible since it could require a huge effort.

        In this particular case, in my opinion, the code should go into the project whose "pipeline" is being changed or enhanced. Since this Solr-UIMA integration adds a step to the Solr indexing process via an UpdateRequestProcessor, I think it should be part of the Solr codebase; a SolrCASConsumer, on the other hand, would add a (final) consumer to the UIMA pipeline, so it should be part of the UIMA codebase.

        Tommaso Teofili added a comment -

        > Try to reuse the same syntax as the mapping in the ExtractingRequestHandler.

        Inside <uimaConfig> there are many possible ways to define the configuration.
        Let's say we want to map the feature 'text' of type 'ConceptFS' to the field 'concept'. I thought of 3 options:

        1. exactly the same syntax as ExtractingRequestHandler, though Solr-UIMA is not a RequestHandler but an UpdateRequestProcessor; could this create confusion?
        <lst name="defaults">
          <str name="fmap.org.apache.uima.alchemy.ts.categorization.ConceptFS@text">concept</str>
        </lst>

        2. define the feature of a type to map to a field with one tag
        <map field="concept" feature="org.apache.uima.alchemy.ts.categorization.ConceptFS@text"/>

        3. have a more hierarchical and strict structure, which is less immediate to understand but maybe easier for UIMA experts
        <type name="org.apache.uima.alchemy.ts.categorization.ConceptFS">
          <feature name="text">concept</feature>
        </type>

        What do you think?
        Thanks for any advice,
        Tommaso

        Tommaso Teofili added a comment -

        I think I found the following good compromise:

        <type name="org.apache.uima.jcas.tcas.Annotation">
          <map feature="coveredText" field="tag"/>
        </type>

        I've also made the fields to analyze and the analysis engine path configurable in solrconfig.xml.

        Tommaso Teofili added a comment -

        Huge Solr-UIMA refactoring, including injecting the following information from the <uimaConfig> tag inside solrconfig:

        1. added dynamic field mapping with the following syntax:
        <fieldMapping>
          <type name="org.apache.uima.jcas.tcas.Annotation">
            <map feature="coveredText" field="tag"/>
          </type>
          <type name="org.apache.uima.jcas.tcas.AnotherAnnotationType">
            <map feature="featureName" field="anotherField"/>
          </type>
        </fieldMapping>

        2. added the AnalysisEngine descriptor path (must be inside the classpath):
        <analysisEngine>/org/apache/uima/desc/OverridingParamsExtServicesAE.xml</analysisEngine>

        3. added the fields whose values are to be analyzed, optionally merging their values so that UIMA runs only once:
        <analyzeFields merge="false">text,title</analyzeFields>

        The runtime parameters that set overriding parameters for delegate AEs remain the same:
        <runtimeParameters>
          <keyword_apikey>VALID_ALCHEMYAPI_KEY</keyword_apikey>
          <concept_apikey>VALID_ALCHEMYAPI_KEY</concept_apikey>
          <lang_apikey>VALID_ALCHEMYAPI_KEY</lang_apikey>
          <cat_apikey>VALID_ALCHEMYAPI_KEY</cat_apikey>
          <oc_licenseID>VALID_OPENCALAIS_KEY</oc_licenseID>
        </runtimeParameters>

        These changes should make the module much easier to use and more flexible.
        Looking forward to your feedback.
        Tommaso
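
        Putting the pieces from this comment together, a complete <uimaConfig> section might look like the sketch below. The element names and values are taken from the snippets above; the order of the child elements and the reduced set of runtime parameters are assumptions made for illustration.

        ```xml
        <uimaConfig>
          <!-- overriding parameters for delegate AEs (placeholder key values) -->
          <runtimeParameters>
            <keyword_apikey>VALID_ALCHEMYAPI_KEY</keyword_apikey>
            <oc_licenseID>VALID_OPENCALAIS_KEY</oc_licenseID>
          </runtimeParameters>
          <!-- descriptor path, which must be resolvable from the classpath -->
          <analysisEngine>/org/apache/uima/desc/OverridingParamsExtServicesAE.xml</analysisEngine>
          <!-- fields to analyze; merge="false" leaves their values unmerged -->
          <analyzeFields merge="false">text,title</analyzeFields>
          <!-- map the coveredText feature of each Annotation to the tag field -->
          <fieldMapping>
            <type name="org.apache.uima.jcas.tcas.Annotation">
              <map feature="coveredText" field="tag"/>
            </type>
          </fieldMapping>
        </uimaConfig>
        ```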

        Tommaso Teofili added a comment -

        Hi all, if anyone has had a chance to try the latest patch, please let me know your feedback.

        Justinas Jaronis added a comment -

        I tried your latest patch; however, after compiling, the resources (./contrib/uima/src/resources/*) are not included in the compiled project, so posting fails:

        java.lang.RuntimeException: org.apache.uima.resource.ResourceInitializationException
        at org.apache.solr.uima.processor.UIMAUpdateRequestProcessor.processAdd(UIMAUpdateRequestProcessor.java:81)
        at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
        at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1359)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:326)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
        at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
        Caused by: org.apache.uima.resource.ResourceInitializationException
        at org.apache.solr.uima.processor.ae.OverridingParamsAEProvider.getAE(OverridingParamsAEProvider.java:85)
        at org.apache.solr.uima.processor.UIMAUpdateRequestProcessor.processText(UIMAUpdateRequestProcessor.java:115)
        at org.apache.solr.uima.processor.UIMAUpdateRequestProcessor.processAdd(UIMAUpdateRequestProcessor.java:68)
        ... 24 more
        Caused by: java.lang.NullPointerException
        at org.apache.uima.util.XMLInputSource.<init>(XMLInputSource.java:114)
        at org.apache.solr.uima.processor.ae.OverridingParamsAEProvider.getAE(OverridingParamsAEProvider.java:64)
        ... 26 more

        when OverridingParamsAEProvider tries to read /org/apache/uima/desc/OverridingParamsExtServicesAE.xml. Where should this file (and its fellow XMLs) be located?

        Thanks for the effort. Great project!

        Tommaso Teofili added a comment -

        Hi Justinas,
        all the needed XML files should be under solr/contrib/uima/src/main/resources/org/apache/uima/desc/.
        Maybe I need to fix the Ant build.xml.
        I'll look into it; thanks for your feedback.

        Justinas Jaronis added a comment -

        The file is present at that location in the source tree after applying your patch, but it doesn't appear in any JARs / WARs (or maybe it doesn't have to appear?). And I can't find any place to inject it manually. I tried copying the whole directory structure to example/, but no luck. Thank you for the fast response.

        Tommaso Teofili added a comment -

        Thanks Justinas, I've found and fixed the problem, a new patch will come shortly.

        Tommaso Teofili added a comment -

        Here is a new patch with an updated contrib/uima/build.xml that includes the resources in the generated package.
        There is also a small README inside to guide configuration.
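
        The actual build.xml change is in the attached patch and is not reproduced in this thread. Purely as an illustration of the general technique, an Ant target that copies classpath resources next to the compiled classes before packaging could look like this; the target name, the depends value and the property name are all hypothetical.

        ```xml
        <!-- Hypothetical sketch, not the change made by the patch -->
        <target name="copy-resources" depends="compile">
          <copy todir="${build.classes.dir}">
            <!-- descriptors such as org/apache/uima/desc/*.xml live under this directory -->
            <fileset dir="src/main/resources"/>
          </copy>
        </target>
        ```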

        Justinas Jaronis added a comment -

        Woohoo! Works like a charm. One slight note: after trying to index some documents, I added multiValued="true" to the "entity*" field in schema.xml (I believe UIMA handles entities as an array).
        Thanks again, very much. I hope I'll also bring some resources into this project.

        Tommaso Teofili added a comment -

        I'm glad you appreciate it! And thanks for the hint.

        kamil added a comment -

        Hi Tommaso,

        I'm really curious to take a look at your work. Unfortunately, it doesn't compile after applying the patch:

        BUILD FAILED
        <your Solr trunk checkout dir>/trunk/solr/contrib/uima/build.xml:65: The following error occurred while executing this line:
        <your Solr trunk checkout dir>/trunk/solr/common-build.xml:267: /home/kamil/dev/solr/solr-old/trunk/solr/contrib/uima/lib does not exist.

        Obviously it worked out for Justinas so I am wondering what is wrong. Any idea?

        Great project, by the way!!!

        Tommaso Teofili added a comment -

        Hi Kamil,
        can you please take a look at your trunk/solr/contrib/uima directory? Does the lib folder exist? Can you find the jars in there?
        Let me know and thanks for your feedback

        Tommaso Teofili added a comment -

        Maybe a dedicated page on the wiki could help with installing, testing, and extending this patch.
        Any opinions?

        Lance Norskog added a comment -

        +1. There is a lot of material behind UIMA, and a wiki page describing it and some sample use cases would go a long way.

        kamil added a comment -

        Hi Tommaso,

        the trunk/solr/contrib/uima folder doesn't exist so I can't find any jars.
        Basically I follow the steps mentioned here: http://wiki.apache.org/solr/HowToContribute , i.e.

        • checkout trunk
        • apply patch (after that trunk/solr/contrib/uima exists)
        • ant build

        The build fails with above mentioned error.

        Lance Norskog added a comment -

        The directory trunk/solr/contrib/uima does not exist because the directory is not in the patch. The patch should include an empty "marker" file in trunk/solr/contrib/uima/lib so that the directory gets made.

        Robert Muir added a comment -

        patch synced to trunk.

        I also adjusted some minor things: the tests no longer rely on the CWD, I added an assume (with a set timeout) to the tests in case you have no internet connection, I removed the troublesome XML includes because they depend on the CWD, etc.

        I reviewed the code, I have no problem committing this to contrib so future iterations can be from svn. any objections?

        Mark Miller added a comment -

        I have no problem committing this to contrib so future iterations can be from svn. any objections?

        +1 - getting into trunk will likely expand usage and feedback, and get things rolling much faster. Bar is much lower for Solr contrib as well.

        I've only started looking at the patch, but a few notes I jotted down:

        StringBuffer usage in UpdateRequestProcessor - should be StringBuilder right?

        The below is a little odd, no (critical code I know )?

        /* execute the AE on the given JCas */
        private void executeAE(AnalysisEngine ae, JCas jcas) throws AnalysisEngineProcessException {
          ae.getLogger().log(Level.INFO, new StringBuffer("Analazying text").toString());
          ae.process(jcas);
          ae.getLogger().log(Level.INFO, new StringBuffer("Text processing completed").toString());
        }

        AEProviderFactory should be thread safe?? At a min, you have to consider multicore ... consider that you could be sharing AEProvider across threads because of this as well (static cache in AEProviderFactory). Perhaps the cache should not be static?

        Don't want to at least log this?

        } catch (AnalysisEngineProcessException e) {
          // do nothing
        }

        Tommaso Teofili added a comment -

        StringBuffer usage in UpdateRequestProcessor - should be StringBuilder right?

        yes, right.

        private void executeAE(AnalysisEngine ae, JCas jcas) throws AnalysisEngineProcessException { ae.getLogger().log(Level.INFO, new StringBuffer("Analazying text").toString()); ae.process(jcas); ae.getLogger().log(Level.INFO, new StringBuffer("Text processing completed").toString()); }

        I wanted to logically isolate everything regarding actual processing of text, but I agree that this piece of code would look better inside the calling method ( processText(String) ).

        AEProviderFactory should be thread safe?? At a min, you have to consider multicore ... consider that you could be sharing AEProvider across threads because of this as well (static cache in AEProviderFactory). Perhaps the cache should not be static?

        Thanks Mark for this, I agree the cache shouldn't be static especially in cases where each core has AEs with same classpaths but different runtime parameters.
        As for OverridingParamsAEProvider (the only AEProvider implementation available at the moment) being used by different threads, we can make the getAE() method synchronized (or perhaps make the cachedAE field volatile, but I need to check that more carefully).

        Don't want to at least log this? } catch (AnalysisEngineProcessException e) { // do nothing }

        I wanted the UIMA enrichment pipeline to be error safe but I agree it'd be reasonable to log the error in this case (even if I don't like logging exceptions in general).

        Tommaso Teofili added a comment -

        Just forgot to say: I'll create a new patch from the above considerations

        Tommaso Teofili added a comment - - edited

        Changes are:

        • Drop StringBuffer for StringBuilder in UIMAUpdateRequestProcessor.
        • Make the getAE method in OverridingParamsAEProvider synchronized to support concurrent requests to the provider.
        • Make the getAEProvider method in AEProviderFactory synchronized and make the cache "core aware": each core now has an AEProvider for each analysis engine's path.
        • The UIMAUpdateRequestProcessor constructor accepts a SolrCore as a parameter instead of a SolrConfig object.

        I tested it with multiple cores and concurrent updates for each core.
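
        The synchronization and "core aware" caching described above can be sketched roughly as follows. This is only an illustration under assumptions: SketchProvider and SketchProviderFactory are hypothetical stand-ins, the core is identified by a plain string, and the cache is shown as an instance field, which the patch may or may not do.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for an analysis engine provider; not the patch's class. */
final class SketchProvider {
  final String core;
  final String aePath;
  private Object cachedAE; // guarded by "this"

  SketchProvider(String core, String aePath) {
    this.core = core;
    this.aePath = aePath;
  }

  /** Lazily builds the engine once; synchronized so concurrent callers share it. */
  synchronized Object getAE() {
    if (cachedAE == null) {
      cachedAE = new Object(); // placeholder for the expensive engine creation
    }
    return cachedAE;
  }
}

/** Hypothetical factory whose cache is keyed by core name, then by AE path. */
final class SketchProviderFactory {
  private final Map<String, Map<String, SketchProvider>> cache = new HashMap<>();

  /** Returns the provider for this core and path, creating it on first use. */
  synchronized SketchProvider getProvider(String core, String aePath) {
    return cache
        .computeIfAbsent(core, c -> new HashMap<>())
        .computeIfAbsent(aePath, p -> new SketchProvider(core, p));
  }
}
```

        With this layout, two cores that use the same descriptor path still get separate providers, so differing runtime parameters per core cannot leak between them.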

        Lance Norskog added a comment -

        Don't want to at least log this? } catch (AnalysisEngineProcessException e) { // do nothing }

        I wanted the UIMA enrichment pipeline to be error safe but I agree it'd be reasonable to log the error in this case (even if I don't like logging exceptions in general).

        Please do not hide errors in any way. Nobody reads logs. If it fails in production, I want to know immediately and fix it. Please just throw all exceptions up the stack.

        Tommaso Teofili added a comment -

        Please do not hide errors in any way. Nobody reads logs. If it fails in production, I want to know immediately and fix it. Please just throw all exceptions up the stack.

        I think your point is a good one Lance, when I started working on this patch I wanted to avoid breaking the indexing pipeline (as this was an "add-on") but now that it's more stable I agree that any exception should be thrown.

        Tommaso Teofili added a comment -

        Each UIMAException (covering both ResourceInitializationException and AnalysisEngineProcessException) is now rethrown, wrapped in a RuntimeException. The processAdd method signature has to match the superclass one, so UIMAException cannot be declared in the UIMAUpdateRequestProcessor method signature.
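
        The wrapping follows from a Java rule: an overriding method cannot declare checked exceptions that the overridden method does not. A minimal sketch of the pattern, using hypothetical stand-in classes (SketchAnalysisException plays the role of the checked UIMA exception; none of these names come from the patch):

```java
/** Hypothetical checked exception standing in for the UIMA one. */
class SketchAnalysisException extends Exception {
  SketchAnalysisException(String message) {
    super(message);
  }
}

/** Hypothetical base class whose processAdd declares no checked exceptions. */
class SketchBaseProcessor {
  void processAdd(String text) {
    // default behaviour: pass the document along unchanged
  }
}

class SketchEnrichingProcessor extends SketchBaseProcessor {
  @Override
  void processAdd(String text) {
    try {
      analyze(text);
    } catch (SketchAnalysisException e) {
      // The override cannot add a checked exception, so wrap it and rethrow.
      throw new RuntimeException(e);
    }
    super.processAdd(text);
  }

  /** Placeholder analysis step that fails on empty input. */
  private void analyze(String text) throws SketchAnalysisException {
    if (text.isEmpty()) {
      throw new SketchAnalysisException("nothing to analyze");
    }
  }
}
```

        Callers that need the original failure can still reach it through getCause() on the RuntimeException.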

        Robert Muir added a comment -

        Tommaso, thanks for resolving all the items brought up in comments.

        Robert Muir added a comment -

        Committed revision 1062604 (trunk), 1062606 (branch_3x)

        Thanks Tommaso!

        Tommaso Teofili added a comment -

        Thanks Robert for taking care

        Grant Ingersoll added a comment -

        Bulk close for 3.1.0 release


          People

          • Assignee:
            Robert Muir
            Reporter:
            Tommaso Teofili
          • Votes:
            6
            Watchers:
            10

              Development