STANBOL-583: CELI enhancement engine(s) - Contribution to Stanbol

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.9.0-incubating
    • Fix Version/s: enhancer-0.10.0
    • Component/s: Enhancer
    • Labels:
    • Environment:
      Enhancement Engines developed as web service clients

      Description

      The services included so far in the module as Enhancement Engines are:

      • a Named Entity Recognition service for French
      • a Lemmatizer for Italian, German, Romanian, Russian, Danish (it creates an annotation on the document whose content is the lemmatized form of the document)
      • a Language Identifier for Italian, French, German, Spanish, Portuguese, Polish, Hungarian, Dutch, Swedish, Arabic, Russian, Turkish, Romanian, Greek, Norwegian
      • a Document Classification service for Italian, French, German, English, Spanish, Portuguese that associates a document with DBpedia classes

      Attachments

      1. STANBOL-583-celi-engines_20120511_abosca.patch (109 kB, Alessio Bosca)
      2. STANBOL-583-celi-engines_20120423_rwesten.patch (110 kB, Rupert Westenthaler)
      3. celiPatchNER.patch (9 kB, Alessio Bosca)
      4. celi.zip (120 kB, Alessio Bosca)

          Activity

          rwesten Rupert Westenthaler added a comment -

          The CELI engines are now ready to use

          • they are included in the Enhancer Bundle List
          • users will need to add the license key OR enable usage of the test account. After doing this, one needs to manually stop/start the engine(s) in the components tab of the Felix Web Console ({host}/system/console/components)

          NOTE: the license key can be set for all engines via a system property (-Dceli.license={user}:{pwd}) or by adding it to the sling.properties file in the Stanbol root directory.

          A big thanks to Alessio and the CELI team for their contributions!

          alessio.bosca Alessio Bosca added a comment -

          Hi Rupert,

          thanks for the feedback; we fixed the problem on the server side. A 4xx HTTP status is now returned when the license key is not well formed (and therefore not correct).

          Alessio

          rwesten Rupert Westenthaler added a comment -

          Hi Alessio,

          While testing I found another server-side issue.

          When configuring an illegally formatted license key (one that is not in the form '{user-name}:{password}'), the CELI server answers with "200 OK" but sends a plain-text error message as content. This then results in a rather unrelated

          Caused by: org.xml.sax.SAXParseException: Content is not allowed in prolog.

          exception, because valid XML is expected.

          In my opinion the server should return an HTTP status 4xx (e.g. 400 Bad Request) in those cases.

          I have written a corresponding unit test [1], but have deactivated it for now so as not to break the build.

          NOTE: This could also be solved by validating the license key parameter within the activate methods of the CELI engines. If you prefer this option I would add a corresponding utility method to the "org.apache.stanbol.enhancer.engines.celi.utils.Utils" class.

          [1] http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/celi/src/test/java/org/apache/stanbol/enhancer/engines/celi/CeliHttpTest.java
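
The client-side alternative mentioned in the NOTE could look roughly like the following. This is only a minimal sketch: it assumes that a well-formed key is any string with a non-empty user name and a non-empty password separated by a colon, and the class and method names are hypothetical (they are not part of the existing Utils class).

```java
/** Hypothetical helper for illustration; not the actual Stanbol Utils API. */
final class LicenseKeyCheck {

    private LicenseKeyCheck() {}

    /**
     * Returns true if the key has the form "{user-name}:{password}",
     * i.e. a non-empty part on both sides of the first colon.
     */
    static boolean isWellFormed(String licenseKey) {
        if (licenseKey == null) {
            return false;
        }
        int colon = licenseKey.indexOf(':');
        // The colon must exist and must be neither the first nor the last character.
        return colon > 0 && colon < licenseKey.length() - 1;
    }

    /** Fails fast so that a malformed key is reported before any request is sent. */
    static String requireWellFormed(String licenseKey) {
        if (!isWellFormed(licenseKey)) {
            throw new IllegalArgumentException(
                "License key must have the form '{user-name}:{password}'");
        }
        return licenseKey;
    }
}
```

Inside an engine's activate method, a check like this would probably be reported through the OSGi configuration error mechanism rather than an IllegalArgumentException; the exception type here is only a placeholder.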

          alessio.bosca Alessio Bosca added a comment -

          Fixed problem caused by invalid XML character on service side.

          rwesten Rupert Westenthaler added a comment -

          Hi Alessio,

          While testing I found another bug in your server implementation. In revision 1344669 I added another unit test to the NER engine that reproduces it nicely.

          The root cause is

          Caused by: org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x19) was found in the element content of the document.

          which is thrown while creating the

          SOAPBody soapBody = message.getSOAPBody();

          for the response data of the NER response (NERserviceClientHTTP). Based on a short Google search, I assume that the server does not correctly escape special characters in the labels of detected entities. Most posts suggest that using "StringEscapeUtils.escapeXml(..)" solves this.

          NOTE: This does not block this issue, as it does not affect the contributed Engine.
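
One caveat on the escaping suggestion: U+0019 is a control character that XML 1.0 does not allow in a document at all, so escaping the markup characters alone may not be enough, and the server would probably also need to drop or replace such characters. The following is a minimal sketch of that idea, not the actual server code; the class and method names are hypothetical.

```java
/** Hypothetical helper for illustration; not part of the CELI service or Stanbol. */
final class XmlText {

    private XmlText() {}

    /** True if the code point is allowed by the XML 1.0 "Char" production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Drops characters that XML 1.0 forbids and escapes the five markup characters. */
    static String sanitize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!isXml10Char(cp)) {
                continue; // e.g. U+0019, the character named in the SAXParseException
            }
            switch (cp) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default: out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

Whether silently dropping the character or replacing it with a space is preferable depends on how the start/end offsets of the entities are computed, which is not visible from the client side.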

          rwesten Rupert Westenthaler added a comment -

          Hi Alessio

          After applying the patch, the unit tests complete successfully. I plan to go over all the engines again and test them within a running Stanbol instance later today.

          alessio.bosca Alessio Bosca added a comment -

          Dear Rupert,

          we have fixed the service-side problems with the start/end positions in the text and the issue in the returned formKind. We have also added support for Italian NER.
          I made a few changes in the client (an extra parameter for the language) and have just submitted a patch for that.

          Alessio

          alessio.bosca Alessio Bosca added a comment -

          Added support for Italian

          rwesten Rupert Westenthaler added a comment -

          Hi Alessio

          First:

          I had trouble applying your patch. While I think that I finally managed to apply it correctly, you might want to validate this. To make further work easier I created a separate branch: https://svn.apache.org/repos/asf/incubator/stanbol/branches/celi-enhancement-engines

          Generated Enhancements:

          To make the validation of the Stanbol Enhancement Structure common to all EnhancementEngines I implemented STANBOL-612. This new validation utility is now also used for the CELI NER engine and identified several issues with the created enhancements. Some of them I have already fixed, but there are two remaining where I will most likely need your help.

          1) The NER enhancement for "28 septembre 1934" (time) returns "28 eptembre 1934 " as formKind. Because of that the "selected-text" does not correspond with the parsed text and the validation fails.

          2) The start/end positions for "Paris" have an offset of two chars. The validation states that "ris, " is selected instead of "Paris".

          Alessio, it would be good if you could have a look at the two described issues, as having access to the server-side logs seems critical to work on that.

          best
          Rupert

          Detailed list of my changes to the CELI NER engine:

          • I have made the supported language(s) configurable, as I expect that configuring a different service URL might make it possible to support other languages. Multiple languages can be configured by
            • comma separated String e.g. "fr;it;de"
            • Array or Collection of Strings e.g. ["fr","it","de"]
          • Enhancement creation:
            • start/end positions are now xsd:int
            • implemented a simple extraction of the selection-context (max. 50 characters of prefix/suffix around the selection, but it tries to cut off at word boundaries)
            • selected-text and selection-context are now PlainLiterals and use the language as detected for the text.
          • UnitTest now uses the EnhancementStructureHelper to validate created enhancements
          • Instead of using the CELI language identification engine the test now statically adds the triple "ci.getUri(),dc:language,'fr'" to the Enhancement graph.
          • NER http client:
            • tried to add "<?xml version="1.0" encoding="UTF-8"?>" to the request. However, this did not have the expected result.
            • revision #1338567 adds support for streaming the XML-escaped text directly to the HTTP request.
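
The selection-context extraction described in the change list (at most 50 characters before and after the selection, cut at word boundaries where possible) could be sketched as follows. This is an illustration under assumptions, not the code committed to the engine: the class name, the method name and the use of a plain space as the word separator are all hypothetical.

```java
/** Hypothetical sketch of a selection-context extraction; not the committed Stanbol code. */
final class SelectionContext {

    /** Maximum number of characters taken before and after the selection. */
    static final int MAX_CONTEXT = 50;

    private SelectionContext() {}

    /**
     * Returns the selection [start, end) plus up to MAX_CONTEXT characters of
     * prefix and suffix, trimmed to spaces so that no partial word is kept.
     */
    static String extract(String text, int start, int end) {
        int from = Math.max(0, start - MAX_CONTEXT);
        int to = Math.min(text.length(), end + MAX_CONTEXT);
        if (from > 0) {
            // Drop the possibly partial first word: continue just after the
            // next space, provided that space lies before the selection.
            int space = text.indexOf(' ', from);
            if (space >= 0 && space < start) {
                from = space + 1;
            }
        }
        if (to < text.length()) {
            // Drop the possibly partial last word: cut at the last space at or
            // before 'to', provided that space lies after the selection.
            int space = text.lastIndexOf(' ', to);
            if (space > end) {
                to = space;
            }
        }
        return text.substring(from, to);
    }
}
```

For a short text the whole text is returned as context; for longer texts the window shrinks to the nearest spaces inside the 50-character limits.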
          alessio.bosca Alessio Bosca added a comment -

          Since I forgot to select the option when I uploaded my last patch I want to explicitly grant the Apache License 2.0 for "STANBOL-583-celi-engines_20120511_abosca.patch" as attached to STANBOL-583.

          Alessio

          alessio.bosca Alessio Bosca added a comment -

          Hi Rupert,

          sorry for the delay in updating the patch.
          1) checked out the project from svn (11 May) and validated your changes
          2) added support for Online mode and the related dependency. I tested it locally and the engines are not loaded when Stanbol is started with -Dorg.apache.stanbol.offline.mode=true
          3) explicitly specified the charset encoding as UTF-8; it should fix the issues you encountered. Could you please check if it works on your system? (I don't have a Mac for testing it)
          4) removed the reference to prefixes in XML response parsing
          5) the enhancement engines now use a single **ClientHTTP instance

          Let me know if the patch is fine.

          Bests, Alessio

          alessio.bosca Alessio Bosca added a comment -

          Patch created against the svn version of 11 May 2012

          alessio.bosca Alessio Bosca added a comment -

          Hi Rupert,

          thanks for the work on integrating the Engine and for the feedback. I'll work on the suggested todo list and send you an update as soon as it is ready (I should be able to send you a new version by Thursday).

          Bests
          Alessio



          rwesten Rupert Westenthaler added a comment -

          NOTE: The originally attached zip archive was not a patch, but an archive of the source tree. Because it adds a new EnhancementEngine, I was still able to apply it correctly by extracting the archive, copying it to /enhancer/engine and removing all svn metadata.

          Created a new patch that includes the following changes:

          • Applied some minor changes necessary to compile with recent changes within the trunk.
          • Dependencies
            • changed the dependency on the Apache commons httpclient to the OSGi bundle version "httpclient-osgi"
            • removed the unused dependency on OpenNLP
            • there are now no embedded dependencies
          • Logging
            • changed the Logger API from Apache log4j to SLF4J, the logging framework used by Apache Stanbol
            • logging in the tests still uses log4j via SLF4J

          TODOs/Questions:

          1. A Stanbol EnhancementEngine MUST support "offline mode": this ensures that no connections to external services are made if Stanbol is started in offline mode (-Dorg.apache.stanbol.offline.mode=true). EnhancementEngines that require an external service then need to deactivate themselves. This is most easily achieved by adding

          @Reference
          private OnlineMode onlineMode;

          as the OnlineMode service will only be available if OfflineMode is deactivated.

          You will also need to add

          <dependency>
            <groupId>org.apache.stanbol</groupId>
            <artifactId>org.apache.stanbol.commons.stanboltools.offline</artifactId>
            <scope>provided</scope>
          </dependency>

          2. While all unit tests succeed I noticed exceptions like

          com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
          at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:684)
          ...

          indicating that the character encoding used by the received data is not UTF-8. In fact the responses of the service do not specify any encoding:

          <?xml version="1.0" ?><S:Envelope xmlns:S="http://schemas.xmlsoap.org/soap/envelope/"><S:Body><ns2:guessLanguageResponse ...

          However, I think this is related to how the response data is processed by the **ClientHTTP.java classes.

          • The "doPost(..)" method returns a String and uses "UTF-8" for decoding the String from the received bytes. So far so good.
          • The calling method then creates another ByteArrayInputStream for the returned String by using String.getBytes(). This creates the byte[] representation of the String using the platform encoding ("MAC Roman" in my case).
          • This stream is then passed to SOAPPart#setContent(...). I assume that, because the XML string does not include an explicit charset, the implementation will use UTF-8 to parse the "MAC Roman" encoded byte sequence.

          I would suggest to change the doPost(..) method to return the InputStream and set this stream directly to SOAPPart#setContent(...).
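
The effect described above can be reproduced in isolation. The sketch below is an assumption-laden illustration rather than the **ClientHTTP code: ISO-8859-1 stands in for a non-UTF-8 platform default (MacRoman is not guaranteed to be available on every JVM), the class and method names are hypothetical, and StandardCharsets requires Java 7 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demonstration class; not part of the CELI engines. */
final class CharsetRoundTrip {

    private CharsetRoundTrip() {}

    /**
     * Encodes the XML with the given charset and decodes the bytes as UTF-8,
     * which is what an XML parser assumes when no encoding is declared.
     */
    static String encodeThenDecodeAsUtf8(String xml, Charset encodeWith) {
        byte[] bytes = xml.getBytes(encodeWith);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Safe variant: never rely on the platform default when creating the stream. */
    static InputStream toUtf8Stream(String xml) {
        return new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
    }
}
```

With an explicit UTF-8 encoding the text survives the round trip; with a different encoding any non-ASCII character is corrupted, which matches the MalformedByteSequenceException seen by the parser. Returning the response InputStream directly, as suggested, avoids the intermediate String altogether.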

          3. I noticed that for each request a **ClientHTTP instance is created. I would rather expect a single instance to be created during the engine activation or do I miss a good reason why it is better to create a new instance for each enhancement request?

          4. The ClassificationClientHTTP uses "ns2:label" and "ns2:score" to access the data. This seems dangerous, as the prefixes may depend on the XML framework in use and might change over time. I would suggest explicitly referring to the namespace "http://linguagrid.org/v20110204/commons" instead.
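
A namespace-based lookup of this kind can be sketched with the standard DOM API. The namespace URI and the local names "label" and "score" come from the comment above; the class name, the method name and the shape of any XML passed in are assumptions for illustration, not the actual service response or client code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Hypothetical helper for illustration; not the ClassificationClientHTTP code. */
final class NamespaceLookup {

    static final String COMMONS_NS = "http://linguagrid.org/v20110204/commons";

    private NamespaceLookup() {}

    /** Returns the text of the first element with this local name in COMMONS_NS, or null. */
    static String firstText(String xml, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the *NS lookup methods
            Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // Matches on namespace URI and local name, regardless of the prefix used.
            NodeList nodes = doc.getElementsByTagNameNS(COMMONS_NS, localName);
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

Because the lookup ignores the prefix, a response that uses "ns2:label" and one that uses any other prefix bound to the same namespace are handled identically.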

          Alessio Bosca, can you please

          1. validate that my changes work with the current trunk
          2. validate that my changes in the dependencies do not break the engines
          3. add support for Offline Mode
          4. have a look at the char encoding issues I encountered

          On my TODO list is

          1. validation of the created RDF (TextAnnotations, EntityAnnotations, TopicAnnotations)
          2. read/write locks on the ContentItem and the metadata (as you return "ENHANCE_ASYNC" in the canEnhance(..) method this is necessary)
          3. testing the Engines on a Stanbol instance within a real EnhancementChain.

          rwesten Rupert Westenthaler added a comment -

          I was not able to work on this before leaving for WWW2012, so I will not be able to get to it until next weekend. If someone can work on it this week, that would be great.

          alessio.bosca Alessio Bosca added a comment -

          A demo installation with the submitted enhancement engines is available at http://research.celi.it:8082/
          For any problem or feedback: alessio.baosca@celi.it

          alessio.bosca Alessio Bosca added a comment -

          code of the enhancement engine (it expects to be placed in the project tree under enhancement/engines/ in order to correctly resolve the parent pom location)


            People

            • Assignee: Rupert Westenthaler (rwesten)
            • Reporter: Alessio Bosca (alessio.bosca)
            • Votes: 0
            • Watchers: 2
