NOTE: The originally attached zip archive was not a patch but an archive of the source tree. Because it adds a new EnhancementEngine, I was still able to apply it correctly by extracting the archive, copying it to /enhancer/engine and removing all svn metadata.
Created a new patch that includes the following changes:
- Applied some minor changes necessary to compile with recent changes within the trunk.
- changed the dependency on Apache Commons HttpClient to the OSGi bundle version "httpclient-osgi"
- removed the unused dependency on OpenNLP
- now there are no embedded dependencies
- changed the logging API from Apache log4j to SLF4J, the logging framework used by Apache Stanbol.
- Logging in the tests still uses log4j via SLF4J
1. Stanbol EnhancementEngines MUST support "offline mode": this ensures that no connections to external services are made if Stanbol is started in offline mode (-Dorg.apache.stanbol.offline.mode=true). EnhancementEngines that require an external service then need to deactivate themselves. This is most easily achieved by adding
private OnlineMode onlineMode;
as the OnlineMode service will only be available if OfflineMode is deactivated.
You will also need to add
2. While all unit tests succeed I noticed exceptions like
com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
indicating that the character encoding of the received data is not UTF-8. In fact, the responses of the service do not specify any encoding:
<?xml version="1.0" ?><S:Envelope xmlns:S="http://schemas.xmlsoap.org/soap/envelope/"><S:Body><ns2:guessLanguageResponse ...
However, I think this is related to how the request data is processed by the **ClientHTTP.java classes:
- the "doPost(..)" method returns a String and uses "UTF-8" to decode the String from the received bytes. So far so good.
- the calling method then creates another ByteArrayInputStream for the returned String by using String.getBytes(). This creates the byte representation of the String using the platform encoding ("MAC Roman" in my case).
- This stream is then passed to SOAPPart#setContent(...). I assume that, because the XML string does not include an explicit charset, the implementation will use UTF-8 to parse the "MAC Roman" encoded byte sequence.
I would suggest changing the doPost(..) method to return the InputStream and passing this stream directly to SOAPPart#setContent(...).
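The effect described in point 2 can be reproduced with the JDK alone. The following is only a sketch under assumptions: the class and method names are hypothetical, the sample payload is made up, and ISO-8859-1 stands in for the platform encoding because "MAC Roman" is not guaranteed to be available in every JRE. It is not the code of the **ClientHTTP classes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the patch.
final class EncodingRoundTrip {

    // Mimics the suspected flow: decode the response as UTF-8, then re-encode
    // the String with another charset. String.getBytes() without an argument
    // uses Charset.defaultCharset() in the same way.
    static byte[] reencode(byte[] utf8Response, Charset platformCharset) {
        String body = new String(utf8Response, StandardCharsets.UTF_8);
        return body.getBytes(platformCharset);
    }

    // Strict UTF-8 check, comparable to an XML parser that assumes UTF-8
    // because the document declares no encoding.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Made-up payload containing one non-ASCII character.
        byte[] response = "<label>caf\u00e9</label>".getBytes(StandardCharsets.UTF_8);
        byte[] reencoded = reencode(response, StandardCharsets.ISO_8859_1);
        System.out.println("untouched bytes valid UTF-8: " + isValidUtf8(response));
        System.out.println("re-encoded bytes valid UTF-8: " + isValidUtf8(reencoded));
    }
}
```

Passing the received bytes on unchanged (or, if a String is unavoidable, calling getBytes with an explicit UTF-8 charset) avoids the mismatch on any platform.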
3. I noticed that a new **ClientHTTP instance is created for each request. I would rather expect a single instance to be created during engine activation, or am I missing a good reason why it is better to create a new instance for each enhancement request?
4. The ClassificationClientHTTP uses "ns2:label" and "ns2:score" to access the data. This seems dangerous, as the prefixes may depend on the XML framework in use and might change over time. I would suggest explicitly referring to the namespace "http://linguagrid.org/v20110204/commons" instead.
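As an illustration of point 4, the sketch below looks up elements by namespace URI and local name, so the prefix chosen by the serializer no longer matters. It uses the plain JDK DOM API rather than whatever API ClassificationClientHTTP actually uses; the class name, the helper method and the XML fragments are made up for the example, and only the namespace URI comes from the service.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the patch.
final class NamespaceLookup {

    static final String COMMONS_NS = "http://linguagrid.org/v20110204/commons";

    // Returns the text of the first element with the given local name in
    // COMMONS_NS, or null if there is no such element.
    static String firstValue(String xml, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // namespace awareness is off by default
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagNameNS(COMMONS_NS, localName);
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Unable to parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Same made-up content, serialized with two different prefixes.
        String withNs2 = "<r xmlns:ns2=\"" + COMMONS_NS + "\">"
                + "<ns2:label>sports</ns2:label><ns2:score>0.8</ns2:score></r>";
        String withOther = "<r xmlns:c=\"" + COMMONS_NS + "\">"
                + "<c:label>sports</c:label><c:score>0.8</c:score></r>";
        System.out.println(firstValue(withNs2, "label") + " / " + firstValue(withOther, "label"));
        System.out.println(firstValue(withNs2, "score") + " / " + firstValue(withOther, "score"));
    }
}
```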
Alessio Bosca, can you please
1. validate that my changes work with the current trunk
2. check that my changes to the dependencies do not break the engines
3. add support for Offline Mode
4. have a look at the char encoding issues I encountered
On my TODO list is
1. validation of the created RDF (TextAnnotations, EntityAnnotations, TopicAnnotations)
2. read/write locks on the ContentItem and the metadata (as you return "ENHANCE_ASYNC" in the canEnhance(..) method this is necessary)
3. testing the Engines on a Stanbol instance within a real EnhancementChain.
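For TODO item 2, the general read/write locking pattern looks roughly like the sketch below. It is written against java.util.concurrent only: the LockedMetadata class and its list of strings are hypothetical stand-ins for the ContentItem metadata, and how the lock is obtained from a real ContentItem is deliberately not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for metadata shared between asynchronously running engines.
final class LockedMetadata {

    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> statements = new ArrayList<>();

    // Writers take the exclusive write lock and always release it in finally.
    void add(String statement) {
        lock.writeLock().lock();
        try {
            statements.add(statement);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Readers share the read lock and work on a copy taken while it is held.
    List<String> snapshot() {
        lock.readLock().lock();
        try {
            return new ArrayList<>(statements);
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        LockedMetadata metadata = new LockedMetadata();
        metadata.add("TextAnnotation");
        metadata.add("TopicAnnotation");
        System.out.println(metadata.snapshot());
    }
}
```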