  1. Stanbol
  2. STANBOL-1326

Stanbol Enhancer 2.0 API



    • Type: Epic
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Labels:
    • Epic Name:
      Enhancer 2.0


      Enhancer API v2.0


      This describes the changes and additions made to the Stanbol Enhancer API with version 1.0.

      The main features of the new API are:

      • Clear separation between
        1. the content and analysis results
        2. metadata and state of the enhancement process
      • Support for [EnhancementProperties](https://issues.apache.org/jira/browse/STANBOL-488). (Note: a lightweight version is also supported starting with `0.12.1` and `1.0.0`; see [STANBOL-1280](https://issues.apache.org/jira/browse/STANBOL-1280) for details.) EnhancementProperties can be used for Enhancement Chain / ExecutionPlan specific parameters as well as request specific parameters. Typical use cases include passing credentials for remote services and configuring dereferenced fields, minimum confidence values, ...
      • Low level support for [Enhancement Workflows](https://issues.apache.org/jira/browse/STANBOL-1008): The new API will allow creating `EnhancementJobs` directly from RDF [ExecutionPlans](https://stanbol.apache.org/docs/trunk/components/enhancer/chains/ExecutionPlan) in addition to enhancement `Chains`. In addition, the `EnhancementJobManager` will support partial execution of selected `ExecutionNodes` as well as resuming the enhancement after a change of the execution plan. This will allow enhancement workflows to, e.g., (1) start with a simple language detection; (2) add additional `ExecutionNodes` based on the detected language and resume processing by passing the `EnhancementJob` to the `EnhancementJobManager` again.
      • Low level support for distributed computation of EnhancementJobs: The API will allow executing only selected `ExecutionNodes` of an [ExecutionPlan](https://stanbol.apache.org/docs/trunk/components/enhancer/chains/ExecutionPlan). This will allow running different Stanbol workers with different configurations. An `EnhancementJobManager` running on a worker could then be instructed to execute only specific `ExecutionNodes`.

      The following sections provide an overview of the API changes and additions.

      EnhancementJob

      The `EnhancementJob` represents the process of enhancing a `ContentItem` with the Stanbol Enhancer. It is a new interface introduced with `1.0`. Before 1.0 this was an implementation specific class used by the [EventJobManager](http://stanbol.staging.apache.org/docs/trunk/components/enhancer/enhancementjobmanager#eventjobmanager).

              + getJobId() : NonLiteral
              + getLock() : ReadWriteLock
              + getExecutionMetadata() : MGraph
              + getContentItem() : ContentItem

      The `EnhancementJob` provides access to both the `ContentItem` and the processing information. Only parsers, writers and the `EnhancementJobManager` are intended to hold a reference to the `EnhancementJob`. `EnhancementEngines` will only get a reference to the `ContentItem`. Engines will also no longer be able to access the `MGraph` with the [ExecutionMetadata](http://stanbol.apache.org/docs/trunk/components/enhancer/executionmetadata) or the [ExecutionPlan](https://stanbol.apache.org/docs/trunk/components/enhancer/chains/ExecutionPlan). In 0.12.1 both can be obtained via the [ContentParts](http://stanbol.staging.apache.org/stanbol/docs/trunk/enhancer/contentitem.#contentparts) of the processed `ContentItem`.

      The `jobId` of the EnhancementJob is used to reference the job. It SHOULD differ from the URI of the ContentItem to avoid issues with multiple requests for the same ContentItem (as described by [STANBOL-830](https://issues.apache.org/jira/browse/STANBOL-830)).
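
As a rough illustration of the getters listed above and of the `jobId` recommendation, the following sketch generates a fresh ID per job instead of reusing the ContentItem URI. It is not the Stanbol implementation: `ContentItem` and `SimpleJob` are simplified stand-ins, `String` and `Set<String>` replace the Clerezza `NonLiteral` and `MGraph` types, and the `urn:enhancement-job:` prefix is an assumption made for this example only.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class EnhancementJobSketch {

    /** Simplified stand-in for the Stanbol ContentItem; only its URI is modelled. */
    static final class ContentItem {
        final String uri;
        ContentItem(String uri) { this.uri = uri; }
    }

    /** Mirrors the four getters of the listing; String and Set<String> replace NonLiteral and MGraph. */
    interface EnhancementJob {
        String getJobId();
        ReadWriteLock getLock();
        Set<String> getExecutionMetadata();
        ContentItem getContentItem();
    }

    /** Hypothetical in-memory job that generates its own ID instead of reusing the ContentItem URI. */
    static final class SimpleJob implements EnhancementJob {
        private final String jobId = "urn:enhancement-job:" + UUID.randomUUID();
        private final ReadWriteLock lock = new ReentrantReadWriteLock();
        private final Set<String> executionMetadata = new LinkedHashSet<>();
        private final ContentItem contentItem;

        SimpleJob(ContentItem contentItem) { this.contentItem = contentItem; }

        @Override public String getJobId() { return jobId; }
        @Override public ReadWriteLock getLock() { return lock; }
        @Override public Set<String> getExecutionMetadata() { return executionMetadata; }
        @Override public ContentItem getContentItem() { return contentItem; }
    }

    public static void main(String[] args) {
        ContentItem ci = new ContentItem("urn:content-item:example");
        EnhancementJob first = new SimpleJob(ci);
        EnhancementJob second = new SimpleJob(ci);
        // Two requests for the same ContentItem get two distinct job IDs.
        System.out.println(first.getJobId().equals(second.getJobId())); // prints "false"
    }
}
```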

      The EnhancementJob API does not distinguish between the [ExecutionPlan](https://stanbol.apache.org/docs/trunk/components/enhancer/chains/ExecutionPlan) and the [ExecutionMetadata](http://stanbol.apache.org/docs/trunk/components/enhancer/executionmetadata). There is only a single getter for the ExecutionMetadata, which needs to provide access to both.

      In most cases it will be sufficient to copy the triples of the ExecutionPlan into the `MGraph` of the ExecutionMetadata before starting the enhancement. However, in use cases where the ExecutionPlan might change (e.g. between several partial executions) one can also use a setup where the ExecutionPlan is kept in a separate graph. To achieve this the Clerezza `UnionMGraph` implementation can be used. It creates a union view over several TripleCollections while applying all modifications to the first one. So creating a `UnionMGraph` with the MGraph holding the ExecutionMetadata at index `0` and the TripleCollection with the ExecutionPlan at index `1` results in the desired setup.
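
The union behaviour described above can be emulated with plain Java collections. The sketch below does not use Clerezza: `UnionView` is a hypothetical stand-in that only mirrors the described semantics (reads see both graphs, writes go to the first one), and the triples are placeholder strings.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

class UnionViewSketch {

    /** Read view over two triple sets; every modification goes to the first one only. */
    static final class UnionView {
        private final Set<String> metadata; // index 0: writable ExecutionMetadata
        private final Set<String> plan;     // index 1: ExecutionPlan, never modified through the view

        UnionView(Set<String> metadata, Set<String> plan) {
            this.metadata = metadata;
            this.plan = plan;
        }

        boolean contains(String triple) {
            return metadata.contains(triple) || plan.contains(triple);
        }

        void add(String triple) {
            metadata.add(triple);
        }

        int size() {
            // Count each distinct triple once, even if it is present in both sets.
            Set<String> all = new LinkedHashSet<>(metadata);
            all.addAll(plan);
            return all.size();
        }
    }

    public static void main(String[] args) {
        Set<String> metadata = new LinkedHashSet<>();
        Set<String> plan = new LinkedHashSet<>(Arrays.asList("node1 rdf:type ep:ExecutionNode"));
        UnionView view = new UnionView(metadata, plan);

        view.add("exec1 em:status completed"); // placeholder triple
        System.out.println(view.size()); // prints 2
        System.out.println(plan.size()); // prints 1
    }
}
```

Because the plan is held in its own set, it can be replaced between partial executions without touching the metadata collected so far.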

      EnhancementJobManager

      The job manager interface is very simple. It only contains the method to process an EnhancementJob. Optionally an array of `ep:ExecutionNode` instances can be passed.

              + enhance(EnhancementJob job, NonLiteral...executions)

      The passed `EnhancementJob` is expected to have its ExecutionMetadata initialized. In contrast to earlier Stanbol versions, the `EnhancementJobManager` is no longer responsible for initializing this metadata based on the passed enhancement `Chain`. This is now the responsibility of the `EnhancementJobBuilder`.

      The new `EnhancementJobManager` will support partial executions. This means that callers can request the job manager to process only some of the `ep:ExecutionNode` instances defined by the [ExecutionPlan](https://stanbol.apache.org/docs/trunk/components/enhancer/chains/ExecutionPlan). If no executions are passed the `EnhancementJobManager` is expected to execute all execution nodes.

      If an array of `ep:ExecutionNode` instances is passed the EnhancementJobManager must process only those and ignore all others. If one of those executions `ep:dependsOn` another `ep:ExecutionNode` that is neither included nor completed (not `ep:optional` and not yet processed) the job manager is expected to fail with a `ChainException`.

      The `EnhancementJobManager` needs to consider existing `em:EngineExecutions` and their `em:status`. This is important to correctly resume the processing of partially completed enhancement jobs.
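
The selection rules above could be sketched as follows. This is an illustration under stated assumptions, not the Stanbol job manager: `ExecutionNode` and `ChainException` are local stand-ins, the plan is a plain map instead of an RDF graph, and the set of completed node IDs stands in for the `em:status` of existing executions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PartialExecutionSketch {

    /** Stand-in for the ChainException named in the text. */
    static final class ChainException extends Exception {
        ChainException(String message) { super(message); }
    }

    /** Simplified ep:ExecutionNode: an ID, the ep:optional flag and its ep:dependsOn targets. */
    static final class ExecutionNode {
        final String id;
        final boolean optional;
        final List<String> dependsOn;

        ExecutionNode(String id, boolean optional, String... dependsOn) {
            this.id = id;
            this.optional = optional;
            this.dependsOn = Arrays.asList(dependsOn);
        }
    }

    /** Returns the IDs of the nodes to execute, or fails if a required dependency is missing. */
    static List<String> selectExecutable(Map<String, ExecutionNode> plan, Set<String> completed,
                                         String... requested) throws ChainException {
        // No explicit selection means that every node of the plan is considered.
        Collection<String> selected = requested.length == 0 ? plan.keySet() : Arrays.asList(requested);
        List<String> toRun = new ArrayList<>();
        for (String id : selected) {
            ExecutionNode node = plan.get(id);
            if (node == null) {
                throw new ChainException("Unknown execution node: " + id);
            }
            if (completed.contains(id)) {
                continue; // finished in an earlier run, nothing to do when resuming
            }
            for (String dependency : node.dependsOn) {
                ExecutionNode required = plan.get(dependency);
                boolean satisfied = completed.contains(dependency)
                        || selected.contains(dependency)
                        || (required != null && required.optional);
                if (!satisfied) {
                    throw new ChainException(id + " depends on unprocessed node " + dependency);
                }
            }
            toRun.add(id);
        }
        return toRun;
    }

    public static void main(String[] args) throws ChainException {
        Map<String, ExecutionNode> plan = new LinkedHashMap<>();
        plan.put("langdetect", new ExecutionNode("langdetect", false));
        plan.put("ner", new ExecutionNode("ner", false, "langdetect"));
        // Resume after language detection has completed and run only the "ner" node.
        System.out.println(selectExecutable(plan, Collections.singleton("langdetect"), "ner")); // prints [ner]
    }
}
```

Requesting only `ner` while `langdetect` has not been processed makes the same call fail with the stand-in `ChainException`.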

      EnhancementJobBuilder

      The EnhancementJobBuilder is used to create EnhancementJobs. As building an EnhancementJob requires selecting specific implementations of the `EnhancementJob` and `ContentItem`, the `EnhancementJobBuilder` does not have a constructor; instead a dedicated `EnhancementJobFactory` is used. The `EnhancementJobFactory` is an OSGi service and can be looked up as such by components that need to build `EnhancementJob` instances.

              + create() : EnhancementJobBuilder
              + contentSource(ContentSource) : EnhancementJobBuilder
              + id(String id)
              + contentRef(ContentReference)
              + chain(Chain chain)
              + execPlan(TripleCollection executionPlan)
              + **(..)
              + build() : EnhancementJob

      Intended Usage:

          EnhancementJobFactory ejf;
          EnhancementJobManager ejm;
          ContentSource content; // the passed content
          Chain chain; // the requested enhancement chain

      The `EnhancementJobBuilder` is obtained via the `EnhancementJobFactory#create()` method. After creation the builder provides an API to set the passed content, the id and the enhancement chain. As an alternative the ExecutionPlan can also be set as an RDF graph. After the configuration the `EnhancementJob` can be built with `#build()` and passed to the `EnhancementJobManager`.
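
The flow described in this paragraph might look like the following sketch. All types are minimal stand-ins declared only for this example. Returning the builder from `id(..)` and `chain(..)` is an assumption, since the listing states a return type only for `contentSource(..)`, and the ID value is a placeholder.

```java
class JobBuilderUsageSketch {

    /** Stand-ins for the Stanbol types; they carry just enough state for the example. */
    interface ContentSource {}
    interface Chain {}

    static final class EnhancementJob {
        final String id;
        final ContentSource content;
        final Chain chain;

        EnhancementJob(String id, ContentSource content, Chain chain) {
            this.id = id;
            this.content = content;
            this.chain = chain;
        }
    }

    /** Fluent builder; returning the builder from every setter is an assumption. */
    static final class EnhancementJobBuilder {
        private String id;
        private ContentSource content;
        private Chain chain;

        EnhancementJobBuilder id(String id) { this.id = id; return this; }
        EnhancementJobBuilder contentSource(ContentSource content) { this.content = content; return this; }
        EnhancementJobBuilder chain(Chain chain) { this.chain = chain; return this; }
        EnhancementJob build() { return new EnhancementJob(id, content, chain); }
    }

    /** In Stanbol the factory would be looked up as an OSGi service rather than instantiated. */
    static final class EnhancementJobFactory {
        EnhancementJobBuilder create() { return new EnhancementJobBuilder(); }
    }

    interface EnhancementJobManager {
        void enhance(EnhancementJob job, String... executions);
    }

    /** Builds a job from content and chain, then hands it to the job manager. */
    static EnhancementJob buildAndEnhance(EnhancementJobFactory ejf, EnhancementJobManager ejm,
                                          ContentSource content, Chain chain) {
        EnhancementJob job = ejf.create()
                .contentSource(content)
                .id("urn:example:id") // placeholder ID
                .chain(chain)
                .build();
        ejm.enhance(job); // no execution nodes passed: the whole plan is executed
        return job;
    }

    public static void main(String[] args) {
        EnhancementJobManager ejm = (job, executions) -> System.out.println("enhancing " + job.id);
        buildAndEnhance(new EnhancementJobFactory(), ejm, new ContentSource() {}, new Chain() {});
        // prints "enhancing urn:example:id"
    }
}
```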

      ContentItem

      There will also be minor adaptations to the ContentItem API. The main reason is the removal of the `ContentItemFactory` combined with the requirement of some `EnhancementEngines` to create `Blob` instances. Because of that, methods will be added to the ContentItem that allow adding a `Blob` content part based on a `ContentSource` as well as a `ContentSink`.

              + addContent(UriRef id, ContentSource source) : Blob
              + addContentStream(UriRef id, String mediaType) : ContentSink

      These methods will replace the `ContentItemFactory#createBlob(..)` and `ContentItemFactory#createContentSink(..)` methods. This means that EnhancementEngines that need to create `Blobs` no longer need to care about obtaining a `ContentItemFactory` instance. The right `Blob` implementation to be used will already be wired when the `ContentItem` is created by the `EnhancementJobBuilder`.


      • The `ContentItem#addPart(..)` method can still be used to add `Blob` instances to the `ContentItem`. This might be useful for engines that provide their own `Blob` implementation.
      • Both `addContent*` methods will override any content part registered with the passed id. Unlike the `#addPart(..)` method, they do NOT return the previously registered part.
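
The difference in return values noted above can be shown with a small sketch. `ContentItem` and `Blob` here are simplified stand-ins, not the Stanbol types: parts are keyed by `String` instead of `UriRef`, and the content is passed as a byte array plus media type instead of a `ContentSource`.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class ContentPartSketch {

    /** Stand-in for a Stanbol Blob: a private copy of the bytes plus a media type. */
    static final class Blob {
        final byte[] data;
        final String mediaType;

        Blob(byte[] data, String mediaType) {
            this.data = data.clone();
            this.mediaType = mediaType;
        }
    }

    /** Stand-in ContentItem that only manages content parts keyed by ID. */
    static final class ContentItem {
        private final Map<String, Object> parts = new LinkedHashMap<>();

        /** Like #addPart(..): returns the part previously registered under the ID, or null. */
        Object addPart(String id, Object part) {
            return parts.put(id, part);
        }

        /** Like the proposed #addContent(..): overrides any existing part and returns the new Blob. */
        Blob addContent(String id, byte[] source, String mediaType) {
            Blob blob = new Blob(source, mediaType);
            parts.put(id, blob);
            return blob;
        }

        Object getPart(String id) {
            return parts.get(id);
        }
    }

    public static void main(String[] args) {
        ContentItem ci = new ContentItem();
        ci.addPart("urn:part:1", "first");
        System.out.println(ci.addPart("urn:part:1", "second")); // prints "first", the replaced part

        byte[] text = "plain text".getBytes(StandardCharsets.UTF_8);
        Blob blob = ci.addContent("urn:part:1", text, "text/plain");
        System.out.println(ci.getPart("urn:part:1") == blob); // prints "true": the old part was overridden
    }
}
```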

      EnhancementEngine

      The `EnhancementEngine` interface will be adapted to pass the [EnhancementProperties](https://issues.apache.org/jira/browse/STANBOL-488) as an additional parameter of the `#computeEnhancements(..)` method.

              + getName() : String
              + canEnhance(ContentItem ci) : int
              + computeEnhancements(ContentItem ci, Map<String,Object> properties)

      A new Map instance with a copy of the properties will be passed to the engine. Therefore changes to the map will have no side effects.
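
A minimal sketch of this copy-per-call behaviour, assuming a plain `HashMap` copy on the caller side. The `EnhancementEngine` stand-in is reduced to the adapted method with a `String` in place of the `ContentItem`, and `callEngine` and the `min-confidence` key are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class EnginePropertiesSketch {

    /** Stand-in for the engine interface, reduced to the adapted method. */
    interface EnhancementEngine {
        void computeEnhancements(String contentItem, Map<String, Object> properties);
    }

    /** Hypothetical caller side: each engine call receives its own copy of the properties. */
    static void callEngine(EnhancementEngine engine, String contentItem, Map<String, Object> properties) {
        engine.computeEnhancements(contentItem, new HashMap<>(properties));
    }

    public static void main(String[] args) {
        Map<String, Object> properties = new HashMap<>();
        properties.put("min-confidence", 0.5); // hypothetical property name

        // An engine that modifies the map it receives.
        EnhancementEngine engine = (ci, props) -> props.put("min-confidence", 0.9);
        callEngine(engine, "urn:content-item:example", properties);

        System.out.println(properties.get("min-confidence")); // prints 0.5
    }
}
```

Note that `new HashMap<>(..)` creates a shallow copy: the map itself is independent, but mutable values stored in it would still be shared between caller and engine.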

      For details about EnhancementProperties see [STANBOL-488](https://issues.apache.org/jira/browse/STANBOL-488).




            • Assignee:
              rwesten Rupert Westenthaler