Details

    • Sub-task
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • Contenthub
    • None

    Description

      The SemanticIndex is the interface used by the ContentHub to semantically index ContentItems (2nd level store). It is anticipated that a ContentHub will manage multiple semantic indexes, possibly of different implementations.

      Expected implementations of this interface include:

      • The current Solr/LDPath based semantic index component
      • The current Contenthub default index (also Solr based)
      • A SPARQL based variant implemented by a Triple Store

      The remainder of this specification defines the SemanticIndex interface as well as the SemanticIndexManager.

      SemanticIndex
      --------------------

      The Java interface for semantic indexes as used by the Apache Stanbol Contenthub.

          1. Identification

      :::java
      /** The name of the Index */
      + getName()
      /** An optional free text description */
      + getDescription()

      The name of the semantic index is intended to be used for simple lookups as well as for relative paths within the RESTful interfaces. However, it MUST NOT be assumed to be unique. See section [Semantic Index Management](#Semantic_Index_Management) for details on how name conflicts are resolved.

          1. Indexing

      First, the interface defines methods for adding documents to and removing them from the semantic index:

      :::java
      /** Indexes the passed ContentItem */
      + index(ContentItem ci) : boolean
      /** Deletes the ContentItem with the passed id */
      + remove(String ciUri)
      /** Ensures that changes to the index are persisted */
      + persist(long revision)
      /** Getter for the highest successfully persisted revision */
      + getRevision() : long

      The boolean returned by the index method indicates whether the passed ContentItem was actually included in the semantic index. A semantic index may define filters on the content items to be included.

      The persist method is intended to tell the semantic index that indexing has finished. This allows the semantic index to form batches over multiple calls to index(..) and remove(..), which may improve performance when indexing multiple ContentItems.

      In addition, it is used to pass the highest revision of the indexed content items. If no revision has yet been announced to a semantic index (persist(..) was never called), then getRevision() shall return a negative number.

      The revision will be used by the ContentHub to re-synchronize the contents of a semantic index with the enhanced ContentItems present in the [Store](store.html) when it becomes active. Usually the long value will represent the time in milliseconds as returned by <code>System.currentTimeMillis()</code>, but this is not a requirement. It is only important that each change to the Store results in an increase of this number.

      All of the above methods may throw a SemanticIndexingException. This is a subclass of ContenthubException.
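The batching contract described above can be illustrated with a minimal sketch. It is not the Stanbol implementation: `ContentItem` is reduced to a hypothetical stand-in that only carries a URI, the `void` return types of `remove` and `persist` are assumed, and `InMemoryIndex` with its URI-prefix filter is invented purely for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical stand-in for the Contenthub ContentItem; only the URI is modeled. */
final class ContentItem {
    final String uri;
    ContentItem(String uri) { this.uri = uri; }
}

/** Exception hierarchy as named in the text (constructors are assumptions). */
class ContenthubException extends Exception {
    ContenthubException(String msg) { super(msg); }
}

class SemanticIndexingException extends ContenthubException {
    SemanticIndexingException(String msg) { super(msg); }
}

/** Indexing part of the interface; void return types are assumed. */
interface SemanticIndex {
    boolean index(ContentItem ci) throws SemanticIndexingException;
    void remove(String ciUri) throws SemanticIndexingException;
    void persist(long revision) throws SemanticIndexingException;
    long getRevision() throws SemanticIndexingException;
}

/** Toy in-memory index that collects changes until persist(..) is called. */
final class InMemoryIndex implements SemanticIndex {
    private final Set<String> pending = new LinkedHashSet<>();
    private final Set<String> committed = new LinkedHashSet<>();
    private long revision = -1; // negative: persist(..) was never called

    @Override
    public boolean index(ContentItem ci) {
        // Example filter (an assumption): only items in a given namespace are indexed.
        if (!ci.uri.startsWith("urn:content:")) {
            return false;
        }
        pending.add(ci.uri);
        return true;
    }

    @Override
    public void remove(String ciUri) {
        pending.remove(ciUri);
        committed.remove(ciUri);
    }

    @Override
    public void persist(long revision) {
        // Flush the whole batch at once and remember the announced revision.
        committed.addAll(pending);
        pending.clear();
        this.revision = revision;
    }

    @Override
    public long getRevision() { return revision; }

    /** Number of persisted items (helper for this sketch only). */
    int size() { return committed.size(); }
}

final class IndexingDemo {
    public static void main(String[] args) {
        InMemoryIndex index = new InMemoryIndex();
        index.index(new ContentItem("urn:content:1"));
        index.index(new ContentItem("urn:content:2"));
        index.persist(System.currentTimeMillis());
        System.out.println("persisted items: " + index.size());
    }
}
```

A caller would typically issue many `index(..)` and `remove(..)` calls and finish the batch with a single `persist(..)`, passing the revision that the Store reported for the last change.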

          1. Index State

      Semantic indexes provide the following state information:

      :::java
      /** The state of the semantic index */
      + getState() : IndexState

      The IndexState is a simple Java enum that defines the following states:

      • <code>UNINIT</code>: The index was defined and the configuration is OK, but the contents are not yet indexed and indexing has not yet started. (Intended to be used as the default state after creation.)
      • <code>INDEXING</code>: The (initial) indexing of content items is currently in progress. This indicates that the index is currently NOT active.
      • <code>ACTIVE</code>: The semantic index is available and in sync.
      • <code>REINDEXING</code>: The re-indexing of content items is currently in progress. This indicates that the configuration of the semantic index was changed in a way that requires rebuilding the whole semantic index. The index must still be active in this state, meaning that searches can be performed normally, but recent updates/changes to ContentItems might not be reflected. This also indicates that the index will be replaced by a different version (maybe with changed fields) in the near future.

      Note that there are no states for INACTIVE and ERROR. This is because such states are already covered by the normal OSGi component life cycle. All of the above IndexStates require the SemanticIndex component to be active.
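As a rough illustration of how a client might interpret these states, the following sketch declares the enum with the four constants listed above. The helper methods `isSearchable` and `isInSync` are hypothetical and not part of the specified API; they only encode the descriptions given for each state.

```java
/** The four states named in the text. */
enum IndexState { UNINIT, INDEXING, ACTIVE, REINDEXING }

/** Hypothetical helpers that encode the state descriptions above. */
final class IndexStates {

    /** Searches work in ACTIVE and, with possibly stale results, in REINDEXING. */
    static boolean isSearchable(IndexState state) {
        return state == IndexState.ACTIVE || state == IndexState.REINDEXING;
    }

    /** Only ACTIVE guarantees that the index reflects recent changes. */
    static boolean isInSync(IndexState state) {
        return state == IndexState.ACTIVE;
    }

    public static void main(String[] args) {
        for (IndexState state : IndexState.values()) {
            System.out.println(state + " searchable=" + isSearchable(state)
                    + " inSync=" + isInSync(state));
        }
    }
}
```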

          1. Index Inspection

      The semantic index interface provides a very simple API to inspect the configuration of the semantic index. This part of the interface is considered optional. Implementations that cannot provide such information shall return <code>null</code> from the methods below.

      :::java
      /** The names of all fields defined by this Index */
      + getFieldsNames() : List<String>
      /** Getter for the field properties */
      + getFieldProperties(String name) : Map<String,Object>

      Keys for well known properties shall be defined by the services API of the ContentHub. This includes the following:

      :::java
      /** The xsd:dataType for the values of a field */
      DATATYPE

      Implementation-specific keys shall be defined by the implementations of the semantic index interface. Here is a possible key for an LDPath-based semantic index implementation:

      :::java
      /** The LDPath rule used for a field */
      LDPATH
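Because the inspection methods are optional and may return <code>null</code>, callers need to guard every lookup. The following is a minimal sketch under stated assumptions: `InspectableIndex` is a hypothetical cut-down view of the interface containing only the two inspection methods, the string value of the `DATATYPE` key is assumed, and `FixedIndex` is an invented test implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical view of the inspection part of the SemanticIndex interface. */
interface InspectableIndex {
    List<String> getFieldsNames();
    Map<String, Object> getFieldProperties(String name);
}

final class FieldInspection {
    /** Key name taken from the text; the string value is an assumption. */
    static final String DATATYPE = "DATATYPE";

    /** Returns one "field : datatype" line per field, tolerating null results. */
    static List<String> describe(InspectableIndex index) {
        List<String> names = index.getFieldsNames();
        if (names == null) {
            return Collections.emptyList(); // inspection is not supported
        }
        List<String> lines = new ArrayList<>();
        for (String name : names) {
            Map<String, Object> props = index.getFieldProperties(name);
            Object type = (props == null) ? null : props.get(DATATYPE);
            lines.add(name + " : " + (type == null ? "unknown" : type));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describe(new FixedIndex())) {
            System.out.println(line);
        }
    }
}

/** Invented implementation with two fields, only one of which has properties. */
final class FixedIndex implements InspectableIndex {
    @Override
    public List<String> getFieldsNames() {
        return Arrays.asList("title", "body");
    }

    @Override
    public Map<String, Object> getFieldProperties(String name) {
        if (!"title".equals(name)) {
            return null; // no properties known for this field
        }
        Map<String, Object> props = new HashMap<>();
        props.put(FieldInspection.DATATYPE, "xsd:string");
        return props;
    }
}
```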

          1. Search

      The semantic index does NOT define methods to search its contents, as the intention is to directly use the search APIs of the technologies/frameworks used to hold the semantic index, such as

      • [Apache Solr](http://lucene.apache.org/solr) RESTful API
      • SPARQL in case a TripleStore is used as Semantic index.
      • Contenthub featured search interface

      However, the semantic index has two methods that can be used to get information about the supported search interfaces.

      :::java
      /** Getter for all supported RESTful search endpoints */
      + getRESTSearchEndpoints() : Map<String,String>
      /** Getter for all supported search components */
      + getSearchEndpoints() : Map<Class,ServiceReference>

      The method returning the RESTful search interfaces uses a key representing the type of the RESTful service. The method returning the components uses the Java interface (Class) as key and an OSGi ServiceReference to the actual component as value. The latter is intended for users who want to perform queries on the Contenthub by using the Java API.

      TODO: Define a set of properties that SemanticIndex implementations MUST add to search components so that users can also use the normal ServiceTracker and @Reference annotations to use search components!

      For example, these are the values for the semantic index named "default", which supports Solr and Contenthub featured search as RESTful search services:

      :::text
      "CONTENTHUB" : "http://localhost:8080/contenthub/search/featured"
      "SOLR" : "http://localhost:8080/solr/contenthub/default"

      In addition, the following search components are supported:

      :::text
      org.apache.stanbol.contenthub.servicesapi.search.featured.FeaturedSearch : {service-reference-instance}
      org.apache.stanbol.contenthub.servicesapi.search.solr.SolrSearch : {an-other-service-reference-instance}
      org.apache.solr.client.solrj.SolrServer : {an-service-reference-to-the-solr-server}

      Another example is an index named "knowledgebase" that supports a SPARQL endpoint:

      :::text
      "SPARQL" : "http://localhost:8080/sparql/contenthub/knowledgebase"

      as a RESTful service and

      :::text
      org.apache.clerezza.rdf.core.sparql.QueryEngine *)

      as a component to perform SPARQL queries.

      *) NOTE that the QueryEngine interface is used here only as an example. A real implementation would need to wrap it in some other interface that does not need the TcManager and TripleCollection to execute a query. These two MUST be provided by the "knowledgebase" SemanticIndex.

      Semantic Index Management
      -------------------------

      Semantic indexes are registered as OSGi components implementing the "SemanticIndex" interface as described above. All active semantic indexes are managed by the SemanticIndexManager component as follows:

          1. Interface

      Provides a Java API that allows the lookup of all active semantic indexes. This includes indexes in the UNINIT, INDEXING, ACTIVE and REINDEXING states.

      Lookup of semantic indexes is supported based on name and search endpoint type.

      :::java
      + getIndex(String name) : SemanticIndex
      + getIndexes(String name) : List<SemanticIndex>

      + getIndex(String endpointType) : SemanticIndex
      + getIndexes(String endpointType) : List<SemanticIndex>

      + getIndex(String name, String endpointType) : SemanticIndex
      + getIndexes(String name, String endpointType) : List<SemanticIndex>

      A typical query would be for an index with the name "simple" with the "SOLR" endpoint.

      :::java
      SemanticIndexManager indexManager;
      SemanticIndex index = indexManager.getIndex("simple", EndpointType.SOLR);
      String solrEndpoint = index.getRESTSearchEndpoints().get(EndpointType.SOLR);

      The methods returning a single index need to resolve cases with multiple matches by returning the SemanticIndex service

      1. with the highest "service.ranking" and
      2. among those, the lowest "service.id".

      This ensures that the behavior is consistent with the typical rules for service selection as defined by the OSGi specification.
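The selection rule above can be expressed as a comparator. The sketch below does not use the OSGi API: `IndexRef` is a hypothetical value class that simply carries an index name together with the "service.ranking" and "service.id" properties, and `IndexSelection.select` is an invented helper that shows one possible way to resolve a name conflict.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Hypothetical holder for the properties relevant to service selection. */
final class IndexRef {
    final String name;
    final int ranking;    // value of "service.ranking"
    final long serviceId; // value of "service.id"

    IndexRef(String name, int ranking, long serviceId) {
        this.name = name;
        this.ranking = ranking;
        this.serviceId = serviceId;
    }
}

final class IndexSelection {
    /** Highest ranking first; ties are broken by the lowest service id. */
    static final Comparator<IndexRef> OSGI_ORDER =
            Comparator.comparingInt((IndexRef r) -> r.ranking).reversed()
                      .thenComparingLong(r -> r.serviceId);

    /** Picks the preferred index among all candidates with the given name. */
    static Optional<IndexRef> select(Collection<IndexRef> candidates, String name) {
        return candidates.stream()
                .filter(r -> r.name.equals(name))
                .min(OSGI_ORDER);
    }

    public static void main(String[] args) {
        List<IndexRef> refs = Arrays.asList(
                new IndexRef("simple", 0, 5L),
                new IndexRef("simple", 10, 9L),
                new IndexRef("simple", 10, 7L));
        select(refs, "simple").ifPresent(r ->
                System.out.println("selected service.id=" + r.serviceId));
    }
}
```

Using `min` with a comparator that sorts the preferred service first avoids sorting the full candidate list when only a single index is requested.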

            People

              Unassigned Unassigned
              rwesten Rupert Westenthaler