  1. Stanbol (Retired)
  2. STANBOL-593

EntityIterator implementation based on Jena TDB that allows filtering Entities based on Triple Filters

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Version: 0.9.0-incubating
    • Component: Entityhub
    • None

    Description

      The FieldValueProcessor (an EntityProcessor) can already filter Entities based on Triple Filters. However, this requires iterating over all Entities, which is very inefficient if only a small fraction of them should be indexed.

      To achieve better performance in such cases, a component is needed that applies similar filtering within the Indexing Source itself. Such a component is easy to implement on top of Jena TDB, as its low-level API natively supports filtered iterators.
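      The idea of filtering at the source can be sketched in plain Java. This is only an illustration under stated assumptions: it does not use the Jena TDB API, and `FilteredSubjectsSketch`, `Triple` and `filteredSubjects` are hypothetical names. The in-memory list below is still scanned in full; with TDB, the benefit would come from the store resolving the filter itself instead of handing every entity to a later processing step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

final class FilteredSubjectsSketch {

    /** Hypothetical stand-in for an RDF triple; this is not a Jena class. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /**
     * Returns the distinct subjects that have a triple whose predicate equals
     * {@code field} and whose object is one of {@code values}. Only these
     * subjects are ever handed to the caller.
     */
    static Iterator<String> filteredSubjects(List<Triple> store, String field, Set<String> values) {
        return store.stream()
                .filter(t -> t.predicate.equals(field) && values.contains(t.object))
                .map(t -> t.subject)
                .distinct() // a subject may match more than one accepted value
                .iterator();
    }

    public static void main(String[] args) {
        List<Triple> store = Arrays.asList(
                new Triple("ex:alice", "rdf:type", "dbp-ont:Person"),
                new Triple("ex:alice", "rdfs:label", "Alice"),
                new Triple("ex:song", "rdf:type", "dbp-ont:Song"),
                new Triple("ex:vienna", "rdf:type", "dbp-ont:Place"));
        Set<String> types = new HashSet<>(Arrays.asList("dbp-ont:Person", "dbp-ont:Place"));

        List<String> ids = new ArrayList<>();
        filteredSubjects(store, "rdf:type", types).forEachRemaining(ids::add);
        System.out.println(ids);
    }
}
```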

      Indexing configurations would then use an EntityIterator/EntityDataProvider combination as the source for the indexing. A corresponding configuration would look like this:

      entityIdIterator=org.apache.stanbol.entityhub.indexing.source.jenatdb.ResourceFilterIterator,config:entityTypes.properties
      entityDataProvider=org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource,source:rdfdata
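      Each value above appears to consist of a class name followed by comma-separated `key:value` parameters. A minimal sketch of reading such a value follows, assuming exactly that syntax; the actual Stanbol indexing configuration parser may behave differently, and `SourceConfigSketch` and `parse` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SourceConfigSketch {
    final String className;
    final Map<String, String> params;

    SourceConfigSketch(String className, Map<String, String> params) {
        this.className = className;
        this.params = params;
    }

    /** Splits "some.Class,key:value,..." into a class name and its parameters. */
    static SourceConfigSketch parse(String value) {
        String[] parts = value.split(",");
        Map<String, String> params = new LinkedHashMap<>();
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i].trim();
            int sep = part.indexOf(':');
            if (sep < 0) {
                params.put(part, ""); // parameter without a value (assumed to be allowed)
            } else {
                params.put(part.substring(0, sep).trim(), part.substring(sep + 1).trim());
            }
        }
        return new SourceConfigSketch(parts[0].trim(), params);
    }

    public static void main(String[] args) {
        SourceConfigSketch cfg = parse(
                "org.apache.stanbol.entityhub.indexing.source.jenatdb.ResourceFilterIterator,"
                        + "config:entityTypes.properties");
        System.out.println(cfg.className + " " + cfg.params);
    }
}
```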

      The entityTypes.properties file would require the following properties:

      field=rdf:type
      values=dbp-ont:Person;dbp-ont:Place;dbp-ont:Organisation

      With this configuration, the indexing process would iterate only over the Persons, Places and Organisations present within the IndexingSource.
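      As an illustration, the two properties could be read with `java.util.Properties`, splitting `values` on `;`. This is a sketch, not the actual ResourceFilterIterator code: `EntityTypesConfigSketch` and `load` are hypothetical names, and namespace prefixes such as `dbp-ont:` are left unexpanded here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Properties;
import java.util.Set;

final class EntityTypesConfigSketch {
    final String field;
    final Set<String> values;

    EntityTypesConfigSketch(String field, Set<String> values) {
        this.field = field;
        this.values = values;
    }

    /** Parses the text of a properties file containing "field" and "values". */
    static EntityTypesConfigSketch load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String field = props.getProperty("field");
        String raw = props.getProperty("values");
        if (field == null || raw == null) {
            throw new IllegalArgumentException("both 'field' and 'values' are required");
        }
        Set<String> values = new LinkedHashSet<>();
        for (String v : raw.split(";")) {
            String trimmed = v.trim();
            if (!trimmed.isEmpty()) {
                values.add(trimmed); // keep the order given in the file
            }
        }
        return new EntityTypesConfigSketch(field.trim(), values);
    }

    public static void main(String[] args) {
        EntityTypesConfigSketch cfg = load(
                "field=rdf:type\nvalues=dbp-ont:Person;dbp-ont:Place;dbp-ont:Organisation\n");
        System.out.println(cfg.field + " -> " + cfg.values);
    }
}
```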

          People

            Assignee: rwesten Rupert Westenthaler
            Reporter: rwesten Rupert Westenthaler