[STANBOL-1016] Add RDF Triple Filter support to the Jena TDB Indexing Source - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.12.0
Component/s: Entityhub
Labels: None

Description

The freebase.com dump has ~1,200,000,000 triples. Loading them into Jena TDB takes ages unless the RAM available to the memory-mapped files is large enough to hold the data. Once the number of imported triples exceeds the available RAM, the import speed decreases to ~7k triples/sec on an SSD; at that rate the full dump would need roughly 48 hours. The logs show about 1.5k reads and 1k writes per second just to reach those 7k triples/sec, so imports on conventional hard disks should be much slower.

Because most of the triples in the Freebase dump are not relevant for indexing, this issue introduces a new feature for the Jena TDB Indexing Source that allows triples to be filtered out at a very low level.

This filter will operate on the triples provided by the RIOT parser and will define a single method:

accept(Node subject, Node predicate, Node object) : boolean
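The method above could be sketched as follows. This is only an illustration under stated assumptions: the Node class here is a minimal stand-in for the Jena Node type so that the example compiles on its own, and PredicateWhitelistFilter is a hypothetical implementation, not part of Stanbol.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class ImportFilterSketch {

    /** Minimal stand-in for the Jena Node type (assumption, for self-containment). */
    static final class Node {
        private final String uri;
        Node(String uri) { this.uri = uri; }
        String getURI() { return uri; }
    }

    /** Filter shape as described in this issue: a single accept method. */
    interface RdfImportFilter {
        boolean accept(Node subject, Node predicate, Node object);
    }

    /** Hypothetical filter that keeps only triples whose predicate is whitelisted. */
    static final class PredicateWhitelistFilter implements RdfImportFilter {
        private final Set<String> allowed;

        PredicateWhitelistFilter(String... predicateUris) {
            this.allowed = new HashSet<String>(Arrays.asList(predicateUris));
        }

        @Override
        public boolean accept(Node subject, Node predicate, Node object) {
            // Decide on the predicate only; subject and object are ignored here.
            return allowed.contains(predicate.getURI());
        }
    }

    public static void main(String[] args) {
        RdfImportFilter filter = new PredicateWhitelistFilter(
                "http://www.w3.org/2000/01/rdf-schema#label");
        Node s = new Node("http://example.org/entity/1");
        Node label = new Node("http://www.w3.org/2000/01/rdf-schema#label");
        Node other = new Node("http://example.org/irrelevant");
        Node o = new Node("http://example.org/value");
        System.out.println("label triple accepted: " + filter.accept(s, label, o));
        System.out.println("other triple accepted: " + filter.accept(s, other, o));
    }
}
```

Rejecting a triple before it reaches the store is what saves the I/O: a filtered triple never has to be written to the memory-mapped files.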

In addition, the interface will extend IndexingComponent, which will allow it to be configured via the configuration file of the

org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource

The parameter used to configure the filter will be called "import-filter", and its value MUST be the class name of the implementation to use.

The configuration of the jenatdb.RdfIndexingSource will be passed to the import filter's #setConfiguration(..) method. This means that users will need to add the configuration properties for the import filter to the configuration of the RdfIndexingSource.
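A rough sketch of how a filter might pick its own properties out of that shared configuration is shown below. The Map<String,Object> parameter type, the property key "example-filter-predicates", the ";" separator, and the class names are all assumptions made for illustration; only the "import-filter" key comes from this issue.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

final class ImportFilterConfigSketch {

    /** Hypothetical property key that the filter reads from the shared configuration. */
    static final String PARAM_PREDICATES = "example-filter-predicates";

    static final class ConfigurablePredicateFilter {
        private final Set<String> allowed = new LinkedHashSet<String>();

        /** Mirrors the #setConfiguration(..) hook; the parameter type is assumed. */
        void setConfiguration(Map<String, Object> config) {
            allowed.clear();
            Object value = config.get(PARAM_PREDICATES);
            if (value == null) {
                return; // nothing configured: the filter stays permissive
            }
            for (String uri : value.toString().split(";")) {
                String trimmed = uri.trim();
                if (!trimmed.isEmpty()) {
                    allowed.add(trimmed);
                }
            }
        }

        /** Accepts everything when no predicates are configured. */
        boolean acceptsPredicate(String predicateUri) {
            return allowed.isEmpty() || allowed.contains(predicateUri);
        }
    }

    public static void main(String[] args) {
        // The same map carries the indexing source settings and the filter settings.
        Map<String, Object> config = new HashMap<String, Object>();
        config.put("import-filter", "com.example.ConfigurablePredicateFilter");
        config.put(PARAM_PREDICATES,
                "http://www.w3.org/2000/01/rdf-schema#label; http://example.org/name");

        ConfigurablePredicateFilter filter = new ConfigurablePredicateFilter();
        filter.setConfiguration(config);
        System.out.println(filter.acceptsPredicate("http://example.org/name"));
        System.out.println(filter.acceptsPredicate("http://example.org/other"));
    }
}
```

Because filter and source share one configuration, a filter should ignore keys it does not know and use distinctive names for its own properties to avoid clashes.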

To keep things simple, the RdfImportFilter interface will be specific to the Jena TDB Indexing Source.

People

Assignee: Rupert Westenthaler

Reporter: Rupert Westenthaler

Votes: 0

Watchers: 1

Dates

Created: 04/Apr/13 06:45

Updated: 17/Jul/13 15:07

Resolved: 15/Apr/13 07:23