Description
The freebase.com dump has ~1,200,000,000 triples. Loading those triples into Jena TDB takes a very long time if the RAM available to the memory-mapped files is not large enough to hold the data. Once the number of imported triples exceeds what fits into the available RAM, the import speed decreases to ~7k triples/sec on an SSD. To reach those 7k triples/sec the logs show 1.5k reads and 1k writes per second, so import speeds on conventional hard disks should be much slower.
As most of the triples contained in the Freebase dump are not relevant for indexing, this issue will introduce a new feature to the Jena TDB Indexing Source that allows triples to be filtered out at a very low level, during the import.
This filter will operate on the triples provided by the RIOT parser and define a single method:
accept(Node subject, Node predicate, Node object) : boolean
In addition, the interface will extend IndexingComponent, which will allow it to be configured via the configuration file of the
org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource
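The interface described above could look roughly like the following. This is a minimal sketch, not the actual Stanbol code: `Node` and `IndexingComponent` are simplified stand-ins for the Jena and Stanbol types so that the example compiles on its own, and `PredicatePrefixFilter` together with its "import-filter-prefixes" property are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Stand-in for Jena's Node (assumption): only carries the URI or lexical form.
final class Node {
    final String value;
    Node(String value) { this.value = value; }
}

// Stand-in for Stanbol's IndexingComponent (assumption): only the
// configuration hook is modelled here.
interface IndexingComponent {
    void setConfiguration(Map<String, Object> config);
}

// The proposed filter: one method, called for every parsed triple.
interface RdfImportFilter extends IndexingComponent {
    boolean accept(Node subject, Node predicate, Node object);
}

// Hypothetical implementation: keeps only triples whose predicate starts
// with one of the configured prefixes.
final class PredicatePrefixFilter implements RdfImportFilter {
    // Hypothetical property name, read from the indexing source configuration.
    static final String PARAM_PREFIXES = "import-filter-prefixes";

    private final List<String> prefixes = new ArrayList<>();

    @Override
    public void setConfiguration(Map<String, Object> config) {
        prefixes.clear();
        Object value = config.get(PARAM_PREFIXES);
        if (value == null) {
            return; // nothing configured: accept everything
        }
        for (String prefix : value.toString().split(",")) {
            String trimmed = prefix.trim();
            if (!trimmed.isEmpty()) {
                prefixes.add(trimmed);
            }
        }
    }

    @Override
    public boolean accept(Node subject, Node predicate, Node object) {
        if (prefixes.isEmpty()) {
            return true;
        }
        for (String prefix : prefixes) {
            if (predicate.value.startsWith(prefix)) {
                return true;
            }
        }
        return false; // triple is dropped before it reaches TDB
    }
}
```

Because rejected triples never reach the store, a filter of this kind would reduce both the number of writes and the size of the memory-mapped files.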
The parameter used to configure the filter will be called "import-filter", and its value MUST be the class name of the implementation to use.
The configuration of the jenatdb.RdfIndexingSource will be passed to the import filter's #setConfiguration(..) method. This means that users will need to add the configuration properties for the import filter to the configuration of the RdfIndexingSource.
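One way the indexing source might resolve the "import-filter" parameter and hand over its own configuration is sketched below. This is an assumption about the mechanics, not the actual implementation: `ImportFilterLoader` and `AcceptAllFilter` are hypothetical, and `Node` and `IndexingComponent` are again simplified stand-ins for the Jena and Stanbol types.

```java
import java.util.Map;

// Simplified stand-ins (assumptions) so the sketch compiles on its own.
final class Node {
    final String value;
    Node(String value) { this.value = value; }
}

interface IndexingComponent {
    void setConfiguration(Map<String, Object> config);
}

interface RdfImportFilter extends IndexingComponent {
    boolean accept(Node subject, Node predicate, Node object);
}

// Hypothetical trivial filter, used here only as a loadable class.
final class AcceptAllFilter implements RdfImportFilter {
    Map<String, Object> config; // kept to show what was passed in

    @Override
    public void setConfiguration(Map<String, Object> config) {
        this.config = config;
    }

    @Override
    public boolean accept(Node subject, Node predicate, Node object) {
        return true;
    }
}

// Hypothetical helper showing how the source could create the filter.
final class ImportFilterLoader {
    static final String PARAM_IMPORT_FILTER = "import-filter";

    /** Returns the configured filter, or null if none is configured. */
    static RdfImportFilter load(Map<String, Object> sourceConfig) {
        Object className = sourceConfig.get(PARAM_IMPORT_FILTER);
        if (className == null) {
            return null; // no filter: all triples are imported
        }
        try {
            Class<?> type = Class.forName(className.toString());
            if (!RdfImportFilter.class.isAssignableFrom(type)) {
                throw new IllegalArgumentException(
                    className + " does not implement RdfImportFilter");
            }
            RdfImportFilter filter =
                (RdfImportFilter) type.getDeclaredConstructor().newInstance();
            // The filter receives the configuration of the indexing source,
            // so its own properties must be part of that configuration.
            filter.setConfiguration(sourceConfig);
            return filter;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(
                "Unable to create import filter " + className, e);
        }
    }
}
```

Passing the source's configuration map unchanged keeps the loader simple, at the cost of the filter's properties sharing a namespace with those of the RdfIndexingSource.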
To keep things simple, the RdfImportFilter interface will be specific to the Jena TDB Indexing Source.