Description
Problem: I am transferring models via RDFConnection to TDB and seeing doubling of blank nodes in some graphs as though the same model is written a second time after a commit during the transfer. I apologize in advance for the length of this report.
Details: We have a collection of entity types: Persons, Items, Works and so on. Each entity is a graph in a ttl file in a per-type git repo. For each type, the ttl files are read from the corresponding repo into models, and the models are added to a Dataset until the number of triples in the dataset exceeds a threshold, e.g., 50,000 triples. When the threshold is exceeded, the dataset is loaded to Fuseki via an RDFConnection:
fuConn = RDFConnectionFactory.connect(baseUrl, baseUrl+"/query", baseUrl+"/update", baseUrl+"/data");
which is opened once at the beginning of loading all entity types. The kernel of loading is performed via:
private static void loadDatasetSimple(final Dataset ds) {
    if (!fuConn.isInTransaction()) {
        fuConn.begin(ReadWrite.WRITE);
    }
    fuConn.loadDataset(ds);
    fuConn.commit();
}
loadDatasetSimple is called until all of the entities of a given type have been loaded from the corresponding repo. Because some models may not yet have been transferred after all of the entities of a given type have been read, a finish method is then called:
static void finishDatasetTransfers() {
    // if map is not empty, transfer the last one
    if (currentDataset != null) {
        loadDatasetSimple(currentDataset);
    }
}
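The batching and final flush described above can be sketched without Jena. This is only an illustration under stated assumptions: the class name BatchingSketch, its methods, and the graph names are hypothetical, graphs are reduced to a name and a triple count, and a list stands in for Fuseki. The report does not show whether currentDataset is replaced or emptied after finishDatasetTransfers; the sketch shows a variant that always starts a fresh batch after each transfer, so no graph is sent twice.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the batching logic; a graph is just name -> triple count. */
final class BatchingSketch {
    private final int threshold;
    // Stands in for the server: one entry per transferred batch.
    private final List<Map<String, Integer>> sent = new ArrayList<>();
    private Map<String, Integer> current = new LinkedHashMap<>();
    private int currentTriples = 0;

    BatchingSketch(int threshold) {
        this.threshold = threshold;
    }

    /** Add one graph; transfer the batch once it exceeds the threshold. */
    void add(String graphName, int tripleCount) {
        current.put(graphName, tripleCount);
        currentTriples += tripleCount;
        if (currentTriples > threshold) {
            flush();
        }
    }

    /** Transfer whatever is pending, then start a fresh, empty batch. */
    void flush() {
        if (current.isEmpty()) {
            return;
        }
        sent.add(current);
        current = new LinkedHashMap<>(); // assumption: the batch is reset after every transfer
        currentTriples = 0;
    }

    List<Map<String, Integer>> sent() {
        return sent;
    }

    public static void main(String[] args) {
        BatchingSketch b = new BatchingSketch(100);
        b.add("person-1", 60);
        b.add("person-2", 60);  // 120 > 100, so this batch is transferred
        b.add("person-3", 10);
        b.flush();              // end of the first type: transfer the remainder
        b.add("work-1", 150);   // first batch of the next type
        System.out.println(b.sent());
    }
}
```

If the reset in flush() were omitted, the remainder sent at the end of one type would still be in the batch when the first transfer of the next type happens.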
After loading a given type of entity the next type in a list of types to transfer is processed as described above and this is when the problem is noticed.
Once enough models of the next type have been added to the transfer dataset and that dataset is transferred via loadDatasetSimple, some of the previously transferred graphs exhibit doubled blank nodes. Here is the result of DESCRIBE bdr:P58 to illustrate the doubling:
@prefix :     <http://purl.bdrc.io/ontology/core/> .
@prefix bdr:  <http://purl.bdrc.io/resource/> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix adm:  <http://purl.bdrc.io/ontology/admin/> .

bdr:P58 a :Person ;
    adm:gitRevision "e5e094dd8803f851448aac6ff3a800205ff8ef00" ;
    adm:status bdr:StatusReleased ;
    :hasFather bdr:P4342 ;
    :hasMother bdr:P4343 ;
    :personEvent [ a :PersonOccupiesSeat ; :personEventPlace bdr:G227 ] ;
    :personEvent [ a :PersonOccupiesSeat ; :personEventPlace bdr:G227 ] ;
    :personEvent [ a :PersonBirth ; :onOrAbout "1402" ; :personEventPlace bdr:G547 ] ;
    :personEvent [ a :PersonOccupiesSeat ; :personEventPlace bdr:G235 ] ;
    :personEvent [ a :PersonOccupiesSeat ; :personEventPlace bdr:G235 ] ;
    :personEvent [ a :PersonDeath ; :onOrAbout "1472" ] ;
    :personEvent [ a :PersonDeath ; :onOrAbout "1472" ] ;
    :personEvent [ a :PersonBirth ; :onOrAbout "1402" ; :personEventPlace bdr:G547 ] ;
    :personGender bdr:GenderMale ;
    :personName [ a :PersonPrimaryTitle ; rdfs:label "spyan snga blo gros rgyal mtshan/"@bo-x-ewts ] ;
    :personName [ a :PersonPrimaryTitle ; rdfs:label "spyan snga blo gros rgyal mtshan/"@bo-x-ewts ] ;
    :personName [ a :PersonChineseName ; rdfs:label "金厄·洛卓坚赞"@zh ] ;
    :personName [ a :PersonTitle ; rdfs:label "rgya ma spyan snga ba blo gros rgyal mtshan/"@bo-x-ewts ] ;
    :personName [ a :PersonPrimaryName ; rdfs:label "blo gros rgyal mtshan/"@bo-x-ewts ] ;
    :personName [ a :PersonTitle ; rdfs:label "rgya ma spyan snga ba blo gros rgyal mtshan/"@bo-x-ewts ] ;
    :personName [ a :PersonPrimaryName ; rdfs:label "blo gros rgyal mtshan/"@bo-x-ewts ] ;
    :personName [ a :PersonFirstOrdinationName ; rdfs:label "blo gros rgyal mtshan/"@bo-x-ewts ] ;
    :personName [ a :PersonChineseName ; rdfs:label "金厄·洛卓坚赞"@zh ] ;
    :personName [ a :PersonFirstOrdinationName ; rdfs:label "blo gros rgyal mtshan/"@bo-x-ewts ] ;
    skos:prefLabel "blo gros rgyal mtshan/"@bo-x-ewts .
This doubling is completely reproducible and the same graphs exhibit doubling on each trial.
Varying the threshold changes which graphs and how many graphs exhibit doubling. If the threshold is set higher, e.g., to 100,000 triples per call to loadDatasetSimple, then many more graphs exhibit doubling. If the threshold is set lower, say to 20,000 triples, then fewer graphs exhibit doubling. If only a single model at a time is transferred, then there is no doubling.
Also, if each type of entity is transferred separately, that is, opening the connection, transferring all models of the type, then closing down via:
public static void closeConnections() {
    TransferHelpers.logger.info("closeConnections fuConn.commit, end, close");
    FusekiHelpers.fuConn.commit();
    FusekiHelpers.fuConn.end();
    FusekiHelpers.fuConn.close();
}
then there is no doubling.
It appears that models that have already been transferred and committed are being written a second time when switching to a new type and upon the first transfer via loadDatasetSimple of the new type.
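A toy model, not Jena code, of why a second write of the same graph would show up only as doubled blank nodes: blank node labels are scoped to the document being parsed, so each upload that adds data gets fresh blank nodes, while triples made only of IRIs and literals collapse under set semantics. This matches the DESCRIBE output above, where :hasFather appears once but every bracketed :personEvent and :personName appears twice. The class BlankNodeDoublingSketch, its post method, and the relabelling scheme are hypothetical illustrations.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical toy store: a triple is a 3-element list; "_:" marks a blank node label. */
final class BlankNodeDoublingSketch {
    private final Set<List<String>> store = new HashSet<>();
    private int uploads = 0;

    /** Add a payload to the store; its blank nodes get labels unique to this upload. */
    void post(List<List<String>> payload) {
        uploads++;
        for (List<String> t : payload) {
            store.add(List.of(scope(t.get(0)), t.get(1), scope(t.get(2))));
        }
    }

    /** Blank node labels are only meaningful within one upload, so make them distinct. */
    private String scope(String term) {
        return term.startsWith("_:") ? term + "#upload" + uploads : term;
    }

    int size() {
        return store.size();
    }

    public static void main(String[] args) {
        List<List<String>> graph = List.of(
            List.of("bdr:P58", ":personGender", "bdr:GenderMale"),
            List.of("bdr:P58", ":personEvent", "_:e1"),
            List.of("_:e1", ":onOrAbout", "\"1472\""));
        BlankNodeDoublingSketch s = new BlankNodeDoublingSketch();
        s.post(graph);
        System.out.println(s.size()); // size after the first upload
        s.post(graph);
        System.out.println(s.size()); // the IRI-only triple is unchanged; blank node triples are added again
    }
}
```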
I'm hoping there's enough information in this report to identify what sort of error in usage of RDFConnection and/or TDB would account for this behavior. If this appears to be a bug in Jena then I will have to expend more effort to create a relatively self-contained test case.
Here is the relevant portion of the Fuseki configuration:
@prefix fuseki: <http://jena.apache.org/fuseki#> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:    <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:     <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix :       <http://base/#> .
@prefix text:   <http://jena.apache.org/text#> .
@prefix skos:   <http://www.w3.org/2004/02/skos/core#> .

[] rdf:type fuseki:Server ;
    fuseki:services ( :bdrcrw ) .

:bdrcrw rdf:type fuseki:Service ;
    fuseki:name "bdrcrw" ;                      # name of the dataset in the url
    fuseki:serviceQuery "query" ;               # SPARQL query service
    fuseki:serviceUpdate "update" ;             # SPARQL update service
    fuseki:serviceUpload "upload" ;             # Non-SPARQL upload service
    fuseki:serviceReadWriteGraphStore "data" ;  # SPARQL Graph store protocol (read and write)
    fuseki:dataset :bdrc_text_dataset ;
    .

:bdrc_text_dataset rdf:type text:TextDataset ;
    text:dataset :dataset_bdrc ;
    text:index :bdrc_lucene_index ;
    .

:dataset_bdrc rdf:type tdb:DatasetTDB ;
    tdb:location "/etc/fuseki/databases/bdrc" ;
    tdb:unionDefaultGraph true ;
    .