Details
-
Question
-
Status: Closed
-
Major
-
Resolution: Feedback Received
-
Jena 4.0.0, Jena 4.2.0, Jena 4.4.0
-
None
-
None
-
Java maximum memory: 12884901888
symbol:http://jena.apache.org/ARQ#regexImpl = symbol:http://jena.apache.org/ARQ#javaRegex
symbol:http://jena.apache.org/ARQ#registryFunctions = org.apache.jena.sparql.function.FunctionRegistry@1536602f
symbol:http://jena.apache.org/ARQ#constantBNodeLabels = true
symbol:http://jena.apache.org/ARQ#registryPropertyFunctions = org.apache.jena.sparql.pfunction.PropertyFunctionRegistry@4ebea12c
symbol:http://jena.apache.org/ARQ#stageGenerator = org.apache.jena.tdb2.solver.StageGeneratorDirectTDB@2a1edad4
symbol:http://jena.apache.org/ARQ#enablePropertyFunctions = true
symbol:http://jena.apache.org/ARQ#strictSPARQL = false
13:02:36 INFO loader :: Loader = LoaderParallel
13:02:36 INFO loader :: Start: 6 files
13:02:48 INFO loader :: Add: 500,000 bdmhistoricalrecords.nq (Batch: 40,361 / Avg: 40,361)
13:03:00 INFO loader :: Add: 1,000,000 bdmhistoricalrecords.nq (Batch: 44,907 / Avg: 42,513)
13:03:10 INFO loader :: Add: 1,500,000 bdmhistoricalrecords.nq (Batch: 47,980 / Avg: 44,191)
13:03:25 INFO loader :: Add: 2,000,000 bdmhistoricalrecords.nq (Batch: 32,486 / Avg: 40,539)
13:33:06 INFO loader :: Add: 2,500,000 bdmhistoricalrecords.nq (Batch: 280 / Avg: 1,366)
14:30:30 INFO loader :: Add: 3,000,000 bdmhistoricalrecords.nq (Batch: 145 / Avg: 568)
14:52:29 INFO loader :: Add: 3,500,000 bdmhistoricalrecords.nq (Batch: 378 / Avg: 530)
Description
Kia ora, Hi there,
We have been using tdb2.tdbloader to load ~400,000,000 triples into our triplestore. All the data is in N-Quads (.nq) format, previously converted from JSON-LD. The files we are loading range from ~10GB to ~50GB, producing a triplestore of ~180GB including a text index. We run the loader in an HPC environment, so we can request as much memory as we need, often using 1TB to do the load. The job is run in a Singularity image (similar to Docker), and Slurm is the chosen workload manager.
All that aside, the load typically takes ~12-16 hours, and never more than 24 hours, with --loader=parallel at an average rate of ~5,000 triples per second. We haven't needed to run the loader since October 2021, and on recently running the load job again we are getting an overall average of only ~500 triples per second. We haven't been able to wait long enough to see whether it even finishes.
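To put the two rates in perspective, here is a back-of-envelope estimate of the total load time. It is only a sketch: it assumes the rate stays constant for the whole load and uses the approximate figures quoted above (~400,000,000 triples, ~5,000 vs ~500 triples per second). The function name `load_hours` is purely illustrative.

```python
# Rough load-time estimate, assuming a constant ingest rate (an assumption;
# real loads speed up and slow down as indexes grow).
TOTAL_TRIPLES = 400_000_000  # approximate dataset size quoted above


def load_hours(total_triples: int, triples_per_second: float) -> float:
    """Return the estimated load duration in hours at a constant rate."""
    return total_triples / triples_per_second / 3600


previous = load_hours(TOTAL_TRIPLES, 5_000)  # rate seen before October 2021
current = load_hours(TOTAL_TRIPLES, 500)     # rate seen now

print(f"at 5,000 triples/s: {previous:.1f} hours")
print(f"at   500 triples/s: {current:.1f} hours ({current / 24:.1f} days)")
# at 5,000 triples/s: 22.2 hours
# at   500 triples/s: 222.2 hours (9.3 days)
```

So at the current rate the job would need roughly nine days rather than under one, which is why waiting for it to finish has not been practical.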
Has anyone else experienced such a big performance loss with tdb2.tdbloader in the current or recent versions of Jena? Apart from whatever investigation can be done on the Slurm/HPC side, does anyone have advice around performance?
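For anyone comparing against their own runs, the loader's progress lines can be parsed to find the point where the batch rate collapses. The following is a minimal sketch, not a Jena tool: the regular expression is written against the log format shown in the environment section above, and the helper names (`parse_progress`, `first_collapse`) and the factor-of-10 threshold are assumptions chosen for illustration.

```python
import re
from datetime import datetime

# Matches progress lines like:
# 13:03:25 INFO loader :: Add: 2,000,000 file.nq (Batch: 32,486 / Avg: 40,539)
PROGRESS = re.compile(
    r"^(\d{2}:\d{2}:\d{2})\s+INFO\s+loader\s+::\s+Add:\s+([\d,]+)\s+\S+\s+"
    r"\(Batch:\s+([\d,]+)\s+/\s+Avg:\s+([\d,]+)\)"
)


def to_int(text):
    """Convert a comma-grouped number such as '2,500,000' to an int."""
    return int(text.replace(",", ""))


def parse_progress(log_text):
    """Return (time, total_added, batch_rate, avg_rate) for each progress line."""
    rows = []
    for line in log_text.splitlines():
        match = PROGRESS.match(line.strip())
        if match:
            ts, total, batch, avg = match.groups()
            rows.append((datetime.strptime(ts, "%H:%M:%S").time(),
                         to_int(total), to_int(batch), to_int(avg)))
    return rows


def first_collapse(rows, factor=10):
    """Return the first row whose batch rate is below 1/factor of the best
    batch rate seen earlier, or None if there is no such row."""
    best = 0
    for row in rows:
        if best and row[2] * factor < best:
            return row
        best = max(best, row[2])
    return None


# Three lines copied from the log above.
LOG = """\
13:03:10 INFO loader :: Add: 1,500,000 bdmhistoricalrecords.nq (Batch: 47,980 / Avg: 44,191)
13:03:25 INFO loader :: Add: 2,000,000 bdmhistoricalrecords.nq (Batch: 32,486 / Avg: 40,539)
13:33:06 INFO loader :: Add: 2,500,000 bdmhistoricalrecords.nq (Batch: 280 / Avg: 1,366)
"""

rows = parse_progress(LOG)
slow = first_collapse(rows)
print(slow[1], slow[2])  # 2500000 280
```

On the log above, this points at the batch ending at 2,500,000: the rate falls from tens of thousands per second to a few hundred somewhere between the 2,000,000 and 2,500,000 marks, about a minute into the load.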
Thanks in advance