Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
Jena 2.12.1
-
None
Description
Discovered this accidentally in the course of debugging a test case which seemed to running exceptionally slow. Basically the in-memory DatasetGraph implementation will perform extremely poorly when trying to iterate all quads it contains.
This problem manifests at relatively small scales and shows a marked difference between small numbers of quads in a single graph versus the quads in individual graphs. In my testing with only a few thousand quads placing them in a single graph versus many individual graphs can see 5/6 orders of magnitude performance difference.
I will attach code to reproduce and show my log output from running this on my system (Mac OS X Mavericks - Quad Core - 8GB RAM):
0 [main] INFO org.apache.jena.playground.SlowDatasetGraph - Testing with 1000 quads 57 [main] DEBUG com.hp.hpl.jena.shared.LockMRSW - Lock : main 90 [main] INFO org.apache.jena.playground.SlowDatasetGraph - DatasetGraph implementation com.hp.hpl.jena.sparql.core.DatasetGraphMaker has 1000 quads in 1 graphs 123 [main] INFO org.apache.jena.playground.SlowDatasetGraph - Took 0.021191 seconds to iterate over 1000 quads in 1 graphs 123 [main] DEBUG com.hp.hpl.jena.shared.LockMRSW - Lock : main 144 [main] INFO org.apache.jena.playground.SlowDatasetGraph - DatasetGraph implementation com.hp.hpl.jena.sparql.core.DatasetGraphMaker has 1000 quads in 1000 graphs 861 [main] INFO org.apache.jena.playground.SlowDatasetGraph - Took 0.716750 seconds to iterate over 1000 quads in 1000 graphs 861 [main] INFO org.apache.jena.playground.SlowDatasetGraph - Testing with 2000 quads 861 [main] DEBUG com.hp.hpl.jena.shared.LockMRSW - Lock : main 885 [main] INFO org.apache.jena.playground.SlowDatasetGraph - DatasetGraph implementation com.hp.hpl.jena.sparql.core.DatasetGraphMaker has 2000 quads in 1 graphs 890 [main] INFO org.apache.jena.playground.SlowDatasetGraph - Took 0.005400 seconds to iterate over 2000 quads in 1 graphs 890 [main] DEBUG com.hp.hpl.jena.shared.LockMRSW - Lock : main 922 [main] INFO org.apache.jena.playground.SlowDatasetGraph - DatasetGraph implementation com.hp.hpl.jena.sparql.core.DatasetGraphMaker has 2000 quads in 2000 graphs 7925 [main] INFO org.apache.jena.playground.SlowDatasetGraph - Took 7.002919 seconds to iterate over 2000 quads in 2000 graphs 7925 [main] INFO org.apache.jena.playground.SlowDatasetGraph - Testing with 3000 quads 7925 [main] DEBUG com.hp.hpl.jena.shared.LockMRSW - Lock : main 7944 [main] INFO org.apache.jena.playground.SlowDatasetGraph - DatasetGraph implementation com.hp.hpl.jena.sparql.core.DatasetGraphMaker has 3000 quads in 1 graphs 7948 [main] INFO org.apache.jena.playground.SlowDatasetGraph - Took 0.003924 seconds to iterate over 3000 quads in 1 graphs 7948 [main] DEBUG com.hp.hpl.jena.shared.LockMRSW - Lock : main 7973 [main] INFO org.apache.jena.playground.SlowDatasetGraph - DatasetGraph implementation com.hp.hpl.jena.sparql.core.DatasetGraphMaker has 3000 quads in 3000 graphs 36723 [main] INFO org.apache.jena.playground.SlowDatasetGraph - Took 28.749688 seconds to iterate over 3000 quads in 3000 graphs 36723 [main] INFO org.apache.jena.playground.SlowDatasetGraph - Testing with 4000 quads 36723 [main] DEBUG com.hp.hpl.jena.shared.LockMRSW - Lock : main 36727 [main] INFO org.apache.jena.playground.SlowDatasetGraph - DatasetGraph implementation com.hp.hpl.jena.sparql.core.DatasetGraphMaker has 4000 quads in 1 graphs 36730 [main] INFO org.apache.jena.playground.SlowDatasetGraph - Took 0.000773 seconds to iterate over 4000 quads in 1 graphs 36730 [main] DEBUG com.hp.hpl.jena.shared.LockMRSW - Lock : main 36741 [main] INFO org.apache.jena.playground.SlowDatasetGraph - DatasetGraph implementation com.hp.hpl.jena.sparql.core.DatasetGraphMaker has 4000 quads in 4000 graphs 111638 [main] INFO org.apache.jena.playground.SlowDatasetGraph - Took 74.896949 seconds to iterate over 4000 quads in 4000 graphs 111638 [main] INFO org.apache.jena.playground.SlowDatasetGraph - Testing with 5000 quads 111638 [main] DEBUG com.hp.hpl.jena.shared.LockMRSW - Lock : main 111644 [main] INFO org.apache.jena.playground.SlowDatasetGraph - DatasetGraph implementation com.hp.hpl.jena.sparql.core.DatasetGraphMaker has 5000 quads in 1 graphs 111645 [main] INFO org.apache.jena.playground.SlowDatasetGraph - Took 0.000989 seconds to iterate over 5000 quads in 1 graphs 111645 [main] DEBUG com.hp.hpl.jena.shared.LockMRSW - Lock : main 111655 [main] INFO org.apache.jena.playground.SlowDatasetGraph - DatasetGraph implementation com.hp.hpl.jena.sparql.core.DatasetGraphMaker has 5000 quads in 5000 graphs 285026 [main] INFO org.apache.jena.playground.SlowDatasetGraph - Took 173.370614 seconds to iterate over 5000 quads in 5000 graphs
From debugging my best guess is this is down to how DatasetGraphCollection (and it's hierarchy) build the quads iterator by iterating over the graph names and then iterating over the individual graphs appending the iterators together. Something in the structuring means accessing and appending all those individual iterators is drastically slow compared to iterating a single graph.
A related bug is that given a sufficiently large number of graphs attempts to iterate will fail with a StackOverflowError, the culprit in this case is IteratorCons.