[JENA-1848] Trig Writer slow; doesn't scale to many graphs - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: Jena 3.14.0
Fix Version/s: Jena 3.15.0
Component/s: RIOT
Labels: None

Description

The following code loads 1,000,000 named graphs in about 1 minute on my notebook, but I aborted my attempt to write the data out as TriG after several hours.

Dataset ds = RDFDataMgr.loadDataset("test-data.trig");
RDFDataMgr.write(new NullOutputStream(), ds, RDFFormat.TRIG_PRETTY);

In comparison, writing takes 2 seconds for me with RDFFormat.NQUADS.

The test data I used can be generated with this gendata.sh bash script:

#!/bin/bash
MAX=${1:-10}
echo "@prefix eg: <http://www.example.org/> ."
for i in `seq 1 $MAX`; do
  echo "<urn:graph-$i> { <urn:s-$i> eg:idx $i }"
done

Invoke the script with the number of named graphs to generate; in my case I used:

./gendata.sh 1000000 > test-data.trig
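
For anyone without a bash environment, the same test data could be produced from Java. This is only a sketch that mirrors what gendata.sh prints; the class name GenData and the method generate are hypothetical and not part of Jena.

```java
import java.io.PrintWriter;

public class GenData {
    // Emits the same TriG as gendata.sh: one prefix line,
    // then one named graph holding a single triple per line.
    static void generate(int max, PrintWriter out) {
        out.println("@prefix eg: <http://www.example.org/> .");
        for (int i = 1; i <= max; i++) {
            out.println("<urn:graph-" + i + "> { <urn:s-" + i + "> eg:idx " + i + " }");
        }
        out.flush();
    }

    public static void main(String[] args) {
        // Defaults to 10 graphs, like MAX=${1:-10} in the script.
        int max = args.length > 0 ? Integer.parseInt(args[0]) : 10;
        generate(max, new PrintWriter(System.out));
    }
}
```

Running it with the argument 1000000 and redirecting standard output to test-data.trig should give a file equivalent to the one used above.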

With the profiler I could trace the problem to code in TurtleShell.java, which repeatedly collects all one million graph names:

this.graphNames = (dsg != null) ? Iter.toSet(dsg.listGraphNodes()) : null ;
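
To illustrate why this does not scale, here is a minimal standalone sketch. It assumes the set of graph names is rebuilt once per graph written, which is what the profile suggests. It does not use Jena: GraphNameCost, collect, collectPerGraph and collectOnce are hypothetical names, and collect merely stands in for Iter.toSet(dsg.listGraphNodes()). Building the set once and reusing it is shown as one obvious remedy, not necessarily what the eventual fix does.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class GraphNameCost {
    // Hypothetical stand-in for Iter.toSet(dsg.listGraphNodes()):
    // copies every name into the target set and returns how many names were visited.
    static long collect(List<String> names, Set<String> target) {
        long visited = 0;
        for (String name : names) {
            target.add(name);
            visited++;
        }
        return visited;
    }

    // Rebuilds the set for each graph written: n graphs times n names.
    static long collectPerGraph(List<String> names) {
        long visited = 0;
        for (String ignored : names) {
            visited += collect(names, new HashSet<>());
        }
        return visited;
    }

    // Builds the set once, then reuses it for every graph: n names in total.
    static long collectOnce(List<String> names) {
        Set<String> cached = new HashSet<>();
        long visited = collect(names, cached);
        for (String graph : names) {
            if (!cached.contains(graph)) {
                throw new IllegalStateException("missing graph name: " + graph);
            }
        }
        return visited;
    }

    static List<String> graphNames(int n) {
        List<String> names = new ArrayList<>(n);
        for (int i = 1; i <= n; i++) {
            names.add("urn:graph-" + i);
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> names = graphNames(1000);
        System.out.println(collectPerGraph(names)); // prints 1000000
        System.out.println(collectOnce(names));     // prints 1000
    }
}
```

With one million graphs the per-graph variant would visit on the order of 10^12 names, which is consistent with a write that does not finish within hours.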

Issue Links

links to

GitHub Pull Request #696

GitHub Pull Request #697

People

Assignee: Andy Seaborne

Reporter: Claus Stadler

Votes: 0

Watchers: 3

Dates

Created: 22/Feb/20 03:46

Updated: 19/May/20 10:26

Resolved: 02/Mar/20 16:50

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 40m