  1. Apache Jena
  2. JENA-1848

Trig Writer slow; doesn't scale to many graphs

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Jena 3.14.0
    • Fix Version/s: Jena 3.15.0
    • Component/s: RIOT
    • Labels: None

    Description

      Loading 1,000,000 named graphs with the following code takes 1 minute on my notebook, but I aborted my attempt to write the data back out as TriG after several hours.

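      // Load the generated TriG file into an in-memory dataset
      Dataset ds = RDFDataMgr.loadDataset("test-data.trig");
      // Serialize as pretty-printed TriG, discarding the output
      RDFDataMgr.write(new NullOutputStream(), ds, RDFFormat.TRIG_PRETTY);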
      

      In comparison, writing takes 2 seconds for me with RDFFormat.NQUADS.

      The test data I used can be generated with this gendata.sh bash script:

      #!/bin/bash
      MAX=${1:-10}
      echo "@prefix eg: <http://www.example.org/> ."
      for i in $(seq 1 "$MAX"); do
        echo "<urn:graph-$i> { <urn:s-$i> eg:idx $i }"
      done
      

      Invoke the script with the number of named graphs to generate; in my case I used:

      ./gendata.sh 1000000 > test-data.trig
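
For anyone without a bash shell, the same test data could be produced from Java. This is only a sketch of a port of gendata.sh; the class name GenData and its methods are made up for illustration and are not part of Jena.

```java
import java.io.BufferedOutputStream;
import java.io.PrintStream;

// Hypothetical Java port of gendata.sh (illustration only, not part of Jena).
class GenData {

    /** Returns the TriG line for the i-th single-triple named graph. */
    static String graphLine(int i) {
        return "<urn:graph-" + i + "> { <urn:s-" + i + "> eg:idx " + i + " }";
    }

    /** Writes the prefix declaration followed by one named graph per index 1..max. */
    static void generate(int max, PrintStream out) {
        out.print("@prefix eg: <http://www.example.org/> .\n");
        for (int i = 1; i <= max; i++) {
            out.print(graphLine(i) + "\n");
        }
    }

    public static void main(String[] args) {
        // Same default as the script: 10 graphs when no argument is given.
        int max = args.length > 0 ? Integer.parseInt(args[0]) : 10;
        // Buffer stdout so that a million lines are not written one by one.
        PrintStream out = new PrintStream(new BufferedOutputStream(System.out), false);
        generate(max, out);
        out.flush();
    }
}
```

Running it with `java GenData.java 1000000 > test-data.trig` (single-file launch, Java 11 or later) should give the same file as the script.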
      

      With the profiler, I could trace the problem to code in TurtleShell.java that repeatedly collects all one million graph names:

      this.graphNames = (dsg != null) ? Iter.toSet(dsg.listGraphNodes()) : null ;
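
To illustrate why this hurts, here is a Jena-free sketch of the pattern. It assumes the set of graph names is rebuilt once per named graph written, which I have not verified beyond what the profiler showed. All class and method names are hypothetical stand-ins, and collecting the names once is just one possible remedy, not necessarily the change that went into Jena.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the writer; it only counts how many names get copied.
class GraphNameCollection {

    /** Builds n graph names of the same shape as the test data. */
    static List<String> graphNames(int n) {
        List<String> names = new ArrayList<>(n);
        for (int i = 1; i <= n; i++) {
            names.add("urn:graph-" + i);
        }
        return names;
    }

    /** Assumed slow pattern: rebuild the full name set for every graph written. */
    static long copiesWhenRecollecting(List<String> graphs) {
        long copied = 0;
        for (String g : graphs) {
            Set<String> names = new HashSet<>(graphs); // stands in for Iter.toSet(...)
            copied += names.size();
            if (!names.contains(g)) throw new IllegalStateException(g);
        }
        return copied;
    }

    /** Possible remedy: build the name set once and reuse it for every graph. */
    static long copiesWhenCollectingOnce(List<String> graphs) {
        Set<String> names = new HashSet<>(graphs);
        long copied = names.size();
        for (String g : graphs) {
            if (!names.contains(g)) throw new IllegalStateException(g);
        }
        return copied;
    }

    public static void main(String[] args) {
        List<String> graphs = graphNames(2000);
        System.out.println("recollecting:    " + copiesWhenRecollecting(graphs));
        System.out.println("collecting once: " + copiesWhenCollectingOnce(graphs));
    }
}
```

Under that assumption the work grows quadratically: n graphs cause n × n name copies, so one million graphs would mean on the order of 10^12 copies, whereas collecting once needs only n.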
      

          People

            Assignee: Andy Seaborne (andy)
            Reporter: Claus Stadler (Aklakan)

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 40m