  1. Clerezza (Retired)
  2. CLEREZZA-683

Indexed in-memory graph

Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: rdf.core
    • Labels: None

    Description

      Implementation of a TripleCollection that internally manages SPO, POS and OSP indexes for fast filtered iterators. The current state of development is hosted at http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/indexedgraph/. However, the intention is for this module to become a direct part of Clerezza.

        1. Background:

      For Apache Stanbol, fast filtered iterators over in-memory graphs are really important, because Stanbol uses in-memory graphs to store the metadata extracted for parsed ContentItems.
      When enhancing longer texts with EnhancementChain configurations that produce a lot of enhancements (e.g. keyword extraction based on dbpedia), such in-memory graphs can grow beyond 100k triples, especially if triples for suggested entities are also included in the result.

        1. Implementation:

      Because of that, I started to implement a TripleCollection that uses TreeMaps to manage SPO, POS and OSP indexes.
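
      The idea of keeping three differently sorted views over the same triples can be sketched as follows. This is a minimal illustration, not the Stanbol code: SketchTriple and TripleIndexSketch are hypothetical names, plain strings stand in for Clerezza Resource objects, and java.util.TreeSet is used for the sorted collections because the filter description mentions NavigableSet.

```java
import java.util.Comparator;
import java.util.NavigableSet;
import java.util.TreeSet;

/** Hypothetical stand-in for a Clerezza Triple; plain strings replace Resource objects. */
final class SketchTriple {
    final String s, p, o;
    SketchTriple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
}

/** Three sorted views over the same triples, each ordered for a different access path. */
final class TripleIndexSketch {
    static final Comparator<SketchTriple> SPO = Comparator
            .comparing((SketchTriple t) -> t.s).thenComparing(t -> t.p).thenComparing(t -> t.o);
    static final Comparator<SketchTriple> POS = Comparator
            .comparing((SketchTriple t) -> t.p).thenComparing(t -> t.o).thenComparing(t -> t.s);
    static final Comparator<SketchTriple> OSP = Comparator
            .comparing((SketchTriple t) -> t.o).thenComparing(t -> t.s).thenComparing(t -> t.p);

    final NavigableSet<SketchTriple> spo = new TreeSet<>(SPO);
    final NavigableSet<SketchTriple> pos = new TreeSet<>(POS);
    final NavigableSet<SketchTriple> osp = new TreeSet<>(OSP);

    /** Adds the triple to all three indexes; returns false if it was already present. */
    boolean add(SketchTriple t) {
        if (!spo.add(t)) return false;
        pos.add(t);
        osp.add(t);
        return true;
    }

    /** Removes the triple from all three indexes; returns false if it was absent. */
    boolean remove(SketchTriple t) {
        if (!spo.remove(t)) return false;
        pos.remove(t);
        osp.remove(t);
        return true;
    }
}
```

      Every write costs three sorted insertions instead of one, which is the price paid for being able to answer each filter pattern from a contiguous range of one index.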

      For fast sorting (the comparator), I use the same Resource#hashCode and Resource#toString based solution as the rdf.rdfjson serializer. I hope this is also sufficient for Literals (someone should check that).
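
      A comparator of this kind might look like the sketch below. It is written against java.lang.Object rather than the Clerezza Resource type, and HashThenStringOrder is a hypothetical name; the actual comparator in the module may differ in detail.

```java
import java.util.Comparator;

final class HashThenStringOrder {
    /** Orders objects by hashCode first; falls back to toString only when the hashes tie. */
    static final Comparator<Object> BY_HASH_THEN_STRING = (a, b) -> {
        int byHash = Integer.compare(a.hashCode(), b.hashCode());
        // The string comparison is the slow path and runs only on hash collisions.
        return byHash != 0 ? byHash : a.toString().compareTo(b.toString());
    };
}
```

      Such an ordering treats two objects as equal whenever both their hash code and their string form agree, so a sorted set would keep only one of them. Whether two distinct Literals can ever collide in both is exactly the open question raised above.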

      The implementation of the "filter(..)" method is based purely on "NavigableSet.subSet(..).iterator()". I only need to wrap the iterator to ensure that, on calls to Iterator.remove():

      1) Triples are removed from all three indexes
      2) GraphEvents are dispatched correctly
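
      The two guarantees above can be sketched with a small wrapping iterator. SyncedRemoveIterator is a hypothetical name, and the Consumer callback merely stands in for dispatching a GraphEvent; none of this is taken from the Stanbol sources.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

/** Wraps the iterator of one index so that remove() also updates the other indexes. */
final class SyncedRemoveIterator<T> implements Iterator<T> {
    private final Iterator<T> base;          // iterator over the index chosen for this filter
    private final List<Set<T>> otherIndexes; // remaining indexes that must stay in sync
    private final Consumer<T> onRemove;      // stand-in for dispatching a removal event
    private T current;

    SyncedRemoveIterator(Iterator<T> base, List<Set<T>> otherIndexes, Consumer<T> onRemove) {
        this.base = base;
        this.otherIndexes = otherIndexes;
        this.onRemove = onRemove;
    }

    @Override public boolean hasNext() { return base.hasNext(); }

    @Override public T next() {
        current = base.next();
        return current;
    }

    @Override public void remove() {
        base.remove(); // throws IllegalStateException if next() has not been called
        for (Set<T> index : otherIndexes) {
            index.remove(current);
        }
        onRemove.accept(current);
    }
}
```

      Because base.remove() runs first, an illegal call fails before any other index is touched or any event is sent.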

      Note also the trick with the two static fields UriRef MIN and UriRef MAX, which are used to generate the lower and upper bound triples passed to "NavigableSet.subSet(..)".
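
      To illustrate the bound trick: wildcard positions are filled with a sentinel that sorts below (MIN) or above (MAX) every real term, which turns a filter pattern into a closed range of the sorted set. The sketch below uses String arrays of length three instead of Clerezza triples, and BoundedFilterSketch, compareTerm and filter are hypothetical names. It shows only the SPO index, so it is correct only when the wildcards come after the bound terms.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableSet;

final class BoundedFilterSketch {
    // Sentinels recognised by identity (==), never by value; they are never stored in an index.
    static final String MIN = new String("MIN");
    static final String MAX = new String("MAX");

    /** Compares two terms, ranking MIN below and MAX above every real term. */
    static int compareTerm(String a, String b) {
        if (a == b) return 0;
        if (a == MIN || b == MAX) return -1;
        if (a == MAX || b == MIN) return 1;
        return a.compareTo(b);
    }

    /** Subject, then predicate, then object ordering over {s, p, o} arrays. */
    static final Comparator<String[]> SPO = (x, y) -> {
        for (int i = 0; i < 3; i++) {
            int c = compareTerm(x[i], y[i]);
            if (c != 0) return c;
        }
        return 0;
    };

    /** Range lookup on an SPO-sorted set; null is a wildcard and must not precede a bound term. */
    static List<String[]> filter(NavigableSet<String[]> spo, String s, String p, String o) {
        String[] low  = { s == null ? MIN : s, p == null ? MIN : p, o == null ? MIN : o };
        String[] high = { s == null ? MAX : s, p == null ? MAX : p, o == null ? MAX : o };
        return new ArrayList<>(spo.subSet(low, true, high, true));
    }
}
```

      With SPO, POS and OSP orderings available, every one of the eight filter patterns has an index in which its bound terms form a prefix (for example, [S,n,O] is a prefix range of OSP), so the same range lookup applies once the matching index is chosen.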

      The implementation is currently hosted on http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/indexedgraph/

      It has no dependencies on Apache Stanbol. However, users who do not want to check out Stanbol as a whole will need to edit the pom.xml file and provide the information usually inherited from the parent POMs.

        1. Tests:

      This implementation passes all MGraphTest unit tests.
      In addition, I have copied the tests defined for SimpleTripleCollection.

      To compare the performance, I also implemented code that

      • creates a random graph with n triples
      • creates a test case with configurable numbers of subjects, predicates and objects
      • then performs m calls to #filter(...)

      This performance test also runs as a unit test

      1. by using the SimpleMGraph implementation
      2. by using the IndexedMGraph implementation
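
      A rough sketch of such a test harness is given below. It is not the IndexedGraphTest code: FilterBenchmarkSketch and its methods are hypothetical, triples are String arrays, and only a linear-scan baseline is timed. The timing is naive (no JIT warm-up), and randomly generated duplicates are not removed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class FilterBenchmarkSketch {
    /** Builds n random triples drawn from fixed-size pools of subjects, predicates and objects. */
    static List<String[]> randomGraph(int n, int subjects, int predicates, int objects, long seed) {
        Random rnd = new Random(seed);
        List<String[]> triples = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            triples.add(new String[] {
                "urn:s" + rnd.nextInt(subjects),
                "urn:p" + rnd.nextInt(predicates),
                "urn:o" + rnd.nextInt(objects) });
        }
        return triples;
    }

    /** Linear-scan filter, the baseline an index is meant to beat; null is a wildcard. */
    static int countMatches(List<String[]> graph, String s, String p, String o) {
        int hits = 0;
        for (String[] t : graph) {
            if ((s == null || s.equals(t[0]))
                    && (p == null || p.equals(t[1]))
                    && (o == null || o.equals(t[2]))) {
                hits++;
            }
        }
        return hits;
    }

    /** Runs m filter calls and returns the elapsed wall-clock time in milliseconds. */
    static long timeFilters(List<String[]> graph, int m, String s, String p, String o) {
        long sink = 0; // consumed below so the loop cannot be optimised away
        long start = System.nanoTime();
        for (int i = 0; i < m; i++) {
            sink += countMatches(graph, s, p, o);
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        if (sink < 0) throw new IllegalStateException("unreachable");
        return elapsedMs;
    }
}
```

      Using a fixed seed keeps the generated graph identical across the two implementations under test, so the result counts can be compared as well as the timings.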

      NOTE: While implementing this, I noticed that the SimpleTripleCollectionTest does not extend MGraphTest, and therefore the SimpleTripleCollection class is not checked against the tests defined by MGraphTest. This might actually be an issue!

        1. Performance

      This is the log of one run of the performance test described above:

      2373 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - Filter Performance Test (graph size 100000 triples, iterations 1000)
      2373 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - — TEST SimpleMGraph with 100000 triples —
      10694 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,P,O] in 8321ms with 2 results
      18052 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,P,n] in 7358ms with 734 results
      25318 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,n,O] in 7266ms with 100 results
      31837 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [n,P,O] in 6519ms with 232 results
      39236 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,n,n] in 7398ms with 8030 results
      45170 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [n,P,n] in 5934ms with 8318000 results
      55836 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [n,n,O] in 10666ms with 2260 results
      55836 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - — TEST completed in 53463ms
      55836 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - — TEST IndexedMGraph 100000 triples —
      55856 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,P,O] in 20ms with 2 results
      55875 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,P,n] in 19ms with 734 results
      55908 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,n,O] in 33ms with 100 results
      55936 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [n,P,O] in 28ms with 232 results
      55957 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [S,n,n] in 21ms with 8030 results
      57022 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [n,P,n] in 1065ms with 8318000 results
      57030 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - ... run [n,n,O] in 8ms with 2260 results
      57030 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - — TEST completed in 1194ms
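
      Read as ratios, the timings in this log give an overall speedup of roughly 45x (53463ms versus 1194ms), ranging from about 5.6x for [n,P,n] to about 1333x for [n,n,O]. The small helper below recomputes these ratios from the logged values; SpeedupSketch is a hypothetical name used only for this illustration.

```java
final class SpeedupSketch {
    /** Ratio of baseline time to indexed time; values above 1 mean the index is faster. */
    static double speedup(long baselineMs, long indexedMs) {
        return (double) baselineMs / indexedMs;
    }

    public static void main(String[] args) {
        String[] patterns = { "[S,P,O]", "[S,P,n]", "[S,n,O]", "[n,P,O]", "[S,n,n]", "[n,P,n]", "[n,n,O]" };
        long[] simpleMs  = { 8321, 7358, 7266, 6519, 7398, 5934, 10666 }; // SimpleMGraph, from the log
        long[] indexedMs = { 20, 19, 33, 28, 21, 1065, 8 };               // IndexedMGraph, from the log
        for (int i = 0; i < patterns.length; i++) {
            System.out.printf("%s %.1fx%n", patterns[i], speedup(simpleMs[i], indexedMs[i]));
        }
        System.out.printf("total %.1fx%n", speedup(53463, 1194));
    }
}
```

      The [n,P,n] case gains the least, presumably because iterating over its 8318000 results dominates the run time regardless of how quickly the first match is located.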

      best
      Rupert

          People

            Assignee: Unassigned
            Reporter: Rupert Westenthaler (rwesten)