Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

In a Linked Data environment servers have to fetch data off the web. The speed at which such data
is served can be very slow, so one wants to avoid using up one thread per connection (one thread
costs roughly 0.5 to 1 MB of memory). This is why Java NIO was developed, why servers such as Netty
are so popular, why HTTP client libraries such as https://github.com/sonatype/async-http-client are more
and more numerous, and why frameworks such as http://akka.io/ which support relatively lightweight
actors (around 500 bytes per actor) are growing more visible.

Unless I am mistaken, the only way to parse some content is via methods that take an
InputStream, such as this:

      val m = ModelFactory.createDefaultModel()
      m.getReader(lang.jenaLang).read(m, in, base.toString)

That read call blocks. Would it be possible to have an API that allows
one to parse a document in chunks as they arrive from the input?
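Such a chunk-feeding API might look like the following sketch. This is a hypothetical interface, not existing Jena API; the toy implementation merely buffers bytes and splits them into lines to stand in for real statement parsing, so the only point illustrated is the calling convention.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical chunk-feeding parser interface: the caller pushes bytes
// as they arrive from the network instead of handing over an InputStream.
interface ChunkedParser {
    void feed(byte[] chunk);   // called whenever data arrives
    void finish();             // called at end of stream
}

// Toy line-oriented implementation: buffers partial lines across chunks and
// emits each complete line as a "statement". A real RDF parser would
// tokenize properly; this only shows that chunk boundaries are arbitrary.
class LineStatementParser implements ChunkedParser {
    private final StringBuilder buffer = new StringBuilder();
    final List<String> statements = new ArrayList<>();

    public void feed(byte[] chunk) {
        buffer.append(new String(chunk, StandardCharsets.UTF_8));
        int nl;
        while ((nl = buffer.indexOf("\n")) >= 0) {
            statements.add(buffer.substring(0, nl));
            buffer.delete(0, nl + 1);
        }
    }

    public void finish() {
        if (buffer.length() > 0) statements.add(buffer.toString());
    }
}

public class ChunkDemo {
    public static void main(String[] args) {
        LineStatementParser p = new LineStatementParser();
        // Data arrives at arbitrary chunk boundaries, as it would from NIO.
        p.feed("<a> <b> <c> .\n<d> <e".getBytes(StandardCharsets.UTF_8));
        p.feed("> <f> .\n".getBytes(StandardCharsets.UTF_8));
        p.finish();
        System.out.println(p.statements.size());  // 2 complete statements
    }
}
```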

        Activity

        andy.seaborne Andy Seaborne added a comment -

Good timing - there have been initial discussions about readers recently on jena-users@. Please contribute and we can frame a more general JIRA.

The more usual idiom is "m.read(in, base)", but the general mechanism you describe can be used with actor frameworks: m.getReader creates a reader that the app can pass (in a closure-like setup) to an actor.

        The RIOT parsers output to a Sink<Triple> which allows different architectures. RIOT encapsulates parsing as an algorithm so that algorithm can be executed on a separate thread/actor.

        What rendezvous style would you suggest?

        This does not seem to be "priority major" and until there is a patch available I suggest not marking it as such.
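The rendezvous Andy mentions - RIOT's parsing algorithm pushing into a Sink on one thread while the application consumes on another - can be sketched like this. The Sink interface below is a simplified stand-in for RIOT's Sink<T> (the real one also has flush()), and "triples" are plain strings; only the thread handoff via a queue is illustrated.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Simplified stand-in for RIOT's Sink<T>; the real interface also has flush().
interface Sink<T> {
    void send(T item);
    void close();
}

public class PipedParseDemo {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
        final String POISON = "<<eof>>";

        // The sink hands each parsed item to the consumer's queue,
        // decoupling the parsing thread from the processing thread.
        Sink<String> sink = new Sink<String>() {
            public void send(String t) {
                try { queue.put(t); }
                catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            }
            public void close() { send(POISON); }
        };

        // "Parser" thread: stands in for a RIOT parser run on its own thread/actor.
        Thread parser = new Thread(() -> {
            sink.send("<a> <b> <c> .");
            sink.send("<d> <e> <f> .");
            sink.close();
        });
        parser.start();

        // Consumer side: drain triples until the end-of-stream marker.
        int count = 0;
        for (String t = queue.take(); !t.equals(POISON); t = queue.take())
            count++;
        parser.join();
        System.out.println(count);  // 2
    }
}
```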

        bblfish Henry Story added a comment -

With a bit of help from Damian I did get the RDF/XML parser to be asynchronous, using the com.fasterxml.aalto asynchronous parser [1].

I had to adapt Damian's jena.rdf.arp.StAX2SAX, which I called AsyncJenaParser [2]. This is then used by the URLFetcher class [3], which
extends the async-http-client by Ning [4] to fetch RDF.

        Currently it can only fetch RDF/XML, and with a bit more work, any XML format.

What is missing are the Turtle and JSON parsers.

        The URLFetcher could be a bit more general and just pass on the data it receives to some actors. That would remove the parser processing from the IO thread, and allow the fetcher to be more general.

        There is perhaps something here that can be integrated by Jena. The AsyncJenaParser perhaps?

        Henry

        [1] http://www.cowtowncoder.com/blog/archives/2011/03/entry_451.html
        [2] https://dvcs.w3.org/hg/read-write-web/file/aa9074df0635/src/main/java/patch/AsyncJenaParser.java
        [3] https://dvcs.w3.org/hg/read-write-web/file/d9c1f87eee55/src/main/scala/cache/WebFetcher.scala
        [4] all classes can be found in the build file https://dvcs.w3.org/hg/read-write-web/file/aa9074df0635/project/build.scala

        bblfish Henry Story added a comment -

        Ah, I forgot this was discussed on the mailing list here: http://mail-archives.apache.org/mod_mbox/incubator-jena-users/201201.mbox/%3C54563B60-702E-4748-B19E-9C3A0EDFBB1D%40bblfish.net%3E
        bblfish Henry Story added a comment -

        Now there is a non-blocking NTriples parser available here: https://github.com/betehess/pimp-my-rdf/blob/master/n-triples-parser/src/main/scala/Parser.scala
        bblfish Henry Story added a comment -

        And now a non-blocking Turtle parser is available here too: https://github.com/betehess/pimp-my-rdf/blob/d64ae11514f4bd8402c0857cb29c203ec821bd67/n3/src/main/scala/Turtle.scala with more detailed discussion on the W3C mailing list: http://lists.w3.org/Archives/Public/public-rdf-comments/2012Feb/0043.html
        andy.seaborne Andy Seaborne added a comment -

        Interesting stuff - I need to find a decent block of time to do more than just look.

        To go back to the title of this JIRA ...

What can be done to "support non-blocking parsers" in addition to the current parsers? It seems to me that the non-blocking parsers' scatter-gather paradigm is a separate subsystem on top of Jena - is there anything the core could provide to help?

What I'd like to see is that Jena does not need to include every feature possible, but can support independent and vibrant open source projects (the developers have already talked a bit about some simple modularity while delivering combined collections in useful forms for common cases, like a single jar with everything in it, or a single jar plus dependencies to make using the command-line tools much easier).

        (BTW the n-triples parser link is 404)

        bblfish Henry Story added a comment -

I am not sure what the best way is to change the Jena API for non-blocking parsers, nor whether anything needs to be done (yet). Essentially the way these parsers work is that one should be
able to parse chunks of data, get some partial results (a small set of triples) and feed those to a Jena graph or store. Feeding them to a Jena Graph, or popping statements into a store one at a time, is not a problem. The XML parser I did above shows that it can be done with the Jena RDF/XML parsers, and the Turtle parser shows how one can do it with other frameworks that use Jena: after all, the Turtle parser tests can add triples to Jena or Sesame graphs.

But I think awareness of this problem should help guide the direction of your thinking when developing new parsers, and of what is needed to work with linked data in an efficient way.

        Out of doing this a few times an API will probably emerge.

Currently I have a simple blocking interface API for the non-blocking parser:
https://github.com/betehess/pimp-my-rdf/blob/248c8a13567e589308d1b7999570a14d6b530b20/n3/src/main/scala/TurtleReader.scala

We all know this API. I need to find out how people in the actors community do this, and see what kind of pattern they agree is good. If I find that,
I'll post it here. Perhaps that will lead to some ideas of what such a pattern looks like.

(The NTriples file moved. Here is the current snapshot link, which should be a permalink:
https://github.com/betehess/pimp-my-rdf/blob/248c8a13567e589308d1b7999570a14d6b530b20/n3/src/main/scala/NTriples.scala , but it won't necessarily be the most up-to-date one.)

        I'll keep you posted on further developments. I should try using these parsers in a real scenario soon, so I'll soon know how well this holds up.

        andy.seaborne Andy Seaborne added a comment -

I can imagine it has no impact on the existing Jena API. I think model.read() is the wrong way round and it should be ReadEngine.read(model). The FileManager has that design; I'm imagining "ReadEngine.many(model, set of places, waitForAll?)" would be the way to get from 500 places at once.

The new WebReader I have ready to go also does that: there is only one RDF reader for all syntaxes, and model.read routes to it. The syntax is merely a hint, and only one of several pieces of information used to determine the specific parser.

        If your parser does encounter a large, fast reply stream - how fast does it parse?
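The "ReadEngine.many" idea floated above could be sketched with CompletableFuture. None of these names are real Jena API, and the fetch-and-parse step is faked; the sketch only shows reads from many places running concurrently and being merged into one shared collection.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

public class ReadEngineSketch {
    // Stand-in for a parsed model: a synchronized set of "triples".
    static Set<String> model = Collections.synchronizedSet(new HashSet<>());

    // Fake fetch-and-parse of one place; a real engine would do
    // asynchronous HTTP plus RIOT parsing here.
    static CompletableFuture<Void> read(String place) {
        return CompletableFuture.runAsync(() ->
            model.add("<" + place + "> <loadedFrom> <" + place + "> ."));
    }

    // Hypothetical "ReadEngine.many": start all reads at once and
    // optionally wait until every one of them has finished.
    static void many(List<String> places, boolean waitForAll) {
        CompletableFuture<?>[] fs = places.stream()
                .map(ReadEngineSketch::read)
                .toArray(CompletableFuture[]::new);
        if (waitForAll) CompletableFuture.allOf(fs).join();
    }

    public static void main(String[] args) {
        many(Arrays.asList("http://ex.org/a", "http://ex.org/b"), true);
        System.out.println(model.size());  // 2
    }
}
```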

        bblfish Henry Story added a comment -

I think at present it is a lot slower than the Jena and Sesame readers. There is probably still (I hope) a lot of optimisation that can be done... I learnt a lot doing it, but one does get to see in the end what the advantages of XML are...

        bblfish Henry Story added a comment -

"A lot slower" means it is currently 10x slower. Small changes can make big differences in such parsers, but I won't have the time to tweak it now. If people would like to see how much they can improve it, they are welcome to.

        claudenw Claude Warren added a comment -

I was pondering this problem recently and was wondering about creating a new polling iterator class whose hasNext() returns TRUE, FALSE or NULL - NULL meaning "no data yet".

The idea is that each endpoint would be a thread fronted by a polling iterator that would plug into a polling-iterator worker/pool/what-have-you. The worker/pool/what-have-you would poll the endpoints until it got a TRUE or FALSE. On TRUE it would return true for hasNext(), and next() would return the result from the same endpoint. On FALSE it would remove the endpoint from the pool; after the last endpoint is removed, it returns FALSE. On NULL it would move on to the next endpoint in the pool, cycling back to the start when it reached the end.

        This should allow results from slower endpoints to be intermixed with results from faster endpoints and should increase the speed (decrease the time) to get all results.
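The three-valued hasNext() idea could be sketched as follows. All names here are hypothetical; TRUE/FALSE/NULL map onto java.lang.Boolean, and the fake endpoints simulate slow and fast responders so the round-robin interleaving is visible.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// hasNext() is Boolean, not boolean, so null can mean "no data yet".
interface PollingIterator<T> {
    Boolean hasNext();
    T next();
}

// Round-robin pool: cycles over endpoints, re-queuing those that answer null
// and dropping exhausted ones, so a slow endpoint never blocks a fast one.
class PollingPool<T> implements PollingIterator<T> {
    private final Deque<PollingIterator<T>> endpoints;
    private PollingIterator<T> ready;

    PollingPool(List<PollingIterator<T>> eps) { endpoints = new ArrayDeque<>(eps); }

    public Boolean hasNext() {
        while (!endpoints.isEmpty()) {
            PollingIterator<T> ep = endpoints.poll();
            Boolean state = ep.hasNext();
            if (state == null) { endpoints.add(ep); continue; }  // not ready: cycle on
            if (state) { ready = ep; endpoints.addFirst(ep); return true; }
            // state == FALSE: exhausted endpoint stays removed
        }
        return false;
    }

    public T next() { return ready.next(); }
}

// Fake endpoint: answers "not yet" for the first `delay` polls, then yields.
class FakeEndpoint implements PollingIterator<String> {
    private final Iterator<String> data;
    private int delay;
    FakeEndpoint(int delay, String... items) { this.delay = delay; data = Arrays.asList(items).iterator(); }
    public Boolean hasNext() {
        if (delay > 0) { delay--; return null; }
        return data.hasNext() ? Boolean.TRUE : Boolean.FALSE;
    }
    public String next() { return data.next(); }
}

public class PollDemo {
    public static void main(String[] args) {
        PollingPool<String> pool = new PollingPool<>(Arrays.asList(
            new FakeEndpoint(3, "s1", "s2"),    // slow endpoint
            new FakeEndpoint(0, "f1", "f2")));  // fast endpoint
        List<String> out = new ArrayList<>();
        while (Boolean.TRUE.equals(pool.hasNext())) out.add(pool.next());
        System.out.println(String.join(" ", out));  // f1 f2 s1 s2
    }
}
```

Note that when every endpoint answers null the pool busy-waits; a real implementation would want a backoff or a wakeup signal there.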

        bblfish Henry Story added a comment -

I think the correct data structure to look at is the Iteratee from functional programming.

Here I wrote an RDFIteratee trait that has two implementations: one synchronous (JenaSyncRDFIteratee) and the other asynchronous (JenaRdfXmlAsync):

        https://github.com/bblfish/Play20/blob/webid/framework/src/webid/src/main/scala/webid/rdf/RDFIteratee.scala

This can then be used to write some very elegant code which can evolve as better asynchronous parsers come along:

        https://github.com/bblfish/Play20/blob/webid/framework/src/webid/src/main/scala/webid/GraphCache.scala#L133

        More documentation on Iteratees

        https://github.com/playframework/Play20/wiki/Iteratees

        Henry
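The core of the iteratee idea can be sketched in a few lines of Java (an assumption-level simplification of the Play 2.0 design linked above): an iteratee is either Done, carrying its result, or Cont, carrying the function that consumes the next input chunk and returns the iteratee's next state. The enumerator side then just pushes chunks at whatever state it holds.

```java
import java.util.Arrays;
import java.util.function.Function;

// An iteratee over inputs I producing a result O: Done or Cont.
interface Iteratee<I, O> {}

final class Done<I, O> implements Iteratee<I, O> {
    final O result;
    Done(O result) { this.result = result; }
}

final class Cont<I, O> implements Iteratee<I, O> {
    final Function<I, Iteratee<I, O>> step;
    Cont(Function<I, Iteratee<I, O>> step) { this.step = step; }
}

public class IterateeDemo {
    // An iteratee that counts '.' statement terminators; a null chunk means EOF.
    static Iteratee<String, Integer> counter(int soFar) {
        return new Cont<String, Integer>(chunk -> {
            if (chunk == null) return new Done<String, Integer>(soFar);
            int n = soFar;
            for (char c : chunk.toCharArray()) if (c == '.') n++;
            return counter(n);  // state is carried in the next Cont
        });
    }

    public static void main(String[] args) {
        // Enumerator side: feed chunks (arbitrary boundaries) to the iteratee.
        Iteratee<String, Integer> it = counter(0);
        for (String chunk : Arrays.asList("<a> <b> <c> .", " <d> <e>", " <f> .", null)) {
            if (it instanceof Cont) it = ((Cont<String, Integer>) it).step.apply(chunk);
        }
        System.out.println(((Done<String, Integer>) it).result);  // 2
    }
}
```

Because each step returns a new immutable state, the same iteratee value can be suspended between chunks and resumed on any thread, which is what makes the pattern fit non-blocking IO.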

        andy.seaborne Andy Seaborne added a comment -

        Needs a separate API (and implementation).


          People

          • Assignee: Unassigned
          • Reporter: bblfish Henry Story
          • Votes: 0
          • Watchers: 1
