Apache Jena / JENA-205

Streaming results for CONSTRUCT queries

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ARQ, Fuseki
    • Labels:
      None

      Description

      It would be useful to have CONSTRUCT queries that streamed results. An additional method on QueryExecution that returned an Iterator<Statement> (or something similar to that [1]) would provide the necessary access.

      Implementation-wise, the application of Bindings to the CONSTRUCT template is already streaming; we would simply need to perform a distinct operation on the Triples that are created. We could use a DistinctDataNet to get semi-streaming with spill-to-disk functionality.

      Additionally, for this to be useful for Fuseki, we also need an RDF/XML serializer that can operate on an Iterator<Statement> instead of a Model.

      [1] Prefix mappings would probably be nice for serializers that consume this iterator.
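
As an illustration (not from the original thread), the idea in the description can be sketched with plain Java collections: each query solution is substituted into the template lazily, and duplicate removal is a separate, optional wrapper. The class name `StreamingConstruct`, the list-of-strings triple model, and the "?"-prefixed variable convention are assumptions made for this sketch; it does not use ARQ's Binding or Triple classes, and the in-memory `HashSet` merely stands in for a spill-to-disk structure.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Set;

/** Toy model: a triple is a list of three strings; template terms starting with "?" are variables. */
final class StreamingConstruct {

    /** Lazily substitutes each solution into the template; duplicates are kept. */
    static Iterator<List<String>> constructTriples(List<List<String>> template,
                                                   Iterator<Map<String, String>> solutions) {
        return new Iterator<List<String>>() {
            private Map<String, String> current = null; // solution being expanded
            private int pos = 0;                         // next template pattern to try
            private List<String> next = null;            // look-ahead triple

            @Override public boolean hasNext() {
                while (next == null) {
                    if (current == null) {
                        if (!solutions.hasNext()) return false;
                        current = solutions.next();
                        pos = 0;
                    }
                    if (pos >= template.size()) { current = null; continue; }
                    // substitute() returns null when a variable is unbound: skip that triple
                    next = substitute(template.get(pos++), current);
                }
                return true;
            }

            @Override public List<String> next() {
                if (!hasNext()) throw new NoSuchElementException();
                List<String> t = next;
                next = null;
                return t;
            }
        };
    }

    /** Returns the instantiated triple, or null if any variable in the pattern is unbound. */
    static List<String> substitute(List<String> pattern, Map<String, String> binding) {
        List<String> out = new ArrayList<>(3);
        for (String term : pattern) {
            String value = term.startsWith("?") ? binding.get(term) : term;
            if (value == null) return null;
            out.add(value);
        }
        return out;
    }

    /** Optional duplicate filter; the in-memory set is a stand-in for a spill-to-disk bag. */
    static Iterator<List<String>> distinct(Iterator<List<String>> in) {
        Set<List<String>> seen = new HashSet<>();
        return new Iterator<List<String>>() {
            private List<String> next = null;

            @Override public boolean hasNext() {
                while (next == null && in.hasNext()) {
                    List<String> t = in.next();
                    if (seen.add(t)) next = t; // add() is false for an already-seen triple
                }
                return next != null;
            }

            @Override public List<String> next() {
                if (!hasNext()) throw new NoSuchElementException();
                List<String> t = next;
                next = null;
                return t;
            }
        };
    }
}
```

Keeping `distinct` as a wrapper means a caller that tolerates duplicates pays nothing for it, while a caller that needs set semantics can opt in.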

          Activity

          Andy Seaborne added a comment -

          Personally, I'd go for Iterator<Triple>. Statements are tied to model; instead, provide a stream of things that are not tied to models AKA triples (and quads if we extend CONSTRUCT to have GRAPH like SPARQL Updates).

          I also think that making it distinct should be left out or at most optional. This is a low-level interface.

          N-triples is easy to stream out, as is Turtle. RDF/XML is hard enough to warrant not bothering with for now. I think that Turtle will quickly displace RDF/XML.

          For Fuseki, a trick would be to stream back triples with duplicates - it's still legal RDF in all serializations to have duplicates. Then the server is not burdened.
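
As an illustration (not from the original thread), streaming N-Triples needs no state between triples, which is why duplicates can simply pass through. The sketch below assumes each triple arrives as three already-serialized N-Triples terms; the class name `NTriplesStreamWriter` and its methods are hypothetical and are not the RIOT writer API.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Iterator;
import java.util.List;

/** Writes one N-Triples line per triple as soon as it is pulled from the iterator. */
final class NTriplesStreamWriter {

    /** Each triple is three pre-serialized terms, e.g. "<http://example/s>". Returns the line count. */
    static long write(Writer out, Iterator<List<String>> triples) {
        long count = 0;
        try {
            while (triples.hasNext()) {
                List<String> t = triples.next();
                out.write(t.get(0));
                out.write(' ');
                out.write(t.get(1));
                out.write(' ');
                out.write(t.get(2));
                out.write(" .\n");
                count++; // no buffering and no duplicate check: memory use stays constant
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    /** Serializes a plain literal, escaping backslash, double quote, LF and CR. */
    static String literal(String lexical) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < lexical.length(); i++) {
            char c = lexical.charAt(i);
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                default:   sb.append(c);
            }
        }
        return sb.append('"').toString();
    }
}
```

A client that loads the output into any RDF graph collapses the repeated lines automatically, so the server never has to hold the result set.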

          Rob Vesse added a comment -

          Streaming RDF/XML is not that hard if you are willing to either have very verbose output i.e. a <rdf:Description> element per triple or to do some limited buffering and write out in small batches.

          You'd just have to be careful that you always add rdf:nodeID where blank nodes are involved so that you don't rely on anonymous identifiers at all.

          I'd very much like to see a streaming API for this as I have at least one piece of code where I want that capability and haven't had the time to try and implement it myself yet. Iterator<Triple> would be my preference as well and I agree that doing the distinct on the server is unnecessary.
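
As an illustration (not from the original thread), the verbose variant described above can be sketched as one `<rdf:Description>` element per triple, with `rdf:nodeID` on every blank node. The `Term`, `Triple` and `RdfXmlStreamWriter` types are hypothetical toy classes, not Jena's. The sketch assumes plain literals only (no datatype or language tag), predicate IRIs that split after the last `#` or `/` into a valid XML local name, and blank node labels that are valid XML names.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Iterator;
import java.util.List;

/** Streams RDF/XML with one rdf:Description element per triple. */
final class RdfXmlStreamWriter {
    enum Kind { IRI, BNODE, LITERAL }

    static final class Term {
        final Kind kind;
        final String value; // IRI string, blank node label, or lexical form
        Term(Kind kind, String value) { this.kind = kind; this.value = value; }
    }

    static final class Triple {
        final Term s, p, o;
        Triple(Term s, Term p, Term o) { this.s = s; this.p = p; this.o = o; }
    }

    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    static void write(Writer out, Iterator<Triple> triples) {
        try {
            out.write("<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\">\n");
            while (triples.hasNext()) {
                Triple t = triples.next();
                // Split the predicate IRI into namespace and local name (assumed to be a valid XML name).
                String p = t.p.value;
                int cut = Math.max(p.lastIndexOf('#'), p.lastIndexOf('/')) + 1;
                if (cut == 0 || cut == p.length())
                    throw new IllegalArgumentException("Cannot split predicate: " + p);
                String ns = p.substring(0, cut);
                String local = p.substring(cut);

                out.write("  <rdf:Description " + nodeAttr("rdf:about", t.s) + ">\n");
                // Declaring the namespace on the property element keeps every triple self-contained.
                out.write("    <p:" + local + " xmlns:p=\"" + esc(ns) + "\"");
                if (t.o.kind == Kind.LITERAL) {
                    out.write(">" + esc(t.o.value) + "</p:" + local + ">\n");
                } else {
                    out.write(" " + nodeAttr("rdf:resource", t.o) + "/>\n");
                }
                out.write("  </rdf:Description>\n");
            }
            out.write("</rdf:RDF>\n");
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Blank nodes always get an explicit rdf:nodeID; IRIs use the given attribute name. */
    static String nodeAttr(String iriAttr, Term node) {
        String name = node.kind == Kind.BNODE ? "rdf:nodeID" : iriAttr;
        return name + "=\"" + esc(node.value) + "\"";
    }

    /** Minimal XML escaping for text and double-quoted attribute values. */
    static String esc(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}
```

The output is far larger than what a model-based writer would produce, but the writer holds only the current triple in memory.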

          Stephen Allen added a comment -

          That makes sense about Triples instead of Statements.

          I note the following statement from the SPARQL spec [1]:

          "The CONSTRUCT query form returns a single RDF graph specified by a graph
          template. The result is an RDF graph formed by taking each query solution in
          the solution sequence, substituting for the variables in the graph template, and
          combining the triples into a single RDF graph by set union."

          The last sentence specifically calls for a set union rather than a bag union. Wouldn't having duplicates be non-standard? Not that I think it matters too much as you say, since most people will probably put the results into some sort of RDF data model.

          Would it make sense for a query language extension like "CONSTRUCT" and "CONSTRUCT DISTINCT" to specify which you wanted? Although given that it is already defined as a set in SPARQL 1.0, it might have to be something ugly like "CONSTRUCT <something>" where something could be "BAG" or "REDUCED" or "DUPLICATES_ALLOWED"

          Additionally, I can imagine future features like CONSTRUCT JSON, where it could be more plausible that you definitely want to eliminate duplicates, since that feature could be used without having a heavyweight RDF library available to parse the results.

          [1] http://www.w3.org/TR/rdf-sparql-query/#construct

          Andy Seaborne added a comment -

          I guess I don't see execTriples() on the same level as execConstruct(). I see it as a lower level mechanism to give streaming where Iterator<Triple> is not claiming to be a graph. That is, it is knowingly non-standard.

          We could usefully change execConstruct() (but not execConstruct(Model)) so it uses the data bag machinery.

          UNREDUCED would be a possible keyword choice.

          Stephen Allen added a comment -

          I've added execConstructTriples() to ARQ to provide streaming support for both local and remote query execution (revision 1305184).

          Last thing left to do is add streaming to Fuseki. The bulk of the work here would involve modifying existing or creating new RDF writers to accept Iterator<Triple> instead of a Model.

          Hudson added a comment -

          Integrated in Jena_ARQ #517 (See https://builds.apache.org/job/Jena_ARQ/517/)
          JENA-205 (Streaming results for CONSTRUCT queries). Added support to ARQ for streaming CONSTRUCT queries for both local and remote query execution. (Revision 1305184)

          Result = SUCCESS
          sallen :
          Files :

          • /incubator/jena/Jena2/ARQ/trunk/src/main/java/com/hp/hpl/jena/query/QueryExecution.java
          • /incubator/jena/Jena2/ARQ/trunk/src/main/java/com/hp/hpl/jena/sparql/engine/QueryExecutionBase.java
          • /incubator/jena/Jena2/ARQ/trunk/src/main/java/com/hp/hpl/jena/sparql/engine/http/QueryEngineHTTP.java
          • /incubator/jena/Jena2/ARQ/trunk/src/main/java/com/hp/hpl/jena/sparql/modify/TemplateLib.java
          • /incubator/jena/Jena2/ARQ/trunk/src/main/java/org/openjena/riot/RiotParsePuller.java
          • /incubator/jena/Jena2/ARQ/trunk/src/main/java/org/openjena/riot/RiotQuadParsePuller.java
          • /incubator/jena/Jena2/ARQ/trunk/src/main/java/org/openjena/riot/RiotTripleParsePuller.java
          Andy Seaborne added a comment -

          Looks good.

          I moved the Sink-to-queue into the general library.

          More of an observation; not a proposal to do anything:

          The Turtle/TriG parsers could provide a pull interface if they parsed small blocks, that is from subject up to a DOT, and served those out of a queue, making them like N-Triples and N-Quads. Then the pull arrangement would not need an intermediate thread. The potential downside of no buffering above the HTTP level is avoided because the tokenizer has a large buffer.

          But this would be a bit of reorganization of the TurtleBase parser – triplesSameSubject would need to be turned into a capturing loop that may affect performance of push mode.
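
As an illustration (not from the original thread), the block-at-a-time pull arrangement can be sketched as an iterator that refills a small queue on the caller's thread whenever it runs dry. `BlockPullIterator` and its `BlockParser` hook are hypothetical names for this sketch; they do not correspond to existing RIOT classes, and the real parser change described above is not shown.

```java
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Queue;
import java.util.function.Consumer;

/** Pull-style iterator over a parser that can emit one small block of items at a time. */
final class BlockPullIterator<T> implements Iterator<T> {

    /**
     * Hypothetical parser hook: parse one block (for Turtle, a subject up to the DOT),
     * push its items into the sink, and return false once the input is exhausted.
     */
    interface BlockParser<T> {
        boolean parseBlock(Consumer<T> sink);
    }

    private final BlockParser<T> parser;
    private final Queue<T> queue = new ArrayDeque<>();
    private boolean finished = false;

    BlockPullIterator(BlockParser<T> parser) {
        this.parser = parser;
    }

    @Override public boolean hasNext() {
        // Refill on the caller's thread; no producer thread or blocking queue is involved.
        while (queue.isEmpty() && !finished) {
            if (!parser.parseBlock(queue::add)) finished = true;
        }
        return !queue.isEmpty();
    }

    @Override public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        return queue.remove();
    }
}
```

The queue never holds more than one block, so memory use is bounded by the largest subject block in the input rather than by the whole document.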

          Hudson added a comment -

          Integrated in Jena_ARQ #526 (See https://builds.apache.org/job/Jena_ARQ/526/)
          JENA-205 (Streaming results for CONSTRUCT queries). Added iterator versions of NQuadsWriter and NTriplesWriter. (Revision 1306717)

          Result = SUCCESS
          sallen :
          Files :

          • /incubator/jena/Jena2/ARQ/trunk/src/main/java/org/openjena/riot/RiotWriter.java
          • /incubator/jena/Jena2/ARQ/trunk/src/main/java/org/openjena/riot/out/NQuadsWriter.java
          • /incubator/jena/Jena2/ARQ/trunk/src/main/java/org/openjena/riot/out/NTriplesWriter.java
          Andy Seaborne added a comment -

          Is this finished and closable? Or should we split out the remaining items and create a new JIRA?

          Stephen Allen added a comment -

          Streaming CONSTRUCT queries are now supported at an API level. Therefore this issue is FIXED.

          There is a related issue of allowing Fuseki to stream the results back to the client. That is tracked in JENA-329.


            People

            • Assignee: Stephen Allen
            • Reporter: Stephen Allen
            • Votes: 1
            • Watchers: 2
