I have a need for an RDF library in several different projects, and really like the idea of a commons-rdf. However, I'm not sure the current proposal provides the functionality that is actually necessary.
How should Java code interact with RDF? The most common case will be via SPARQL. So a common SPARQL client library with a convenient API, support for persistent connections, SSL, basic auth, and so on, would be a very valuable thing.
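To make the idea concrete, here is one hedged sketch of the shape such a client API might take. Every name below is invented for illustration; this is not an existing library's API, and a real client would also handle SSL, auth, connection pooling, and streaming results.

```java
// Hypothetical sketch of a minimal SPARQL client API; every name here is
// invented for illustration and is not part of any existing library.
import java.util.List;
import java.util.Map;

public class SparqlClientSketch {

    /** A minimal client abstraction; a real one would add SSL, basic auth,
     *  persistent connections, timeouts, and streaming result handling. */
    public interface SparqlClient extends AutoCloseable {
        /** Each solution row maps a variable name to its lexical value. */
        List<Map<String, String>> select(String query);
        boolean ask(String query);
        void update(String update);
        @Override
        void close(); // narrowed from AutoCloseable so callers need no try/catch
    }
}
```

The point of keeping the surface this small is that an application could depend on the interface alone, with any HTTP stack behind it.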
Another case will be to parse (or generate) RDF. For this, a simple streaming interface would be perfect.
An API for accessing RDF as an object model I have to say I'm deeply skeptical of, for two reasons. The first reason is that it's very rarely a good idea. In the vast majority of cases, your data either is in a database or should be in database. SPARQL is the right answer in these cases.
The second reason is that I see many people adopting this API approach to RDF even when they obviously should not. The reason seems to be that developers want an API, and given an API that's what they choose. Even when, architecturally, this is crazy. As a point of comparison, it's very rare for people to interact with relational data via interfaces named Database, Relation, Row, etc. But for RDF this has somehow become the norm. Some of the triple store vendors (like Oracle) even boast of supporting the Jena APIs, even though one should under no circumstances use APIs of that kind to work with triples stored in Oracle.
So my fear is that an API like the one currently proposed will not only fail to provide the functionality that is most commonly needed, but also lead developers astray.
I guess this is probably not the most pleasant feedback to receive, but I felt it had to be said. Sorry about that.
No need to apologise (@wikier and I asked you to expand on your Twitter comments!)
From my perspective, I would love to port (and improve where necessary) RDFHandler from Sesame to Commons RDF. However, we felt that it was not applicable in the first version that we requested comments on, based on a very narrow scope of solely relying on the RDF-1.1 Abstract Model terminology.
As you point out, the level of terminology used in the Abstract Model is too low for common application usage. @afs has pointed out difficulties with using Graph as the actual access layer for a database. In Sesame, the equivalent Graph interface is never used for access to permanent data stores, only for in-memory filtering and analysis between the database and users, which happens fairly often in my applications so I am glad that it exists.
A good fast portable SPARQL client library would still need an object model to represent the results in, to send them to a typesafe API. Before we do that we wanted to get the object model to a relatively mature stage.
From this point on we have a few paths that we can follow to expand out to an RDF Streaming API and a SPARQL client library, particularly as we have a focus on Java-8 with Lambdas.
For example, we could have something like:
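(The concrete snippet is not shown in the thread; one hedged possibility, with the interface name and signature invented purely for illustration, is a single-method handler that Java-8 lambdas can implement directly:)

```java
// Purely illustrative: a single-method handler that Java-8 lambdas can
// implement directly (name and signature invented, not from the proposal).
@FunctionalInterface
public interface TripleHandler {
    void triple(String subject, String predicate, String object);
}
```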
Usage may be:
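(Again, the original snippet is not shown; as an illustrative stand-in, all names invented and not an actual Commons RDF API, a "parser" could push each triple to a lambda callback:)

```java
// Illustrative usage only: a stand-in "parser" pushing triples to a
// lambda callback (all names invented; not an actual Commons RDF API).
import java.util.List;
import java.util.function.Consumer;

public class StreamingUsageSketch {

    /** Minimal triple value carried on the stream. */
    public static final class Triple {
        public final String subject, predicate, object;
        public Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /** Stand-in for a streaming parser: pushes each parsed triple to the callback. */
    public static void parse(List<String[]> input, Consumer<Triple> handler) {
        for (String[] t : input) {
            handler.accept(new Triple(t[0], t[1], t[2]));
        }
    }
}
```

Usage would then be a one-liner: `parse(input, t -> doSomethingWith(t.subject));`.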
Could you suggest a few possible alternative models that would suit you and critique that model?
Commons-RDF allows an application to switch between implementations. A variation of @larsga's point is that SPARQL (languages and protocols) gives that separation already.
There are processing models that are not so SPARQL-amenable, such as some graph analytics (think map/reduce or RDD), where handling the data at the RDF 1.1 data model is important, and then the RDF graph does matter as a concept because the application wishes to walk the graph, following links.
What would make working with SPARQL easier, but does not need portability, is mini-languages that make SPARQL easier to write in programs, maybe specialised to particular usage patterns. There is no need for mega-toolkits everywhere.
@larsga - what's in your ideal RDF library?
(To Oracle, and others, the "Jena API" includes the SPARQL interface and then how to deal with the results.)
What's special about a common SPARQL client is that none seems to exist in Java at the moment. So if commons-rdf could provide one that would be great.
Getting results via JDBC may be preferable in some cases, but in general it's not ideal. How do you get the result as it really was in that case? With datatype URIs and language tags? How do you get the parsed results of CONSTRUCT queries? In addition, the API is not very convenient.
jena-jdbc requires jena-jdbc-core, which in turn requires ARQ, which then requires ... That's a non-starter. If I simply want to send SPARQL queries over HTTP having to pull in the entire Jena stack is just not on.
> There are processing models that are not so SPARQL-amenable, such as some graph analytics (think map/reduce or RDD), where handling the data at the RDF 1.1 data model is important, and then the RDF graph does matter as a concept because the application wishes to walk the graph, following links.
Yes. This is a corner case, though, and it's very far from obvious that a full-blown object model for graph traversal is the best way to approach this. Or that it will even scale. But never mind that.
What's missing in the Java/RDF space are the main tools you really need to build an RDF application in Java: a streaming API to parsers plus a SPARQL client. Something like this could be provided very easily in a very light-weight package, and would provide immense value.
An object model representing the RDF Data Model directly would, imho, do more harm than good, simply because it would mislead people into thinking that this is the right way to interact with RDF in general.
At the minimum, you don't need anything other than an HTTP client to retrieve JSON!
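To illustrate how little client-side machinery this requires, here is a plain-JDK sketch of the request-building half: URL-encode the query and ask for the standard SPARQL results JSON media type. (The endpoint URL in the test is a placeholder; nothing beyond `java.net` is used.)

```java
// A plain-JDK sketch of issuing a SPARQL SELECT over HTTP: the only
// client-side work is URL-encoding the query and setting an Accept header.
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class PlainHttpSparql {

    /** Standard media type for SPARQL SELECT/ASK results in JSON. */
    public static final String RESULTS_JSON = "application/sparql-results+json";

    /** Builds the GET request URL for a query against an endpoint. */
    public static String requestUrl(String endpoint, String query) {
        try {
            return endpoint + "?query=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new AssertionError("UTF-8 is always supported", e);
        }
    }
}
```

Opening the connection and parsing the JSON body is then a job for any HTTP client and any JSON parser.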
If you want to work in RDF concepts in your application, Jena provides streaming parsers plus a SPARQL client, as does Sesame. Each provides exactly what you describe! Yes, a system that was minimal would be smaller, but (1) is the size difference that critical (solution: strip down a toolkit)? Data is large, code is small. And (2) in what way is it not yet-another-toolkit, with all that goes with that?
OK, I think now I understand @larsga's point...
I do agree that SPARQL should in theory be such a "common interface". But what happens right now is that each library serializes the results using its own terms. So one of the goals of commons-rdf would be to align the interfaces there too.
Of course you could always say you can be decoupled by parsing the results yourself. But that has two problems: on the one hand, you are reimplementing code you should not need to, and probably making mistakes. On the other hand, that only works if your code is not going to be used by anyone else; as soon as it is, instead of solving a problem you are causing another one.
In case this helps the discussion: we discussed the idea of commons-rdf because in two consecutive weeks I had to deal with the same problem. I needed to provide a simple client library, and I realized that my decision about which library to use forced people to use that library too.
Those two client libraries are the Redlink SDK and MICO. Both with different purposes and different targets, but in the end dealing with the same problem.
Yes, this is getting closer to what I meant. As you say, a SPARQL client library is fine for stuff like ASK, SELECT, INSERT and so on. The problem is CONSTRUCT, or if you want to parse a file. However, even in those cases I do not want an in-memory representation of the resulting RDF. I want it streamed, kind of like SAX for XML. Then, if I need an in-memory representation I will build one from the resulting stream.
Now if you argue that there will be people for whom an in-memory representation is the best choice I guess that's OK. But I think it's wrong to force people to go via such a representation. Ideally, I'd like to see:
- a simple streaming interface,
- a simple abstraction for parsers and writers,
- a SPARQL client library that delivers RDF as callback streams.
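The streaming interface in that wishlist would work much like SAX: a handler receives start/statement/end events, and an in-memory model is just one possible handler. The sketch below is in the spirit of Sesame's RDFHandler, but all names here are illustrative, not a real API.

```java
// A SAX-style streaming sketch, in the spirit of Sesame's RDFHandler
// (interface and method names here are illustrative, not a real API).
import java.util.ArrayList;
import java.util.List;

public class StreamingSketch {

    /** Callback interface: a parser or CONSTRUCT result pushes events here. */
    public interface StatementHandler {
        default void start() {}
        void statement(String subject, String predicate, String object);
        default void end() {}
    }

    /** An optional in-memory model layered on top of the stream,
     *  built simply by streaming statements into it. */
    public static class CollectingHandler implements StatementHandler {
        public final List<String[]> statements = new ArrayList<>();
        @Override
        public void statement(String s, String p, String o) {
            statements.add(new String[] {s, p, o});
        }
    }
}
```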
If there also has to be a full-blown API with Graph, Statement, and the like, so be it. But it would IMHO be best if that were layered on top of the rest as an option, so that if you wanted you could build such a model by streaming statements into it, but you wouldn't be forced to go via those interfaces if you didn't want to.
I understand that Graph is an abstraction that many people do not need, particularly if they are streaming, but Statement seems to be a very useful abstraction in an object oriented language, and it should be very low cost to create, even if you are streaming.
As Andy says, both Sesame and Jena currently offer streaming parsers for both SPARQL Results formats and RDF formats, so what you describe is already possible in practice. The choice is just not interchangeable once you decide which library to use, which is why we stopped where we did so far: the current model is at least enough to get streaming parsers going.
All parts of the API are loosely linked at this point, with a clear theoretical model from RDF-1.1. Hence, you don't need to implement or use Graph if you just want a streaming API that accepts Statement or a combination of the available RDFTerms.
I think Statement seems like something that would be essential / useful - it's the smallest "functional" piece of RDF. (A use case where you want to iterate over parts of a Graph response in units smaller than triples seems weird to me - why not use a SELECT query then? But anyway.) Whether Graph gets its own Class/API, or whether Statement could be a (potentially implicit) quad instead, is probably where the different underlying libraries will have differing goals.
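The "potentially implicit quad" idea can be sketched as a cheap immutable value object where the graph name is optional, so a plain triple is simply a quad in the default graph. (Class and method names below are invented for illustration.)

```java
// Sketch of Statement as a potentially-implicit quad: the graph name is
// optional, so a plain triple is a quad in the default graph.
// (Class and method names invented for illustration.)
public final class QuadStatement {
    private final String subject, predicate, object;
    private final String graphName; // null means the default graph

    public QuadStatement(String s, String p, String o, String graphName) {
        this.subject = s;
        this.predicate = p;
        this.object = o;
        this.graphName = graphName;
    }

    /** A triple: implicitly in the default graph. */
    public QuadStatement(String s, String p, String o) {
        this(s, p, o, null);
    }

    public boolean isInDefaultGraph() { return graphName == null; }
    public String getSubject()   { return subject; }
    public String getPredicate() { return predicate; }
    public String getObject()    { return object; }
}
```

Being a small final class with three or four final fields, it should indeed be very low cost to create, even when streaming.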
Regarding the goals of the library to have common abstractions / vocabulary - I would bet most people using RDF are also using (at least some) SPARQL. You can build a generic interface for querying and streaming through results that covers both Jena and Sesame; I have done so in Clojure anyway, in my KR library. This requires more than just agreeing that results are in terms of the common RDFTerm class, though; as pointed out above, a common SPARQL API is needed to agree on how tuples or graphs etc. are returned and iterated over. But it wasn't that hard to do. Having the underlying library maintainers do it for me (possibly more efficiently) would have certainly been better. This goes beyond the scope of just defining core RDF terms, though.
I think the Graph concept is useful - not everyone is accessing pre-existing data on a pre-existing SPARQL server. For instance, a light-weight container for annotations might want to expose just a couple of Graph instances without exposing the underlying RDF framework. Someone who is generating RDF as a side-product can chuck their triples in a Graph and then pass it to arbitrary RDF framework for serialization or going to a LOD server.
I can see many libraries that would not use Graph, but could use the other RDFTerms.
This would be the case for OWLAPI for instance, which has Ontology as a core concept rather than a graph. Operations like Graph.add() don't make much sense in general there, as you have to serialize the ontology as RDF before you get a graph.
I don't think it should be a requirement for implementors to provide a Graph implementation - thus RDFTermFactory.createGraph() is optional.