Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      A Semantic Web CMS like Clerezza will largely be about fetching data from the web and using it to create interesting services. Fetching remote graphs should therefore be a simple and very reliable service. The service should act as a Semantic Web proxy/cache. It should

      • be able to fetch a remote resource
      • return a locally cached version if the remote resource has not been updated
        (this implies it should understand the logic of HTTP ETags, valid-until, and so on)
      • keep track of redirects
      • keep track of which resources are information resources and which are not (e.g. http://xmlns.com/foaf/0.1/knows is not an information resource but a relation, and so redirects to the ontology)
      • allow the user to specify whether a fresh version should be fetched remotely, or to force use of the local version
      • return a graph of that remote resource
      • also return a message if the resource does not exist or is unavailable
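The caching rules in the list above (ETags, valid-until, conditional fetching) can be sketched as a small per-resource cache record. This is a minimal sketch, not a Clerezza API; CacheEntry, isFresh and ifNoneMatchHeader are illustrative names.

```java
import java.time.Instant;
import java.util.Optional;

// Sketch of the metadata a semantic web cache would keep per remote resource.
class CacheEntry {
    private final String etag;        // ETag the remote server returned, may be null
    private final Instant validUntil; // derived from Expires / Cache-Control, may be null

    CacheEntry(String etag, Instant validUntil) {
        this.etag = etag;
        this.validUntil = validUntil;
    }

    // Fresh entries can be served without contacting the remote server at all.
    boolean isFresh(Instant now) {
        return validUntil != null && now.isBefore(validUntil);
    }

    // Stale entries are revalidated with a conditional GET: send this value as
    // If-None-Match, and the server replies 304 if the resource is unchanged.
    Optional<String> ifNoneMatchHeader() {
        return Optional.ofNullable(etag);
    }
}
```

A "force remote" request would simply skip the isFresh check; "force local" would return the cached graph even when stale.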

      Longer term:

      • be able to return graphs for how resources were in the past
      • fetch graphs as a given user, so that it can authenticate with WebID to remote resources and get additional information
      • know how to get GRDDL transforms to make any XML easily transformable into graphs

      In my latest 'mistaken' checkin (r1081290, which really should have been a development branch, but it's easier to fix now than to undo) this role is taken by org.apache.clerezza.platform.users.WebDescriptionProvider, as a large part of this was correctly done there by Reto. So the proposal is that the proxy part of the WebDescriptionProvider should be moved to its own module, and that the WebDescriptionProvider should use that proxy service.

      This service will be needed for fetching web pages on the web. It should be built to be efficient and parallelisable. Perhaps Scala Actors are the right thing to use here (I am looking into this).

      Since this service should be usable by SSPs that need to use remote data, it should have a class containing a fetch() method that implements the WebRendering function https://issues.apache.org/jira/browse/CLEREZZA-356

        Activity

        Reto Bachmann-Gmür added a comment -

        The resolution of CLEREZZA-525 refactored code from this issue to provide a web storage provider. The storage.web bundle seems to cover most of the needs that this issue wanted to address.

        The patches of this issue do not only create a SemWebProxy bundle but also integrate it in many places. This would have been better done in a separate issue after the WebProxy (the actual goal of this issue) was closed and accepted. The various dependencies on the rdf.web proxy have been removed with the resolution of CLEREZZA-531.

        Reto Bachmann-Gmür added a comment -

        Reopening because of:

        • A bundle with an artifact id starting with rdf must not access platform services; the SCB project with artifact ids rdf.* does not require the platform. If the WebProxy requires the platform, its artifact id should start with platform.*
        • For most applications it should be transparent whether a graph is local or retrieved from the web, so the WebProxy should implement WeightedTcProvider; such an implementation has been committed in CLEREZZA-525. For such an implementation to be able to rely on the other TcProviders for actual storage of the graph, the cache graph must have a different name than the cached graph
        • There's an unclear mixture between resources and graphs; getting a resource description is orthogonal to getting graphs
        • It's out of the scope of this issue to provide a WebRenderingService; if Renderlets should have (limited) access to graphs and proxies, this should be done by an issue creating a platform.<something> artifact

        Concrete suggestion:

        • This issue should be renamed to "Make WebIdService use proxy" and depend on CLEREZZA-525 (which for pedantic reasons should be renamed as well)
        • The module rdf.web.proxy.core should be removed, or maybe renamed and rescoped to be a resource-description service returning a resource (i.e. a GraphNode).
        Henry Story added a comment -

        The issue was resolved. I closed it, as it would otherwise remain open forever. Not all of the points have been completed, but the proxy is useful as it is, and improvements can be filed as separate issues.

        Henry Story added a comment -

        WebProxy works and is useful.

        Henry Story added a comment -

        The main part of this issue has been solved. Clearly the WebProxy bundle has a lot of evolving it can do. It works well enough now to be used on small demo projects, which is what it is for. The remaining issue here is CLEREZZA-489, Naming of Graphs, which is important in itself but should not be tied to the web proxy bundle.

        Henry Story added a comment -

        " the local caches I think should be named differently from the remote graph."

        Ideally, local graphs should have relative URIs. Then the coder would not need to know the local deployment hostname when developing his code. This is similar to how RDF/XML allows one to write relative URIs in the XML, or how one can use relative URLs in N3, or indeed in HTML: these allow an editor to see how his documents link together on the file system before someone publishes them on a server. In fact, that is also how JSR 311 works. If this is found to be problematic with current RDF implementations, then at least the API could hide whatever system is used to work around limitations in RDF stores, such as prepending http://zz.localhost to a relative URI.

        But in a world where local graphs had only relative URIs, things would be good locally, yet communication with the external world would suffer, because the local system would not know when foreign servers were speaking about it. To be aware of global communication, the local system does need to know where foreign documents link to local documents. Though this is not such a big deal either.

        Remote graphs, as I mentioned, are most easily named after the remote resource from which they come, especially in a graph database. One could decide that every graph stored locally is just a temporary representation of a remote graph, which would be useful for temporal reasoning, e.g. to keep track of changes to versions of a resource. To push that logic further, one may want to distinguish in a multi-user system between graphs for the same resource when requested by different people, as explained in great detail in CLEREZZA-490. It is clear then that naming a remote graph by the URL of its remote resource is not the final solution. But unless we have a function (user, time, URI) -> graph, giving the graph the name of the resource is certainly the easiest, as it makes reading the database a lot easier: one just needs to look at the name of a graph to know where it came from.

        The fully correct system for remote graphs would be something like this in N3:

        :g202323 a :Graph;
            = { ... };
            :fetchedFrom <https://remote.com/>;
            :fetchedBy <http://bblfish.net/people/henry/card#me>;
            :representation <file:/tmp/repr/202323>;
            :httpMeta [ :etag "sdfsdfsddfs";
                        :validTo "2012...."^^xsd:dateTime;
                        # ... redirect info?
                      ] .

        :g202324 a :Graph;
            = { ... };
            :fetchedFrom <https://remote.com/>;
            :fetchedBy <http://farewellutopia.com/reto#me>;
            :representation <file:/tmp/repr/202324>;
            :httpMeta [ :etag "ddfsdfsddfd";
                        :validTo "2012...."^^xsd:dateTime;
                        # ... redirect info?
                      ] .

        This would allow the proxy to:

        • be much more useful for debugging: when a remote document is broken, it can help the user see where
        • know when to fetch new remote representations
        • distinguish representations sent to different users

        It is arguable that the better remote systems are written, i.e. the more RESTful they are, the more the name of the remote graph can simply be the name of the remote resource, since these will be names of unchanging entities. Given that, we could start by naming remote graphs the RESTful way, as that is easiest and most likely to force us to be RESTful ourselves.

        Reto Bachmann-Gmür added a comment -

        I think things should be closer to existing APIs. I suggest extracting

        getTriples(UriRef name) : TripleCollection

        and maybe
        listTripleCollections() : List<TripleCollection>

        from TcManager/TcProvider to a new interface.

        This issue should create a new service (WebProxy) that implements that interface. The WebProxy should use TcManager to store local caches; the local caches, I think, should be named differently from the remote graphs.
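Assuming the extracted interface keeps the two signatures above, it might look like the following sketch. UriRef and TripleCollection are stubs standing in for the real Clerezza types, and TripleCollectionSource is a hypothetical name for the new interface.

```java
import java.util.List;

// Stubs standing in for the Clerezza types of the same name.
final class UriRef {
    final String unicodeString;
    UriRef(String unicodeString) { this.unicodeString = unicodeString; }
}

interface TripleCollection {
    int size(); // stand-in for the real triple-level API
}

// The read-only interface extracted from TcManager/TcProvider.
// TcManager and the WebProxy would both implement it, so callers
// need not care whether a graph is local or fetched from the web.
interface TripleCollectionSource {
    TripleCollection getTriples(UriRef name);
    List<TripleCollection> listTripleCollections();
}
```

The point of the extraction is that code written against TripleCollectionSource works unchanged whether it is handed the TcManager or the WebProxy.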

        Henry Story added a comment -

        Lots of improvements possible, but it works

        Henry Story added a comment -

        Yes. I'll leave the big optimization pieces out of this as much as possible. The really important optimization we want is the one described here:

        http://metacircular.wordpress.com/2007/02/07/towards-polite-http-retrieval-in-scala/

        That is, we should remember what the ETag and the valid-until date were when downloading, so that we can do conditional GETs.
        It should also be possible to download multiple URLs in parallel.

        Perhaps I should open a new issue called "optimize the SemWebProxy bundle"?

        [1] http://fmpwizard-scala.posterous.com/using-apache-httpclient-authentication-in-sca
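The parallel-download part can be sketched with a plain thread pool. ParallelFetcher and fetchAll are hypothetical names, and the fetch function is a parameter so a real implementation could plug in the conditional-GET logic.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Fetch several URLs concurrently instead of one after the other.
class ParallelFetcher {
    static Map<String, String> fetchAll(List<String> urls, Function<String, String> fetch)
            throws InterruptedException, ExecutionException {
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.max(1, Math.min(urls.size(), 8)));
        try {
            // Submit everything first so the downloads overlap...
            Map<String, Future<String>> pending = new LinkedHashMap<>();
            for (String url : urls) {
                pending.put(url, pool.submit(() -> fetch.apply(url)));
            }
            // ...then collect the results in request order.
            Map<String, String> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<String>> e : pending.entrySet()) {
                results.put(e.getKey(), e.getValue().get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Scala Actors (or futures) would serve the same purpose; the essential point is only that all requests are submitted before any result is awaited.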

        Reto Bachmann-Gmür added a comment -

        I'd suggest leaving optimization issues out of this issue, and having only the extraction of what is currently in the platform, plus a nice interface with blocking and non-blocking methods, as part of this issue.

        Henry Story added a comment -

        To have a system that can scale well, one needs a good library for fetching files. I am looking at Restlet for this:

        http://wiki.restlet.org/docs_2.1/13-restlet/27-restlet/325-restlet/37-restlet.html

        which can work with the Apache HTTP client library, which is connectionless. If it is connectionless, perhaps it can be built into a Scala Actor framework.


          People

          • Assignee: Unassigned
          • Reporter: Henry Story
          • Votes: 0
          • Watchers: 0

          Dates

          • Created:
          • Updated:
          • Resolved:

          Time Tracking

          • Original Estimate: 168h
          • Remaining Estimate: 168h
          • Time Spent: Not Specified