Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Not a Problem
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ARQ, Jena, RDF/XML
    • Labels:
      None
    • Environment:

      2.6.4

      Description

      As I understand, initial and final white spaces in xsd:hexBinary in xml should be ignored

      http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/#hexBinary

      because of the whitespace facet.

      With Jena 2.6.4 this is not the case, as shown by the test below.
      I found that in Clerezza when using the graph api, so this is a problem even when one does not use SPARQL.
      Removing the white space solves the problem.

      xsd:hexBinary is already a very fragile encoding. Making it this fragile is bound to lead to issues in communication.
      The same is true with the N3 encoding.
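
A minimal sketch of the whitespace handling the description asks for, written against the plain JDK rather than Jena. It assumes the XSD `whiteSpace=collapse` facet is applied before decoding; the class and method names (`HexBinaryWhitespace`, `collapse`, `decode`, `sameValue`) are hypothetical and are not part of any Jena API.

```java
import java.util.Arrays;

// Hypothetical helper, not a Jena class: compares xsd:hexBinary lexical forms by value.
final class HexBinaryWhitespace {

    // whiteSpace=collapse: treat tab/LF/CR as spaces, squeeze runs to one space, trim both ends.
    static String collapse(String lexical) {
        StringBuilder out = new StringBuilder();
        boolean pendingSpace = false;
        for (char c : lexical.toCharArray()) {
            if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
                // Remember a separator only if some content was already emitted.
                pendingSpace = out.length() > 0;
            } else {
                if (pendingSpace) out.append(' ');
                pendingSpace = false;
                out.append(c);
            }
        }
        return out.toString();
    }

    // Value of a single ASCII hex digit; anything else (including a space) is rejected.
    static int hexDigit(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        throw new IllegalArgumentException("not a hex digit: '" + c + "'");
    }

    // Decode a lexical form into octets after collapsing whitespace.
    static byte[] decode(String lexical) {
        String hex = collapse(lexical);
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd number of characters: " + hex.length());
        }
        byte[] octets = new byte[hex.length() / 2];
        for (int i = 0; i < octets.length; i++) {
            octets[i] = (byte) ((hexDigit(hex.charAt(2 * i)) << 4) | hexDigit(hex.charAt(2 * i + 1)));
        }
        return octets;
    }

    // Two lexical forms denote the same value when they decode to the same octets.
    static boolean sameValue(String a, String b) {
        return Arrays.equals(decode(a), decode(b));
    }

    public static void main(String[] args) {
        String fromQuery = "AAAA";
        String fromRdfXml = "\n      AAAA\n      "; // element text as it appears in c1.rdf
        System.out.println(fromQuery.equals(fromRdfXml));    // false
        System.out.println(sameValue(fromQuery, fromRdfXml)); // true
    }
}
```

Interior whitespace such as "AA AA" is still rejected, because collapsing keeps one space between tokens and a space is not a hex digit.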

      -----------------------------------------------------------------
      hjs@bblfish[0]$ cat q1.sparql
      PREFIX : <http://me.example/p#>
      PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

      SELECT ?S WHERE

      { ?S :related "AAAA"^^xsd:hexBinary . }

      hjs@bblfish[0]$ cat c1.rdf

      <rdf:RDF xmlns="http://me.example/p#"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

      <rdf:Description rdf:about="http://me.example/p#me">
      <related rdf:datatype="http://www.w3.org/2001/XMLSchema#hexBinary">
      AAAA
      </related>
      </rdf:Description>
      </rdf:RDF>

      hjs@bblfish[0]$ arq --query=q1.sparql --data=c1.rdf


      S

      =====


        Activity

        Henry Story added a comment - - edited

        It seems to me that xsd:hexBinary would make more sense to store directly as a blob, and do value matching on. This is because there is really not much personalisation that one can do with a hexBinary. The numbers have to follow each other without white space. There can only be white space at the beginning and end (if at all). I can see that this would be more of an issue with xsd:base64Binary since that literal can contain newlines in the middle of the number, and there people may want to keep formatting.

        Anyway, is it difficult to add what I now understand is called a D-entailment regime for xsd:hexBinary to Jena? Is this something I can do as a developer?
        http://www.w3.org/TR/sparql11-entailment/#DEntRegime

        I am interested in this not just for my own implementations, but also because we need to specify this behavior carefully in the WebID spec http://webid.info/spec#verifying-the-webid-claim . We are using SPARQL there to make it as easy to understand for developers and readers of the Spec as possible - a spec that we hope will bring a lot of people into the semantic web community.

        Andy Seaborne added a comment -

        Yes (edited for future reader(s))

        Thx for pointing it out.

        Dennis E. Hamilton added a comment -

        @Andy Huh?

        xsd:base64Binary and xsd:base64Binary have a common value space but have no common type.

        Did you mean "xsd:hexBinary and xsd:base64Binary have a common value space but ..."

        Andy Seaborne added a comment - - edited

        Note: xsd:base64Binary and xsd:hexBinary have a common value space but have no common type.
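
To illustrate the note, here is a small JDK-only sketch showing a hexBinary lexical form and a base64Binary lexical form decoding to the same octet sequence. The class and method names (`BinaryValueSpace`, `fromHex`, `fromBase64`) are hypothetical, and the hex decoder is deliberately lenient (no validation beyond what `Integer.parseInt` does).

```java
import java.util.Arrays;
import java.util.Base64;

// Hypothetical illustration, not Jena code.
final class BinaryValueSpace {

    // Decode pairs of hex digits into octets; assumes an even-length, whitespace-free string.
    static byte[] fromHex(String hex) {
        byte[] octets = new byte[hex.length() / 2];
        for (int i = 0; i < octets.length; i++) {
            octets[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return octets;
    }

    // The basic JDK decoder; it does not accept embedded line breaks.
    static byte[] fromBase64(String b64) {
        return Base64.getDecoder().decode(b64);
    }

    public static void main(String[] args) {
        byte[] viaHex = fromHex("AAAA");       // octets 0xAA 0xAA
        byte[] viaBase64 = fromBase64("qqo="); // the same two octets in base64
        System.out.println(Arrays.equals(viaHex, viaBase64));
    }
}
```

The octets agree, yet "AAAA"^^xsd:hexBinary and "qqo="^^xsd:base64Binary remain literals of two different datatypes.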

        Andy Seaborne added a comment -

        Henry - this JIRA was filed as a major bug but ARQ is behaving as per spec.

        "Not a problem" isn't an ideal phrase (clearly, it's a problem for you) but it's not a bug.

        If you wish to submit a patch for ARQ, then we'd be delighted to receive it.

        Henry Story added a comment -

        The example I gave was in terms of a SPARQL query, so it's a problem there.

        Andy Seaborne added a comment -

        "Not a problem" = "not a bug"

        Henry Story added a comment -

        Perhaps one could have that be settable.

        Our use case is the WebID protocol, of which a recent spec is up here http://www.w3.org/2005/Incubator/webid/spec/
        This protocol needs to fetch a document and do an ASK query match – this is the simplest way of explaining to implementers what needs to be done. If people can fail to log in because of a space on either side of the number, then I think people will be justified in saying that SPARQL and RDF are too brittle to be used on the web...

        That an RDF engine does not process datatypes whose semantics it does not know, I understand. But hexBinary is one of the standard types.

        In the case of Clerezza, which does not use SPARQL, I'll fix that with sameValueAs. But on the ReadWriteWeb project
        I am using SPARQL. https://dvcs.w3.org/hg/read-write-web/file/d1d551188b0f/src/main/scala/auth/WebIdClaim.scala
        It really helps make the case for how easy and powerful the semantic web is. So it would be nice if I could keep things that simple.

        Andy Seaborne added a comment -

        The .sameValueAs() method tests value, .equals() tests identity of lexical form. You call the one you want.

        Presumably Clerezza is calling .equals() or it's a persistent storage layer. TDB doesn't handle xsd:hexBinary as a value-based type. It only handles numeric types, dates, dateTimes, Gregorian dates.

        And it is an RDF datatype.

        RDF datatypes are declared in RDF/XML with rdf:datatype="....." – an RDF mechanism, which is open. There isn't a fixed set of datatypes like XML Schema Datatypes.

        XML datatypes use external declaration or xsi:type. I believe that xsi:type can only refer to an XSD datatype.
        It only applies to XML.

        RDF isn't the XML document model and isn't necessarily in XML (c.f. Turtle). There may be other reasons the XML Schema datatype syntax was not applicable - I wasn't there at the time. Timing might be part of it - RDF finished Feb 2004, XQuery/Xpath data model is Jan 2007 with earliest candidate rec Nov 2005.

        SPARQL (and RDF by encouragement) uses the data model from XSD datatypes (lexical/value mapping), but not the syntax.

        Jena memory models do support a lot of value-based matching but this is costly. They support matching xsd:hexBinary by value if you call .sameValueAs; if you call .equals, you get strict equality. "001"^^xsd:integer and "1"^^xsd:integer are, at the lowest level of the RDF abstract data model, different. It could be an RDF datatype that has never been met before – "IIII"^^my:roman and "IV"^^my:roman.

        Users ask that reading in and writing out data does not change the format; the memory model keeps both forms around which is OK for numerics, but xsd:hexBinary can be large blobs, which is unfortunate.

        Canonicalization is a technique that emphasises the value at the expense of losing different forms in different places in the data. A tradeoff.

        Jena persistent storage layers don't keep both value and lexical form about. Indexing does not work.

        Instead, TDB stores the value of numeric types, dates, dateTimes, Gregorian dates (in binary). It rebuilds nodes as their canonical form.
        TDB does not do anything special for xsd:hexBinary, which is typically used as blobs, so it does not do value-based matching, only lexical form matching.

        It could be added - users also want round-trip of layout.

        Dennis E. Hamilton added a comment - - edited

        Wait a minute. The xsd Datatype is specified and then its lexical and nominal value rules are ignored?

        So it is not really the xsd Datatype, it is something that you call that but it is treated as if nothing more than xsd:string and handled literally? Oh my.

        So somewhere in RDF it says that the canonicalization of the data-typed literals is the user's responsibility?

        Andy Seaborne added a comment - - edited

        Henry -

        Node n1 = SSE.parseNode("'AA'^^xsd:hexBinary") ;
        Node n2 = SSE.parseNode("' AA '^^xsd:hexBinary") ;

        System.out.println(n1.equals(n2)) ; // ==> false
        System.out.println(n1.sameValueAs(n2)) ; // ==> true

        The same would be true for Literal.sameValueAs.

        You are right that xsd:hexBinary has the whitespace facet enabled (oddly, so does xsd:anyURI).

        Jena keeps the lexical form for the literal as given, and in creating nodes, it does not modify the presented lexical form (one exception: rdf:XMLLiterals, because parseType="Literal" requires XC14N to be applied). RDF/XML makes the lexical form of a literal the text of the XML element, and does not apply XSD rules. The RDF abstract syntax is agnostic to value processing (i.e. D-entailment).

        There is no XML Schema processing in parsing RDF/XML, so the whitespace facet is not applied.

        This case is the same as integers 0001 and 1. Same value but different lexical forms so different RDF literals. I guess Clerezza uses .equals not .sameValueAs.

        With the emergence of Turtle, this situation will be messier.

        We have talked about canonicalization of all input (see
        org.openjena.riot.pipeline.normalize.CanonicalizeLiteral). Whitespace processing could be included. Canonicalization is not free though.

        But losing the layout on large xsd:hexBinary/xsd:base64Binary might mean needing to teach the writers to lay out these literals.

        The situation in ARQ is that in basic graph pattern matching, matching is by exact node equality (simple entailment unless you are using a reasoner).

        Filters, however, do value testing for certain well-known datatypes. ARQ adds various types over the minimum required by SPARQL - it adds the Gregorian dates (gYear, gMonthDay etc), xsd:date, and XSD durations. It does not include xsd:hexBinary though.

        If the input data is canonicalized, the equality of node will be the same as value-equality.
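
A minimal sketch of what such input canonicalization could look like for xsd:hexBinary, using only the JDK. It is not the code in CanonicalizeLiteral; the class and method names (`HexBinaryCanonicalizer`, `canonicalize`) are hypothetical. It assumes the canonical form is the whitespace-collapsed lexical form with upper-case hex digits, so that plain string equality on the result coincides with value equality.

```java
import java.util.Locale;

// Hypothetical helper, not part of Jena or RIOT.
final class HexBinaryCanonicalizer {

    // Returns the canonical lexical form, or throws if the input is not valid hexBinary.
    static String canonicalize(String lexical) {
        // whiteSpace=collapse: squeeze whitespace runs to one space, then drop the ends.
        String collapsed = lexical.replaceAll("[ \\t\\n\\r]+", " ").replaceAll("^ | $", "");
        // Validate before changing case: an even number of ASCII hex digits, nothing else.
        if (!collapsed.matches("([0-9A-Fa-f]{2})*")) {
            throw new IllegalArgumentException("not a valid xsd:hexBinary lexical form");
        }
        // Upper-case digits so that "aaaa" and "AAAA" map to one form.
        return collapsed.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String fromData = canonicalize("\n      AAAA\n      ");
        String fromQuery = canonicalize("AAAA");
        // After canonicalization, exact string equality is enough.
        System.out.println(fromData.equals(fromQuery)); // true
    }
}
```

The tradeoff named above still applies: the original layout of the literal is lost once the canonical form replaces it.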


          People

          • Assignee:
            Andy Seaborne
            Reporter:
            Henry Story
          • Votes:
            0
            Watchers:
            0

            Dates

            • Created:
              Updated:
              Resolved:
