Apache Jena / JENA-216

Official Turtle Test-18 does not parse

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: ARQ 2.9.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Environment:

      Java 6, OSX

      Description

      I am having trouble parsing http://www.w3.org/TR/turtle/tests/test-18.ttl, which contains the following two lines:

      <http://example.org/foo#a> <http://example.org/foo#b> "\nthis \ris a \U00015678long\t\nliteral\uABCD\n" .
      <http://example.org/foo#d> <http://example.org/foo#e> "\tThis \uABCDis\r \U00015678another\n\none\n" .

      scala> import java.io._
      import java.io._

      scala> import com.hp.hpl.jena.rdf.model._
      import com.hp.hpl.jena.rdf.model._

      scala> val f = "/Volumes/Dev/Programming/w3.org/git/pimp-my-rdf/n3-test-suite/target/scala-2.9.1/classes/www.w3.org/TR/turtle/tests/test-18.out"
      f: java.lang.String = /Volumes/Dev/Programming/w3.org/git/pimp-my-rdf/n3-test-suite/target/scala-2.9.1/classes/www.w3.org/TR/turtle/tests/test-18.out

      scala> val in = new InputStreamReader(new BufferedInputStream(new FileInputStream(f)),"UTF-8")
      in: java.io.InputStreamReader = java.io.InputStreamReader@1e392427

      scala> val model = ModelFactory.createDefaultModel()
      model: com.hp.hpl.jena.rdf.model.Model = <ModelCom {} | >

      scala> model.read(in,"file:/"+f,"TTL")
      com.hp.hpl.jena.n3.turtle.TurtleParseException: Lexical error at line 1, column 71. Encountered: "U" (85), after : "\"
      nthis
      ris a
      "
      at com.hp.hpl.jena.n3.turtle.ParserTurtle.parse(ParserTurtle.java:56)
      at com.hp.hpl.jena.n3.turtle.TurtleReader.readWorker(TurtleReader.java:33)
      at com.hp.hpl.jena.n3.JenaReaderBase.readImpl(JenaReaderBase.java:119)
      at com.hp.hpl.jena.n3.JenaReaderBase.read(JenaReaderBase.java:49)
      at com.hp.hpl.jena.rdf.model.impl.ModelCom.read(ModelCom.java:261)

      or, more directly:

      scala> model.read("http://www.w3.org/TR/turtle/tests/test-18.ttl","TTL")
      com.hp.hpl.jena.n3.turtle.TurtleParseException: Lexical error at line 3, column 25. Encountered: "U" (85), after : "\"
      nthis
      ris a
      "
      at com.hp.hpl.jena.n3.turtle.ParserTurtle.parse(ParserTurtle.java:56)
      at com.hp.hpl.jena.n3.turtle.TurtleReader.readWorker(TurtleReader.java:33)
      at com.hp.hpl.jena.n3.JenaReaderBase.readImpl(JenaReaderBase.java:119)
      at com.hp.hpl.jena.n3.JenaReaderBase.read(JenaReaderBase.java:49)
      at com.hp.hpl.jena.n3.JenaReaderBase.read(JenaReaderBase.java:60)
      at com.hp.hpl.jena.rdf.model.impl.ModelCom.read(ModelCom.java:241)

      This is with the 2.9 release of Jena from December, which I imported into my project with

      "org.apache.jena" % "jena-arq" % "2.9.0-incubating"

        Activity

        Henry Story added a comment -

        My URLs to the test cases were wrong.

        Andy Seaborne added a comment -

        (There aren't any "official" tests yet: the RDF-WG has copied over the old tests and is adding a few, but nothing is official yet.)

        test-18.ttl does parse in the new parsers. You're using the old ones because you haven't touched anything that causes them to wire themselves in. We're in transition.

        Add SysRIOT.wireIntoJena() or, easier, ARQ.init(), or just use the command line tools.

        Note: there is a Unicode code point beyond the Basic Multilingual Plane in that data. It will not perfectly round-trip in Java (or Scala on the JVM) because Java strings are UTF-16 and can represent such code points only as surrogate pairs.

        java -cp ... arq.riot test-18.ttl =>
        <http://example.org/foo#a> <http://example.org/foo#b> "\nthis \ris a \uD815\uDE78long\t\nliteral\uABCD\n" .
        <http://example.org/foo#d> <http://example.org/foo#e> "\tThis \uABCDis\r \uD815\uDE78another\n\none\n" .

        Note: the \U is now two \u's.
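
To illustrate the note above, here is a minimal Java sketch that uses only the JDK (no Jena classes). The class name SurrogateDemo and the helper escapeUnits are hypothetical names chosen for this example. It shows how the single code point U+15678 occupies two UTF-16 code units in a Java string, which is why the output above contains two escapes.

```java
public class SurrogateDemo {
    // Render every UTF-16 code unit of s as a backslash, 'u' and four hex digits.
    public static String escapeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("\\u%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int codePoint = 0x15678; // the supplementary code point from test-18.ttl
        String s = new String(Character.toChars(codePoint));

        System.out.println(s.length());                             // 2 code units
        System.out.println(s.codePointCount(0, s.length()));        // 1 code point
        System.out.println(escapeUnits(s));                         // escapes D815 and DE78
        System.out.println(Integer.toHexString(s.codePointAt(0)));  // 15678
    }
}
```

Reading the string back with codePointAt recovers the original value, so the data is preserved; only the escaped spelling changes.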

        Andy Seaborne added a comment -

        By the way:

        Turtle test suite is here:
        https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/testing/RIOT/TurtleStd/

        which includes fixes

        • Illegal characters such as \n or \u0000 in URIs are not expected to parse.
          The tests assume the parser does no checking, but RIOT does, so you get line numbers.
        • test-28.out in the Turtle test suite is (as you've already found out) just plain wrong.

        IIRC \U00015678 isn't a legal code point (i.e. not allocated) in all versions of Unicode. As of Java 6, I think it's now OK.
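
The distinction between a code point being in range and being allocated can be checked directly on the JVM. The following is a small sketch using only java.lang.Character; the class name CodePointCheck and the method describe are hypothetical. Whether isDefined returns true depends on the Unicode version bundled with the JDK in use, so no particular result is assumed for it.

```java
public class CodePointCheck {
    // Summarise how the running JDK classifies a code point.
    public static String describe(int cp) {
        return "valid=" + Character.isValidCodePoint(cp)                  // within U+0000..U+10FFFF
             + " supplementary=" + Character.isSupplementaryCodePoint(cp) // above U+FFFF
             + " defined=" + Character.isDefined(cp);                     // allocated in this JDK's Unicode tables
    }

    public static void main(String[] args) {
        System.out.println(describe(0x15678));
    }
}
```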

        Henry Story added a comment -

        OK, I fixed those two tests in alignment with yours:

        https://github.com/betehess/pimp-my-rdf/commit/460d16ac3829dbd963e500c1367b5e45edf3428c

        Test-29 - the IRI test - I took the third option. Abera IRI barfs on the other two.


          People

          • Assignee:
            Andy Seaborne
            Reporter:
            Henry Story