  1. Apache Taverna
  2. TAVERNA-1044

Parsing COMBINE archive from JWSOnline skips metadata.rdf

Details

    • Type: Bug
    • Status: Done
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: language 0.15.1
    • Fix Version/s: language 0.16.0
    • Component/s: Taverna Language
    • Labels: None

    Description

      When parsing a COMBINE archive from JWS Online, such as http://jjj.mib.ac.uk/models/experiments/adlung2017_fig2f/export/combinearchive?download=1, the metadata.rdf does not seem to be parsed.

      Error trace

      stain@biggie:/tmp$ curl -fO --remote-header-name 'http://jjj.mib.ac.uk/models/experiments/adlung2017_fig2f/export/combinearchive?download=1'
      curl: Saved to filename 'adlung2017_fig2f.sedx'
      
      stain@biggie:/tmp$ java -jar ~/software/taverna-tavlang-tool-0.15.1-incubating.jar convert --robundle adlung2017_fig2f.sedx 
      ..
      
      May 10, 2018 10:35:43 AM org.apache.taverna.robundle.manifest.combine.CombineManifest findAnnotations
      WARNING: Can't parse /metadata.rdf
      org.apache.jena.riot.RiotException: [line: 6, col: 43] {E202} Expecting XML start or end element(s). String data "2018-05-10T02:38:51Z" not allowed. Maybe there should be an rdf:parseType='Literal' for embedding mixed XML content in RDF. Maybe a striping error.
      	at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.error(ErrorHandlerFactory.java:128)
      	at org.apache.jena.riot.lang.LangRDFXML$ErrorHandlerBridge.error(LangRDFXML.java:246)
      	at org.apache.jena.rdfxml.xmlinput.impl.ARPSaxErrorHandler.error(ARPSaxErrorHandler.java:37)
      	at org.apache.jena.rdfxml.xmlinput.impl.XMLHandler.warning(XMLHandler.java:196)
      	at org.apache.jena.rdfxml.xmlinput.impl.XMLHandler.warning(XMLHandler.java:173)
      	at org.apache.jena.rdfxml.xmlinput.impl.XMLHandler.warning(XMLHandler.java:168)
      	at org.apache.jena.rdfxml.xmlinput.impl.ParserSupport.warning(ParserSupport.java:194)
      	at org.apache.jena.rdfxml.xmlinput.states.Frame.warning(Frame.java:55)
      	at org.apache.jena.rdfxml.xmlinput.states.Frame.characters(Frame.java:164)
      	at org.apache.jena.rdfxml.xmlinput.impl.XMLHandler.characters(XMLHandler.java:137)
      	at org.apache.xerces.parsers.AbstractSAXParser.characters(Unknown Source)
      	at org.apache.xerces.impl.XMLNamespaceBinder.characters(Unknown Source)
      	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
      	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
      	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
      	at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
      	at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
      	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
      	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
      	at org.apache.jena.rdfxml.xmlinput.impl.RDFXMLParser.parse(RDFXMLParser.java:150)
      	at org.apache.jena.rdfxml.xmlinput.ARP.load(ARP.java:118)
      	at org.apache.jena.riot.lang.LangRDFXML.parse(LangRDFXML.java:142)
      	at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:175)
      	at org.apache.jena.riot.RDFDataMgr.process(RDFDataMgr.java:905)
      	at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:256)
      	at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:242)
      	at org.apache.taverna.robundle.manifest.combine.CombineManifest.parseRDF(CombineManifest.java:240)
      	at org.apache.taverna.robundle.manifest.combine.CombineManifest.findAnnotations(CombineManifest.java:332)
      	at org.apache.taverna.robundle.manifest.combine.CombineManifest.readCombineArchive(CombineManifest.java:465)
      	at org.apache.taverna.robundle.Bundle.readOrPopulateManifest(Bundle.java:121)
      	at org.apache.taverna.robundle.Bundle.getManifest(Bundle.java:87)
      	at org.apache.taverna.tavlang.tools.convert.ToRobundle.convert(ToRobundle.java:60)
      	at org.apache.taverna.tavlang.tools.convert.ToRobundle.<init>(ToRobundle.java:47)
      	at org.apache.taverna.tavlang.CommandLineTool$CommandConvert.runcommand(CommandLineTool.java:226)
      	at org.apache.taverna.tavlang.CommandLineTool$CommandConvert.execute(CommandLineTool.java:220)
      	at org.apache.taverna.tavlang.CommandLineTool.parse(CommandLineTool.java:71)
      	at org.apache.taverna.tavlang.TavernaCommandline.main(TavernaCommandline.java:26)
      
      

      Analysis

      This seems to be caused by invalid RDF/XML in the metadata.rdf added by JWS Online:

      stain@biggie:/tmp$ unzip adlung2017_fig2f.sedx
      
      stain@biggie:/tmp$ riot metadata.rdf 
      10:39:17 ERROR riot                 :: [line: 6, col: 43] {E202} Expecting XML start or end element(s). String data "2018-05-10T02:38:51Z" not allowed. Maybe there should be an rdf:parseType='Literal' for embedding mixed XML content in RDF. Maybe a striping error.
      10:39:17 ERROR riot                 :: [line: 43, col: 43] {E202} Expecting XML start or end element(s). String data "2018-05-10T02:38:51Z" not allowed. Maybe there should be an rdf:parseType='Literal' for embedding mixed XML content in RDF. Maybe a striping error.
      10:39:17 ERROR riot                 :: [line: 152, col: 43] {E202} Expecting XML start or end element(s). String data "2018-05-10T02:38:51Z" not allowed. Maybe there should be an rdf:parseType='Literal' for embedding mixed XML content in RDF. Maybe a striping error.
      ...
      <file:///tmp/> <http://purl.org/dc/terms/description> "Built by JWS Online." .
      _:B5145c9a4X2Df8feX2D4a36X2Daba1X2Dacab299dd7d7 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/dc/terms/W3CDTF> .
      <file:///tmp/> <http://purl.org/dc/terms/created> _:B5145c9a4X2Df8feX2D4a36X2Daba1X2Dacab299dd7d7 .
      <file:///tmp/models/adlung1.sbml> <http://purl.org/dc/terms/description> "Exported by JWS Online from ..."
      

      The broken RDF/XML follows this pattern:

        <rdf:Description rdf:about=".">
          <dcterms:description>Built by JWS Online.</dcterms:description>
          <dcterms:created>
            <dcterms:W3CDTF>2018-05-10T02:38:51Z</dcterms:W3CDTF>
          </dcterms:created>
        </rdf:Description>
      

      As Jena points out, this is not valid RDF/XML: it states that the property dcterms:created points to a new anonymous W3CDTF resource, but a resource can't directly wrap a literal. The literal then needs a new nested property like <rdf:value>:

          <dcterms:created>
            <dcterms:W3CDTF>
              <rdf:value>2018-05-10T02:38:51Z</rdf:value>
            </dcterms:W3CDTF>
          </dcterms:created>
      

      This probably stems from confusion over http://identifiers.org/combine.specifications/omex.version-1, whose example, for some reason, uses dcterms:W3CDTF as a property of an untyped anonymous resource under dcterms:created:

      <dcterms:created rdf:parseType="Resource">
        <dcterms:W3CDTF>2014-06-26T10:29:00Z</dcterms:W3CDTF>
      </dcterms:created>
      

      This is semantically wrong as dcterms:W3CDTF is defined as a Datatype (like int), not a Property. Similarly dcterms:created is defined with a range rdfs:Literal, which would not include a new W3CDTF Resource.

      I believe dcterms:W3CDTF is meant as a grouping of the XSD datatypes like xsd:dateTime but is listed in DCTerms for pure XML users.

      dcterms:created is more commonly used with a typed RDF literal rather than through some kind of anonymous "timestamp" resource. So normal use (outside COMBINE) would be:

      <dcterms:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2014-06-26T10:29:00Z</dcterms:created>
      

      Our CombineManifest code supports both variants as the parseType=Resource variant is commonly used by COMBINE producers.

      The example from JWS Online, however, is in between the two. I have let the authors know and recommended that they use the rdf:value or rdf:datatype variant. The tavlang converter should then also recognize rdf:value.
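
      To make the variants discussed above concrete, here is a minimal sketch of how a reader could pull the timestamp out of dcterms:created in any of them. It uses the JDK's plain DOM parser rather than Jena, so it is not how CombineManifest works; the class CreatedDateSketch and its methods findCreated, createdValue and wrap are hypothetical names made up for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical helper for illustration only; not part of Taverna Language. */
class CreatedDateSketch {
    static final String DCTERMS = "http://purl.org/dc/terms/";
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Returns the timestamp of the first dcterms:created element, if present. */
    static Optional<String> findCreated(String rdfXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the namespace-based lookups
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(rdfXml.getBytes(StandardCharsets.UTF_8)));
            NodeList created = doc.getElementsByTagNameNS(DCTERMS, "created");
            if (created.getLength() == 0) {
                return Optional.empty();
            }
            return createdValue((Element) created.item(0));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    /** Handles the typed-literal, nested W3CDTF, and rdf:value variants. */
    static Optional<String> createdValue(Element created) {
        Element w3cdtf = firstChild(created, DCTERMS, "W3CDTF");
        if (w3cdtf == null) {
            // Typed-literal variant: the text sits directly in dcterms:created.
            return nonEmpty(created.getTextContent());
        }
        // Prefer a nested rdf:value; otherwise take the text directly inside
        // dcterms:W3CDTF (parseType="Resource" and the JWS Online pattern).
        Element value = firstChild(w3cdtf, RDF, "value");
        return nonEmpty((value != null ? value : w3cdtf).getTextContent());
    }

    /** First direct child element with the given namespace and local name, or null. */
    static Element firstChild(Element parent, String ns, String local) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && ns.equals(n.getNamespaceURI())
                    && local.equals(n.getLocalName())) {
                return (Element) n;
            }
        }
        return null;
    }

    private static Optional<String> nonEmpty(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? Optional.empty() : Optional.of(trimmed);
    }

    /** Wraps a dcterms:created fragment in a minimal rdf:RDF document. */
    static String wrap(String createdXml) {
        return "<rdf:RDF xmlns:rdf='" + RDF + "' xmlns:dcterms='" + DCTERMS + "'>"
                + "<rdf:Description rdf:about='.'>" + createdXml
                + "</rdf:Description></rdf:RDF>";
    }

    public static void main(String[] args) {
        String ts = "2018-05-10T02:38:51Z";
        String[] variants = {
            // Pattern produced by JWS Online (invalid RDF/XML, but well-formed XML)
            "<dcterms:created><dcterms:W3CDTF>" + ts + "</dcterms:W3CDTF></dcterms:created>",
            // Recommended rdf:value variant
            "<dcterms:created><dcterms:W3CDTF><rdf:value>" + ts
                + "</rdf:value></dcterms:W3CDTF></dcterms:created>",
            // parseType="Resource" variant from the OMEX example
            "<dcterms:created rdf:parseType='Resource'><dcterms:W3CDTF>" + ts
                + "</dcterms:W3CDTF></dcterms:created>",
            // Typed literal variant
            "<dcterms:created rdf:datatype='http://www.w3.org/2001/XMLSchema#dateTime'>"
                + ts + "</dcterms:created>",
        };
        for (String variant : variants) {
            System.out.println(findCreated(wrap(variant)).orElse("(none)"));
        }
    }
}
```

      Because this works at the XML level, it also reads the JWS Online pattern that an RDF/XML parser rejects. That tolerance is a design choice of the sketch, not a statement about what the converter should accept.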

      While it seems Jena's "riot" on the command line can ignore this syntactic error and parse the other triples, loading with Jena's RDFDataMgr.read() seems to bail out on the first error, meaning we also lose the dcterms:creator statements, which are correctly defined in the metadata.rdf.

      This bug is to investigate whether it's possible to reduce this error to a warning, and to add support for the rdf:value variant that we can recommend to JWS Online instead of the semantically broken parseType="Resource" pattern.

          People

            stain Stian Soiland-Reyes