Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.8.0
    • Fix Version/s: 1.0
    • Component/s: core
    • Labels: None

    Description

      As a follow up to discussion [1].

      I've implemented another RDFa extractor for Any23 (0.7.1).
      The proposed code depends on the Semargl project [2].
      The pull request is located at [3].

      [1] http://mail-archives.apache.org/mod_mbox/any23-dev/201212.mbox/browser
      [2] http://semarglproject.org
      [3] https://github.com/apache/any23/pull/2

      Attachments

        1. oQYfomKX.part
          4 kB
          Lewis John McGibbney
        2. rdfa-extractor-proposal.patch
          16 kB
          Lev Khomich

        Issue Links

          Activity

            levkhomich Lev Khomich created issue -
            levkhomich Lev Khomich made changes -
            Field Original Value New Value
            Attachment rdfa-extractor-proposal.patch [ 12561967 ]
            levkhomich Lev Khomich made changes -
            Link This issue depends on ANY23-136 [ ANY23-136 ]
            levkhomich Lev Khomich made changes -
            Link This issue is related to ANY23-128 [ ANY23-128 ]
            levkhomich Lev Khomich made changes -
            Link This issue is related to ANY23-135 [ ANY23-135 ]
            levkhomich Lev Khomich made changes -
            Link This issue is related to ANY23-100 [ ANY23-100 ]
            levkhomich Lev Khomich added a comment -

            As a temporary solution:

            <repositories>
              <repository>
                <id>Semargl repository</id>
                <url>https://github.com/levkhomich/semargl/raw/master/maven-repo</url>
              </repository>
            </repositories>

            <dependencies>
              <dependency>
                <groupId>org.semarglproject</groupId>
                <artifactId>semargl-sesame</artifactId>
                <version>0.3</version>
              </dependency>
            </dependencies>

            lewismc Lewis John McGibbney made changes -
            Fix Version/s 0.7.1 [ 12322500 ]

            lewismc Lewis John McGibbney added a comment -

            Hi Lev. You've marked this issue as being related to a number of other issues. Can you please provide info on how many of these have been addressed within the scope of your proposal? Thank you.
            levkhomich Lev Khomich added a comment -

            Hi Lewis. It should fix all related issues and failed test cases at [1].

            Semargl v0.4 will be released in 1-2 weeks and will be available on Maven Central, so I can then update the proposal for further review.

            [1] http://rdfa.info/test-suite


            lewismc Lewis John McGibbney added a comment -

            Thanks. In the meantime, I'll add the temporary solution to test this one out. Thank you.
            ansell Peter Ansell made changes -
            Fix Version/s 0.8.0 [ 12319885 ]
            Fix Version/s 0.7.1 [ 12322500 ]
            ansell Peter Ansell made changes -
            Affects Version/s 0.8.0 [ 12319885 ]
            Affects Version/s 0.7.1 [ 12322500 ]
            ansell Peter Ansell made changes -
            Link This issue is related to ANY23-65 [ ANY23-65 ]
            ansell Peter Ansell made changes -
            Link This issue is related to ANY23-69 [ ANY23-69 ]

            lewismc Lewis John McGibbney added a comment -

            Hi Lev,
            I've also come across another issue with the existing html-rdfa11 Extractor implementation and have attached the file.
            For reference, here is the log report and output.

            <response><extractors><extractor>html-head-title</extractor><extractor>html-mf-hcard</extractor><extractor>html-mf-adr</extractor><extractor>html-rdfa11</extractor></extractors><report><message/><error/><issueReport><extractorIssues extractor="html-rdfa11"><issue level="Warning" row="202" col="30">Error while processing node [/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[1]/DIV[2]/DIV[1]/DIV[1]/P[1]/SPAN[1]/A[1]] : 'Cannot map prefix 'width''</issue><issue level="Warning" row="204" col="30">Error while processing node [/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[1]/DIV[2]/DIV[1]/DIV[1]/P[2]/SPAN[1]/A[1]] : 'Cannot map prefix 'width''</issue><issue level="Warning" row="208" col="30">Error while processing node [/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[1]/DIV[2]/DIV[1]/DIV[2]/P[1]/SPAN[1]/A[1]] : 'Cannot map prefix 'width''</issue></extractorIssues></issueReport><validationReport><errors>
            </errors><ruleActivations>
            </ruleActivations><issues>
            </issues></validationReport></report><data>
            # OUTPUT FORMAT: Turtle (mimeTypes=text/turtle, application/x-turtle; ext=ttl)
            # BEGIN: ExtractionContext(urn:x-any23:html-mf-adr:root-extraction-result-id:http://stanford.edu/)
            @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
            @prefix vcard: <http://www.w3.org/2006/vcard/ns#> .
            # BEGIN: ExtractionContext(urn:x-any23:html-mf-adr:1:http://stanford.edu/)
            # BEGIN: ExtractionContext(urn:x-any23:html-head-title:root-extraction-result-id:http://stanford.edu/)
            @prefix dcterms: <http://purl.org/dc/terms/> .
            
            <http://stanford.edu/> dcterms:title "Stanford University"@en .
            
            _:noded01df813432682e65b842257f3757e9 a vcard:Address ;
            	vcard:locality "450 Serra Mall, Stanford" ;
            	vcard:region "CA" ;
            	vcard:postal-code "94305" .
            # BEGIN: ExtractionContext(urn:x-any23:html-mf-hcard:root-extraction-result-id:http://stanford.edu/)
            
            _:node68324ba1f68fb1712ae267fe33274 vcard:fn "Stanford University" ;
            	vcard:n _:node17eprgndbx338343 .
            
            _:node17eprgndbx338343 a vcard:Name ;
            	vcard:given-name "Stanford" ;
            	vcard:family-name "University" .
            
            _:node68324ba1f68fb1712ae267fe33274 vcard:org _:node17eprgndbx338344 .
            
            _:node17eprgndbx338344 a vcard:Organization ;
            	vcard:organization-name "Stanford University" .
            
            _:node68324ba1f68fb1712ae267fe33274 vcard:adr _:noded01df813432682e65b842257f3757e9 ;
            	vcard:tel <tel:(650)%20723-2300> .
            # BEGIN: ExtractionContext(urn:x-any23:html-mf-hcard:1:http://stanford.edu/)
            
            _:node68324ba1f68fb1712ae267fe33274 a vcard:VCard .
            # BEGIN: ExtractionContext(urn:x-any23:html-rdfa11:root-extraction-result-id:http://stanford.edu/)
            
            <http://stanford.edu/> <http://stanford.edu/alternate> <http://news.stanford.edu/rss/index.xml> .
            
            <http://stanford.edu/css/layout.css?v=3.0> <http://stanford.edu/stylesheet> <http://news.stanford.edu/rss/index.xml> .
            
            <http://stanford.edu/css/homepage.css?v=3.1> <http://stanford.edu/stylesheet> <http://news.stanford.edu/rss/index.xml> .
            
            <http://stanford.edu/css/jquery.fancybox.css?v=2.0.5> <http://stanford.edu/stylesheet> <http://news.stanford.edu/rss/index.xml> .
            
            <http://stanford.edu/css/mobile.css> <http://stanford.edu/stylesheet> <http://news.stanford.edu/rss/index.xml> .
            
            <https://fonts.googleapis.com/css?family=Source+Sans+Pro:300,400,600,700> <http://stanford.edu/stylesheet> <http://news.stanford.edu/rss/index.xml> .
            
            <https://fonts.googleapis.com/css?family=Crimson+Text:400,600,700> <http://stanford.edu/stylesheet> <http://news.stanford.edu/rss/index.xml> .
            # END: ExtractionContext(urn:x-any23:html-mf-adr:1:http://stanford.edu/)
            # END: ExtractionContext(urn:x-any23:html-mf-adr:root-extraction-result-id:http://stanford.edu/)
            # END: ExtractionContext(urn:x-any23:html-rdfa11:root-extraction-result-id:http://stanford.edu/)
            # END: ExtractionContext(urn:x-any23:html-head-title:root-extraction-result-id:http://stanford.edu/)
            # END: ExtractionContext(urn:x-any23:html-mf-hcard:root-extraction-result-id:http://stanford.edu/)
            # END: ExtractionContext(urn:x-any23:html-mf-hcard:1:http://stanford.edu/)
            </data></response>
            
            lewismc Lewis John McGibbney added a comment -

            Attachment for a potential additional bug (within the current RDFa 1.1 implementation) to be considered.
            lewismc Lewis John McGibbney made changes -
            Attachment oQYfomKX.part [ 12567060 ]
            levkhomich Lev Khomich added a comment -

            Semargl 0.4 was published to Maven Central.

            The dependency now looks like this:

            <dependencies>
              <dependency>
                <groupId>org.semarglproject</groupId>
                <artifactId>semargl-sesame</artifactId>
                <version>0.4</version>
              </dependency>
            </dependencies>

            Parsing can be further improved by using an error-tolerant XMLReader specified by the user.
            This feature is available for SesameRDFaParser in the 0.5-SNAPSHOT version and
            (if required) can be backported to 0.4.

            ansell Peter Ansell added a comment -

            Sesame-2.7.0-beta2 was released last week, and I submitted a Pull Request [1] just now to Lev to take advantage of the two new features (RDFFormat.RDFA [2] and the new ParserConfig code [3]).

            After the Pull Request is accepted, the alternative XMLReader that Lev refers to will also be able to be set using the ParserConfig. This will avoid having to hardcode a link to the Semargl SesameRDFaParser and will enable us to keep using Rio.createParser. However, the setting that needs to be set will be in SemarglParserSettings, within semargl-sesame, for a little while (a few weeks, definitely before sesame-2.7.0-beta3 is released) until it is stabilised, at which point I would like to transfer it to sesame-rio-api so that there will be no compile time dependencies on semargl-sesame.

            [1] https://github.com/levkhomich/semargl/pull/16
            [2] https://bitbucket.org/openrdf/sesame/pull-request/47/ses-951-add-rdfa-constant-to-rdfformat
            [3] https://bitbucket.org/openrdf/sesame/pull-request/25/ses-1675-make-parserconfig-extensible

            lewismc Lewis John McGibbney made changes -
            Fix Version/s 0.9.0 [ 12319886 ]
            Fix Version/s 0.8.0 [ 12319885 ]
            gmcdonald Gavin McDonald made changes -
            Link This issue depends on ANY23-136 [ ANY23-136 ]
            gmcdonald Gavin McDonald made changes -
            Link This issue depends upon ANY23-136 [ ANY23-136 ]

            lewismc Lewis John McGibbney added a comment -

            I think it would be great to push on with this RDFa proposal and then make an interim release for Any23.
            Is anyone interested in working to get this done? This issue (with a few others) would IMO justify pushing an RC for 0.9.0.
            Any thoughts, please?
            ansell Peter Ansell made changes -
            Assignee Peter Ansell [ p_ansell ]
            ansell Peter Ansell made changes -
            Status Open [ 1 ] In Progress [ 3 ]

            scor Stephane Corlosquet added a comment -

            Is there still any glue code that needs to be written to achieve this, or is it ready for testing? Are there any instructions somewhere on how to build Any23 with this patch so I can test it?
            ansell Peter Ansell made changes -
            Status In Progress [ 3 ] Open [ 1 ]
            ansell Peter Ansell added a comment -

            I am waiting on Lev to either deploy Semargl-0.5 to Maven Central, or release and deploy a 0.6 release. The 0.4 release does not integrate easily into Sesame-2.7.5 as mentioned above. This was fixed in 0.5 so either solution is fine with me as far as I remember.

            https://github.com/levkhomich/semargl/issues/29

            ansell Peter Ansell added a comment -

            The glue code should be trivial, as Semargl-Sesame provides an RDFParserFactory with a matching META-INF/services file, so it hooks in the same as the other standard Sesame Rio parsers.


            lewismc Lewis John McGibbney added a comment -

            Great, Peter. I will work on some other issues in an attempt to bring them into the 0.8.1 release.
            ansell Peter Ansell added a comment -

            Lev released 0.6 over the weekend and I updated the RDFa parser factories in Any23 to use it (via RDFFormat.RDFA).

            There are some unit tests that are failing, so I haven't committed it to the master branch yet. Some are failing due to well-formedness exceptions, which may be that Semargl is more strict than our previous tag soup parser. One of them that I am interested in seems to be failing due to an error extracting CURIEs and mapping them to Sesame:

            RDFa11ExtractorTest>AbstractRDFaExtractorTestCase.testRDFa11CURIEs:77->AbstractExtractorTestCase.assertContains:244 Assertion failed! Extracted triples:
            <http://dbpedia.org/resource/Albert_Einstein> <http://dbpedia.org/name> "Albert Einstein" ;
            <http://dbpedia.org/knows> <http://dbpedia.org/resource/Franklin_Roosevlet> .

            <db:table/Departments> <db:description> "Tables listing departments" ;
            <http://xmlns.com/foaf/0.1/author> <db:people/Davide_Palmisano> ;
            <http://purl.org/dc/terms/name> "Departments" .
            Cannot find triple (http://database.org/table/Departments http://database.org/description "Tables listing departments")

            That error message seems to indicate that the internal Sesame repository did not receive the namespace declaration to map "db:" to "http://database.org/". That will need to be tested at the Semargl end of things, however, it may also be an error on our end if we are using a custom RDFHandler that doesn't react properly to RDFHandler.handleNamespace.

            The branch, named ANY23-137, with the parser factory conversion is available in the Apache Git repository and in my GitHub repository if you prefer to fetch it from there.
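The namespace hypothesis in this comment can be illustrated in isolation. The following is a minimal sketch, not code from Any23, Semargl, or Sesame: `CurieSketch` and `expand` are hypothetical names, and the logic is a simplified stand-in for real CURIE resolution. It only shows why a missing `db:` mapping would leave `db:description` unexpanded, as seen in the extracted triples.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Any23, Semargl, or Sesame. */
final class CurieSketch {

    /**
     * Expands "prefix:reference" against the given prefix map.
     * When the prefix is unknown, the input is returned unchanged, which
     * mirrors how a value such as "db:description" could survive as-is.
     */
    static String expand(String curie, Map<String, String> prefixes) {
        int colon = curie.indexOf(':');
        if (colon < 0) {
            return curie; // no prefix part at all
        }
        String namespace = prefixes.get(curie.substring(0, colon));
        return namespace == null ? curie : namespace + curie.substring(colon + 1);
    }

    public static void main(String[] args) {
        Map<String, String> prefixes = new LinkedHashMap<>();

        // Namespace declaration never received: the CURIE stays unexpanded.
        System.out.println(expand("db:description", prefixes));
        // prints "db:description"

        // Namespace declaration received: the CURIE resolves to the expected IRI.
        prefixes.put("db", "http://database.org/");
        System.out.println(expand("db:description", prefixes));
        // prints "http://database.org/description"
    }
}
```

Whether the mapping is dropped inside Semargl or by a handler on the Any23 side is exactly what the comment leaves open; the sketch does not settle that.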

            lewismc Lewis John McGibbney made changes -
            Fix Version/s 1.0.0 [ 12319887 ]
            Fix Version/s 0.9.0 [ 12319886 ]
            lewismc Lewis John McGibbney made changes -
            Fix Version/s 0.9.0 [ 12319886 ]
            Fix Version/s 1.0.0 [ 12319887 ]

            lewismc Lewis John McGibbney added a comment -

            Hi Peter. I can confirm everything you've said above. Strangely, when I work with the ANY23-137 branch and run 'mvn test' locally, it all seems to work fine and no tests fail.
            However, our nightly build was picking up the ANY23-137 branch and building it (and failing), which matches what you describe.
            I think that the well-formedness exceptions can be easily solved by removing the preceding <!doctype html> (or maybe it is the license headers) from the test-resources HTML files.
            We still need to look into the remaining failures, so I'll make this my priority. It would be great to get this into 0.9.0.

            [0] http://s.apache.org/5Ua
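The `<!doctype html>` suspicion above can be checked outside Any23. The sketch below assumes the strictness comes from an ordinary XML parser; the thread does not confirm which XMLReader Semargl uses by default, so this uses the JDK's built-in SAX parser as a stand-in. `WellFormednessSketch` and `isWellFormedXml` are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper for illustration; uses only the JDK's SAX parser. */
final class WellFormednessSketch {

    /** Returns true if a strict XML parser accepts the markup, false if it reports a parse error. */
    static boolean isWellFormedXml(String markup) {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // fatal well-formedness error
        } catch (Exception e) {
            // ParserConfigurationException or IOException: environment problem, not a markup verdict
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // XML keywords are case-sensitive, so a lowercase doctype is rejected.
        System.out.println(isWellFormedXml(
                "<!doctype html><html><body><p>hi</p></body></html>"));
        // prints "false"

        // The same markup without the doctype parses cleanly.
        System.out.println(isWellFormedXml(
                "<html><body><p>hi</p></body></html>"));
        // prints "true"
    }
}
```

This only demonstrates that a lowercase doctype is enough to trip a strict parser; it says nothing about the license-header alternative mentioned in the comment.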
            lewismc Lewis John McGibbney made changes -
            Fix Version/s 1.0.0 [ 12319887 ]
            Fix Version/s 0.9.0 [ 12319886 ]

            scor Stephane Corlosquet added a comment -

            What's the status of this issue to merge Semargl into Any23? Is there anything I can test to make progress?

            scor Stephane Corlosquet added a comment -

            I cloned https://github.com/apache/any23/, checked out branch ANY23-137, and rebased on master. I ran mvn clean install, and got:

            [INFO] Apache Any23 ...................................... SUCCESS [12.224s]
            [INFO] Apache Any23 :: Base API .......................... SUCCESS [2.735s]
            [INFO] Apache Any23 :: Test Resources .................... SUCCESS [1.953s]
            [INFO] Apache Any23 :: NQuads Parser and Writer .......... SUCCESS [2.180s]
            [INFO] Apache Any23 :: CSV Utilities ..................... SUCCESS [0.478s]
            [INFO] Apache Any23 :: Mime Type Detection ............... SUCCESS [3.126s]
            [INFO] Apache Any23 :: Encoding Detection ................ SUCCESS [1.310s]
            [INFO] Apache Any23 :: Core .............................. FAILURE [17.546s]
            [INFO] Apache Any23 :: Plugins :: Basic Crawler .......... SKIPPED
            [INFO] Apache Any23 :: Plugins :: HTML Scraper ........... SKIPPED
            [INFO] Apache Any23 :: Plugins :: Office Scraper ......... SKIPPED
            [INFO] Apache Any23 :: Plugins :: Integration Test ....... SKIPPED
            [INFO] Apache Any23 :: Service ........................... SKIPPED
            ...
            [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.16:test (default-test) on project apache-any23-core: There are test failures.

            Running mvn clean install on master works well locally.

            lewismc Lewis John McGibbney added a comment -

            Hi scor, AFAIK this issue is not far from being suitable for integration into the codebase. There is an ANY23-137 branch which you can check out and work on. As p_ansell noted above, there are some test failures which need to be addressed before we can merge into trunk. If you are able to work on some of these, it would be excellent.
            levkhomich Lev Khomich added a comment -

            Hi!
            I've prepared a pull request to fix the RDFa-related tests. See https://github.com/apache/any23/pull/2 . Is there anything else to do with this branch?


            scor Stephane Corlosquet added a comment -

            levkhomich I've tried your branch, and running "mvn clean install" is still giving me some errors (though they are different this time):

            Failed tests: 
              Any23Test.testExtractionParameters:347 Unexpected number of triples. expected:<6> but was:<9>
              RoverTest.testRunMultiURLs:103->runWithMultiSourcesAndVerify:118 Unexpected exit code. expected:<0> but was:<1>
              RoverTest.testRunMultiFiles:64->runWithMultiSourcesAndVerify:118 Unexpected exit code. expected:<0> but was:<1>

            Tests in error: 
              Any23Test.testImplicitEncoding:135->assertEncodingDetection:621 ? Extraction E...
              Any23Test.testMicrodataSupport:480->assertExtractorActivation:586->detectAndExtract:555 ? Extraction
              Any23Test.testExplicitEncoding:118->assertEncodingDetection:621 ? Extraction E...
              Any23Test.testProgrammaticExtraction:279 ? Extraction Error while parsing RDF ...

            Tests run: 389, Failures: 3, Errors: 4, Skipped: 9
            ...
            [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.16:test (default-test) on project apache-any23-core: There are test failures.

            scor Stephane Corlosquet added a comment -

            I should also add that I'm running all the above on Mac OS X 10.8.5 with:

            $ java -version
            java version "1.6.0_65"
            Java(TM) SE Runtime Environment (build 1.6.0_65-b14-462-11M4609)
            Java HotSpot(TM) 64-Bit Server VM (build 20.65-b04-462, mixed mode)

            and

            $ mvn --version
            Apache Maven 3.0.3 (r1075438; 2011-02-28 12:31:09-0500)
            Maven home: /usr/share/maven
            Java version: 1.6.0_65, vendor: Apple Inc.
            Java home: /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
            Default locale: en_US, platform encoding: MacRoman
            OS name: "mac os x", version: "10.8.5", arch: "x86_64", family: "mac"

            scor Stephane Corlosquet added a comment -

            Success. I tested https://github.com/apache/any23/pull/2 on Ubuntu 13.04 and it worked with Java 1.7.0_51 and mvn 3.0.4! Not sure why it's not working on Mac OS X. Do I need Java 7?

            levkhomich Lev Khomich added a comment - - edited

            Thanks, Stephane!

            I completely missed that RDFa was used as part of the extraction process in other tests.
            I've added the related fixes.

            Brief description.

            ServletTest
            The old RDFa implementation produces
            <issue level="Warning" row="14" col="5">Error while processing node /HTML(1)/HEAD(1)/META(9) : 'Cannot map prefix 'fb''</issue>
            while <fb:app_id> is a completely valid predicate which shouldn't be resolved against the fb: prefix.

            Any23Test
            RoverTest
            Changed RDFXMLWriter to NTriplesWriter in some tests to improve precision (they basically check line count).
            Changed the expected triple counts. They were reduced in most cases because the old RDFa parser produced a lot of invalid triples like:

            <http://host.com/service> <http://host.com/serviceexternal> <http://host.com/service/ambiente/> .
            <http://host.com/service> <http://host.com/serviceexternal> <http://host.com/service/salute/> .
            <http://host.com/service> <http://host.com/serviceexternal> <http://host.com/service/legalita/> .
            <http://host.com/service> <http://host.com/serviceexternal> <http://www.ansamed.info/> .
            <http://host.com/service> <http://host.com/serviceexternal> <http://host.com/service/web/notizie/regioni/lazio/provinciadiroma/> .

            Fixed markup in test-resources/src/test/resources/html/rdfa/ansa_2010-02-26_12645863.html to conform to the declared XHTML 1.0 Strict.
            Fixed RDFa markup in test-resources/src/test/resources/html/encoding-test.html; otherwise it shouldn't produce any triples.
            Disabled the second part of Any23Test.testExtractionParameters. Should it do anything after RDFa parser replacement?

            Also, ExtractionException thrown from BaseRDFExtractor is escalated in test suite. It leads to some failed tests in Any23Test. What's the correct behaviour for ANY23 parser in case it gets SAXException?
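
            As a rough illustration of why N-Triples output makes the count checks mentioned above more precise: N-Triples puts exactly one triple on each line, so the non-blank, non-comment lines can simply be counted. The sketch below uses only the JDK. The class name, method name, and sample predicate URL are hypothetical and are not part of Any23 or its test suite.

```java
public class NTriplesLineCount {

    // Counts statements in N-Triples text: one triple per line,
    // skipping blank lines and '#' comment lines.
    public static int countTriples(String nTriples) {
        int count = 0;
        for (String line : nTriples.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !trimmed.startsWith("#")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical sample output, for illustration only.
        String output =
            "<http://host.com/service> <http://host.com/p> <http://host.com/a> .\n"
          + "# a comment line\n"
          + "\n"
          + "<http://host.com/service> <http://host.com/p> <http://host.com/b> .\n";
        System.out.println(countTriples(output)); // prints 2
    }
}
```

            With RDF/XML there is no such one-to-one mapping between lines and triples, which is presumably why a line-count check against RDFXMLWriter output was less precise.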

            levkhomich Lev Khomich made changes -
            Description
            Original Value:
            As a follow up to discussion [1].

            I've implemented another RDFa extractor for Any23 (0.7.1).
            Proposed code depends on semargl project [2]. It isn't published in maven
            central, therefore I didn't change any poms.
            Still not quite sure about class name (because related ones are already taken),
            feel free to rename it. See attachments for patch with extractor and tests.

            [1] http://mail-archives.apache.org/mod_mbox/any23-dev/201212.mbox/browser
            [2] http://semarglproject.org
            New Value:
            As a follow up to discussion [1].

            I've implemented another RDFa extractor for Any23 (0.7.1).
            Proposed code depends on semargl project [2].
            Pull request located at [3].

            [1] http://mail-archives.apache.org/mod_mbox/any23-dev/201212.mbox/browser
            [2] http://semarglproject.org
            [3] https://github.com/apache/any23/pull/2

            lewismc Lewis John McGibbney added a comment -

            Hi Lev, I'll check this tomorrow and hopefully we can get it into the codebase shortly. Thanks folks for keeping this issue alive and kicking.

            lewismc Lewis John McGibbney added a comment -

            I get all tests passing successfully.
            levkhomich, I'll check out both of your final comments tomorrow, i.e.

            Should it do anything after RDFa parser replacement?

            and

            What's the correct behaviour for ANY23 parser in case it gets SAXException?

            Great work folks

            lewismc Lewis John McGibbney added a comment - - edited

            Regarding the 1st question above. It all looks good. The changes in

            Any23Test.testExtractionParameters

            look to be only aesthetic reformatting as opposed to functional changes.

            I do not think that there is any standard for catching SAXException. In the past (ANY23-115), for example, when we discovered that empty spans break extraction of some documents, we decided to simply replace empty spans with the String "null". This way the entire page parse and extraction is not lost or failed. I would be supportive of a similar measure when we encounter a SAXException as well, i.e. deal with it but do not fail the entire parse job.
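
            The "deal with it but do not fail the entire parse job" idea could look roughly like the sketch below: a SAXException is caught, recorded as an issue, and the caller carries on with other documents or extractors. This uses only the JDK's SAX parser. The class, field, and method names are hypothetical and this is not how Any23's BaseRDFExtractor is actually written.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class TolerantParseSketch {

    // Collected warnings, standing in for an extraction issue report.
    public final List<String> issues = new ArrayList<String>();

    // Returns true if the document parsed. On SAXException the problem is
    // recorded as an issue and false is returned instead of throwing.
    public boolean tryParse(String name, String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            issues.add(name + ": " + e.getMessage());
            return false;
        } catch (ParserConfigurationException | IOException e) {
            // Not a content problem: treat as a genuine failure.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        TolerantParseSketch sketch = new TolerantParseSketch();
        sketch.tryParse("good", "<a><b/></a>");
        sketch.tryParse("bad", "<a><b></a>"); // malformed: recorded, not thrown
        System.out.println(sketch.issues.size() + " issue(s) recorded");
    }
}
```

            Whether a recorded issue is enough, or whether some callers still need the exception, is a design decision for the project. The sketch only shows the non-failing variant.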


            lewismc Lewis John McGibbney added a comment -

            commit c224e2658e6ac7eb1e9a3066dc0a24aeb9e5457f
            Merge: 7934f79 4ce8814
            Author: Lewis John McGibbney <lewis.j.mcgibbney@jpl.nasa.gov>
            Date: Thu May 8 18:59:33 2014 -0700

            ANY23-137 RDFa parser implementation proposal

            hudson Hudson added a comment -

            SUCCESS: Integrated in Any23-trunk #991 (See https://builds.apache.org/job/Any23-trunk/991/)
            ANY23-137 : Initial replacement of Any23 RDFA with Semargl (p_ansell: rev 9f60d3252fbd39cd6ea7670b43deeff0045d2b18)

            • pom.xml
            • core/src/test/java/org/apache/any23/extractor/rdfa/XSLTStylesheetTest.java
            • core/src/main/java/org/apache/any23/extractor/rdf/RDFParserFactory.java
            • core/pom.xml
            • core/src/main/java/org/apache/any23/filter/IgnoreAccidentalRDFa.java
            • core/src/main/java/org/apache/any23/extractor/rdfa/RDFaExtractor.java
            • core/src/test/java/org/apache/any23/Any23Test.java
            • core/src/main/java/org/apache/any23/extractor/rdfa/RDFa11Extractor.java

            ANY23-137 : Fix other test files that use incorrect prefix syntax (p_ansell: rev 5d6873ccabbf4e4666d7ae204dd18cef9df4a535)

            • test-resources/src/test/resources/html/rdfa/goodrelations-rdfa11.html
            • test-resources/src/test/resources/html/rdfa/rel-href.html
            lewismc Lewis John McGibbney made changes -
            Resolution Fixed [ 1 ]
            Status Open [ 1 ] Resolved [ 5 ]
            lewismc Lewis John McGibbney made changes -
            Status Resolved [ 5 ] Closed [ 6 ]

            People

              Assignee: ansell Peter Ansell
              Reporter: levkhomich Lev Khomich
              Votes: 1
              Watchers: 5
