Description
As a follow-up to the discussion [1], I've implemented another RDFa extractor for Any23 (0.7.1).
The proposed code depends on the Semargl project [2].
The pull request is located at [3].
[1] http://mail-archives.apache.org/mod_mbox/any23-dev/201212.mbox/browser
[2] http://semarglproject.org
[3] https://github.com/apache/any23/pull/2
Attachments
- oQYfomKX.part (4 kB, Lewis John McGibbney)
- rdfa-extractor-proposal.patch (16 kB, Lev Khomich)
Issue Links
- depends upon
  - ANY23-136 Some RDFa tests have incorrect expected results (Closed)
- is related to
  - ANY23-100 Issue with RDFa extractor while processing nested properties (Closed)
  - ANY23-135 Any23 RDFa Extractor ignores multiple prefix and property statements (Closed)
  - ANY23-65 Update to RDFa extraction stylesheet (Closed)
  - ANY23-128 html-rdfa11 extractor fails on mailto: anchors (Closed)
  - ANY23-69 Create Any23 EARL report for RDFa 1.1 Processor Conformance test suite (Open)
Activity
Hi Lev. You've marked this issue as being related to a number of other issues. Can you please provide info on how many of these have been addressed within the scope of your proposal? Thank you.
Hi Lewis. It should fix all related issues and failed test cases at [1].
Semargl v0.4 will be released in 1-2 weeks and will be available on Maven Central, so I can then update the proposal for further review.
Thanks. In the meantime I'll add the temporary solution to test this one out. Thank you.
Hi Lev,
I've also come across another issue with the existing html-rdfa11 Extractor implementation and have attached the file.
For reference, here is the log report and output.
<response><extractors><extractor>html-head-title</extractor><extractor>html-mf-hcard</extractor><extractor>html-mf-adr</extractor><extractor>html-rdfa11</extractor></extractors><report><message/><error/><issueReport><extractorIssues extractor="html-rdfa11"><issue level="Warning" row="202" col="30">Error while processing node [/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[1]/DIV[2]/DIV[1]/DIV[1]/P[1]/SPAN[1]/A[1]] : 'Cannot map prefix 'width''</issue><issue level="Warning" row="204" col="30">Error while processing node [/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[1]/DIV[2]/DIV[1]/DIV[1]/P[2]/SPAN[1]/A[1]] : 'Cannot map prefix 'width''</issue><issue level="Warning" row="208" col="30">Error while processing node [/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[1]/DIV[2]/DIV[1]/DIV[2]/P[1]/SPAN[1]/A[1]] : 'Cannot map prefix 'width''</issue></extractorIssues></issueReport><validationReport><errors> </errors><ruleActivations> </ruleActivations><issues> </issues></validationReport></report><data> # OUTPUT FORMAT: Turtle (mimeTypes=text/turtle, application/x-turtle; ext=ttl) # BEGIN: ExtractionContext(urn:x-any23:html-mf-adr:root-extraction-result-id:http://stanford.edu/) @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . @prefix vcard: <http://www.w3.org/2006/vcard/ns#> . # BEGIN: ExtractionContext(urn:x-any23:html-mf-adr:1:http://stanford.edu/) # BEGIN: ExtractionContext(urn:x-any23:html-head-title:root-extraction-result-id:http://stanford.edu/) @prefix dcterms: <http://purl.org/dc/terms/> . <http://stanford.edu/> dcterms:title "Stanford University"@en . _:noded01df813432682e65b842257f3757e9 a vcard:Address ; vcard:locality "450 Serra Mall, Stanford" ; vcard:region "CA" ; vcard:postal-code "94305" . # BEGIN: ExtractionContext(urn:x-any23:html-mf-hcard:root-extraction-result-id:http://stanford.edu/) _:node68324ba1f68fb1712ae267fe33274 vcard:fn "Stanford University" ; vcard:n _:node17eprgndbx338343 . 
_:node17eprgndbx338343 a vcard:Name ; vcard:given-name "Stanford" ; vcard:family-name "University" . _:node68324ba1f68fb1712ae267fe33274 vcard:org _:node17eprgndbx338344 . _:node17eprgndbx338344 a vcard:Organization ; vcard:organization-name "Stanford University" . _:node68324ba1f68fb1712ae267fe33274 vcard:adr _:noded01df813432682e65b842257f3757e9 ; vcard:tel <tel:(650)%20723-2300> . # BEGIN: ExtractionContext(urn:x-any23:html-mf-hcard:1:http://stanford.edu/) _:node68324ba1f68fb1712ae267fe33274 a vcard:VCard . # BEGIN: ExtractionContext(urn:x-any23:html-rdfa11:root-extraction-result-id:http://stanford.edu/) <http://stanford.edu/> <http://stanford.edu/alternate> <http://news.stanford.edu/rss/index.xml> . <http://stanford.edu/css/layout.css?v=3.0> <http://stanford.edu/stylesheet> <http://news.stanford.edu/rss/index.xml> . <http://stanford.edu/css/homepage.css?v=3.1> <http://stanford.edu/stylesheet> <http://news.stanford.edu/rss/index.xml> . <http://stanford.edu/css/jquery.fancybox.css?v=2.0.5> <http://stanford.edu/stylesheet> <http://news.stanford.edu/rss/index.xml> . <http://stanford.edu/css/mobile.css> <http://stanford.edu/stylesheet> <http://news.stanford.edu/rss/index.xml> . <https://fonts.googleapis.com/css?family=Source+Sans+Pro:300,400,600,700> <http://stanford.edu/stylesheet> <http://news.stanford.edu/rss/index.xml> . <https://fonts.googleapis.com/css?family=Crimson+Text:400,600,700> <http://stanford.edu/stylesheet> <http://news.stanford.edu/rss/index.xml> . 
# END: ExtractionContext(urn:x-any23:html-mf-adr:1:http://stanford.edu/) # END: ExtractionContext(urn:x-any23:html-mf-adr:root-extraction-result-id:http://stanford.edu/) # END: ExtractionContext(urn:x-any23:html-rdfa11:root-extraction-result-id:http://stanford.edu/) # END: ExtractionContext(urn:x-any23:html-head-title:root-extraction-result-id:http://stanford.edu/) # END: ExtractionContext(urn:x-any23:html-mf-hcard:root-extraction-result-id:http://stanford.edu/) # END: ExtractionContext(urn:x-any23:html-mf-hcard:1:http://stanford.edu/) </data></response>
Attachment for potential additional bug (within current RDFa1.1) to be considered
Semargl 0.4 was published to Maven Central.
Therefore, the dependency now looks like this:
<dependencies>
  <dependency>
    <groupId>org.semarglproject</groupId>
    <artifactId>semargl-sesame</artifactId>
    <version>0.4</version>
  </dependency>
</dependencies>
Parsing can be further improved by using an error-tolerant XMLReader specified by the user.
This feature is available for SesameRDFaParser in the 0.5-SNAPSHOT version and
(if required) can be backported to 0.4.
Sesame-2.7.0-beta2 was released last week, and I submitted a Pull Request [1] just now to Lev to take advantage of the two new features (RDFFormat.RDFA [2] and the new ParserConfig code [3]).
After the Pull Request is accepted, the alternative XMLReader that Lev refers to will also be able to be set using the ParserConfig. This will avoid having to hardcode a link to the Semargl SesameRDFaParser and will enable us to keep using Rio.createParser. However, the setting that needs to be set will be in SemarglParserSettings, within semargl-sesame, for a little while (a few weeks, definitely before sesame-2.7.0-beta3 is released) until it is stabilised, at which point I would like to transfer it to sesame-rio-api so that there will be no compile time dependencies on semargl-sesame.
[1] https://github.com/levkhomich/semargl/pull/16
[2] https://bitbucket.org/openrdf/sesame/pull-request/47/ses-951-add-rdfa-constant-to-rdfformat
[3] https://bitbucket.org/openrdf/sesame/pull-request/25/ses-1675-make-parserconfig-extensible
I think it would be great to push on with this RDFa proposal then make an interim release for Any23.
Is anyone interested in working to get this done? This issue (with a few others) would IMO justify pushing an RC for 0.9.0.
Any thoughts, please?
Is there still any glue code that needs to be written to achieve this, or is it ready for testing? Are there any instructions somewhere on how to build Any23 with this patch so I can test it?
I am waiting on Lev to either deploy Semargl-0.5 to Maven Central, or release and deploy a 0.6 release. The 0.4 release does not integrate easily into Sesame-2.7.5 as mentioned above. This was fixed in 0.5 so either solution is fine with me as far as I remember.
The glue code should be trivial, as Semargl-Sesame provides an RDFParserFactory with a matching META-INF/services file, so it hooks in the same as the other standard Sesame Rio parsers.
Great Peter. I will work on some other issues in an attempt to bring them in to the 0.8.1 release.
Lev released 0.6 over the weekend and I updated the RDFa parser factories in Any23 to use it (via RDFFormat.RDFA).
There are some unit tests that are failing, so I haven't committed it to the master branch yet. Some are failing due to well-formedness exceptions, which may be because Semargl is stricter than our previous tag soup parser. One of them that I am interested in seems to be failing due to an error extracting CURIEs and mapping them to Sesame:
RDFa11ExtractorTest>AbstractRDFaExtractorTestCase.testRDFa11CURIEs:77->AbstractExtractorTestCase.assertContains:244 Assertion failed! Extracted triples:
<http://dbpedia.org/resource/Albert_Einstein> <http://dbpedia.org/name> "Albert Einstein" ;
<http://dbpedia.org/knows> <http://dbpedia.org/resource/Franklin_Roosevlet> .
<db:table/Departments> <db:description> "Tables listing departments" ;
<http://xmlns.com/foaf/0.1/author> <db:people/Davide_Palmisano> ;
<http://purl.org/dc/terms/name> "Departments" .
Cannot find triple (http://database.org/table/Departments http://database.org/description "Tables listing departments")
That error message seems to indicate that the internal Sesame repository did not receive the namespace declaration to map "db:" to "http://database.org/". That will need to be tested at the Semargl end of things, however, it may also be an error on our end if we are using a custom RDFHandler that doesn't react properly to RDFHandler.handleNamespace.
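To illustrate the suspected failure mode, here is a minimal sketch in plain Java. CurieExpander is a hypothetical helper, not a class from Any23, Sesame or Semargl; its handleNamespace method only mirrors the role that RDFHandler.handleNamespace plays. It assumes a CURIE is expanded by looking up its prefix and is passed through unchanged when no mapping was ever received, which would match the unexpanded <db:table/Departments> in the output above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Any23, Sesame or Semargl. */
final class CurieExpander {
    private final Map<String, String> namespaces = new LinkedHashMap<>();

    /** Records a prefix mapping, mirroring the role of RDFHandler.handleNamespace. */
    void handleNamespace(String prefix, String uri) {
        namespaces.put(prefix, uri);
    }

    /** Expands prefix:reference when the prefix is known; otherwise returns the input unchanged. */
    String expand(String curie) {
        int colon = curie.indexOf(':');
        if (colon < 0) {
            return curie;
        }
        String namespace = namespaces.get(curie.substring(0, colon));
        return namespace == null ? curie : namespace + curie.substring(colon + 1);
    }

    public static void main(String[] args) {
        CurieExpander withoutDeclaration = new CurieExpander();
        // No mapping for "db" was received, so the CURIE stays unexpanded.
        System.out.println(withoutDeclaration.expand("db:table/Departments"));

        CurieExpander withDeclaration = new CurieExpander();
        withDeclaration.handleNamespace("db", "http://database.org/");
        // With the mapping in place, the expected IRI is produced.
        System.out.println(withDeclaration.expand("db:table/Departments"));
    }
}
```

If a custom handler drops the namespace callback, the first case is what the repository would end up storing.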
The branch, named ANY23-137, with the parser factory conversion is available in the Apache Git repository and in my GitHub repository if you prefer to fetch it from there.
Hi Peter. I can confirm everything you've said above. Strangely when I work with the ANY23-137 branch and do 'mvn test' locally, it all seems to work fine and no tests fail.
Anyway, our nightly build was picking up the ANY23-137 branch and building it (and failing) so I can confirm everything that you've said above.
I think that the well-formedness exceptions can be easily solved by removing the preceding <!doctype html> (or maybe it is the license headers) from test-resources html files.
The remaining stuff we need to look into so I'll make this my priority. Would be great to get this into 0.9.0.
What's the status of this issue to merge semargl into Any23? Is there anything I can test to make progress?
I cloned https://github.com/apache/any23/, checked out branch ANY23-137, and rebased on master. I ran mvn clean install, and got:
[INFO] Apache Any23 ...................................... SUCCESS [12.224s]
[INFO] Apache Any23 :: Base API .......................... SUCCESS [2.735s]
[INFO] Apache Any23 :: Test Resources .................... SUCCESS [1.953s]
[INFO] Apache Any23 :: NQuads Parser and Writer .......... SUCCESS [2.180s]
[INFO] Apache Any23 :: CSV Utilities ..................... SUCCESS [0.478s]
[INFO] Apache Any23 :: Mime Type Detection ............... SUCCESS [3.126s]
[INFO] Apache Any23 :: Encoding Detection ................ SUCCESS [1.310s]
[INFO] Apache Any23 :: Core .............................. FAILURE [17.546s]
[INFO] Apache Any23 :: Plugins :: Basic Crawler .......... SKIPPED
[INFO] Apache Any23 :: Plugins :: HTML Scraper ........... SKIPPED
[INFO] Apache Any23 :: Plugins :: Office Scraper ......... SKIPPED
[INFO] Apache Any23 :: Plugins :: Integration Test ....... SKIPPED
[INFO] Apache Any23 :: Service ........................... SKIPPED
...
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.16:test (default-test) on project apache-any23-core: There are test failures.
mvn clean install on master works well locally.
Hi scor, AFAIK this issue is not far from being suitable for integration into the codebase. There is an ANY23-137 branch which you can check out and work on. As p_ansell noted above, there are some test failures which need to be addressed before we can merge into trunk. If you are able to work on some of these it would be excellent.
Hi!
I've prepared a pull request to fix the RDFa-related tests. See https://github.com/apache/any23/pull/2 . Is there anything else to do with this branch?
levkhomich, I've tried your branch and running "mvn clean install" is still giving me some errors (though they are different this time):
Failed tests:
  Any23Test.testExtractionParameters:347 Unexpected number of triples. expected:<6> but was:<9>
  RoverTest.testRunMultiURLs:103->runWithMultiSourcesAndVerify:118 Unexpected exit code. expected:<0> but was:<1>
  RoverTest.testRunMultiFiles:64->runWithMultiSourcesAndVerify:118 Unexpected exit code. expected:<0> but was:<1>
Tests in error:
  Any23Test.testImplicitEncoding:135->assertEncodingDetection:621 ? Extraction E...
  Any23Test.testMicrodataSupport:480->assertExtractorActivation:586->detectAndExtract:555 ? Extraction
  Any23Test.testExplicitEncoding:118->assertEncodingDetection:621 ? Extraction E...
  Any23Test.testProgrammaticExtraction:279 ? Extraction Error while parsing RDF ...
Tests run: 389, Failures: 3, Errors: 4, Skipped: 9
...
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.16:test (default-test) on project apache-any23-core: There are test failures.
I should also add that I'm running all the above on Mac OS X 10.8.5 with:
$ java -version
java version "1.6.0_65"
Java(TM) SE Runtime Environment (build 1.6.0_65-b14-462-11M4609)
Java HotSpot(TM) 64-Bit Server VM (build 20.65-b04-462, mixed mode)
and
$ mvn --version
Apache Maven 3.0.3 (r1075438; 2011-02-28 12:31:09-0500)
Maven home: /usr/share/maven
Java version: 1.6.0_65, vendor: Apple Inc.
Java home: /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
Default locale: en_US, platform encoding: MacRoman
OS name: "mac os x", version: "10.8.5", arch: "x86_64", family: "mac"
Success. I tested https://github.com/apache/any23/pull/2 on Ubuntu 13.04 and it worked with Java 1.7.0_51 and mvn 3.0.4! I'm not sure why it's not working on Mac OS X. Do I need Java 7?
Thanks, Stephane!
I completely missed that RDFa is used as part of the extraction process in other tests.
I've added the related fixes.
Brief description:
ServletTest
The old RDFa implementation produces
<issue level="Warning" row="14" col="5">Error while processing node /HTML(1)/HEAD(1)/META(9) : 'Cannot map prefix 'fb''</issue>
while <fb:app_id> is a completely valid predicate which shouldn't be resolved against the fb: prefix.
Any23Test
RoverTest
Changed RDFXMLWriter to NTriplesWriter in some tests to improve precision (they basically check line count).
Changed the expected triple counts. They were reduced in most cases, because the old RDFa parser produced a lot of invalid triples like:
<http://host.com/service> <http://host.com/serviceexternal> <http://host.com/service/ambiente/> .
<http://host.com/service> <http://host.com/serviceexternal> <http://host.com/service/salute/> .
<http://host.com/service> <http://host.com/serviceexternal> <http://host.com/service/legalita/> .
<http://host.com/service> <http://host.com/serviceexternal> <http://www.ansamed.info/> .
<http://host.com/service> <http://host.com/serviceexternal> <http://host.com/service/web/notizie/regioni/lazio/provinciadiroma/> .
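The line-count checks mentioned above work because N-Triples puts exactly one statement on each line, whereas RDF/XML output can spread a single statement over several lines. A minimal sketch of such a count in plain Java follows. NTriplesLineCount and countStatements are hypothetical names, not code from the Any23 test suite, and the sketch assumes that blank lines and lines starting with '#' are not statements.

```java
/** Hypothetical helper for illustration only; not taken from the Any23 tests. */
final class NTriplesLineCount {

    /** Counts N-Triples statements by counting non-blank, non-comment lines. */
    static int countStatements(String ntriples) {
        int count = 0;
        for (String line : ntriples.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !trimmed.startsWith("#")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String output =
              "# comment line\n"
            + "<http://host.com/service> <http://host.com/p> <http://host.com/a> .\n"
            + "\n"
            + "<http://host.com/service> <http://host.com/p> <http://host.com/b> .\n";
        System.out.println(countStatements(output));
    }
}
```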
Fixed the markup in test-resources/src/test/resources/html/rdfa/ansa_2010-02-26_12645863.html to conform to the declared XHTML 1.0 Strict.
Fixed the RDFa markup in test-resources/src/test/resources/html/encoding-test.html; otherwise it shouldn't produce any triples.
Disabled the second part of Any23Test.testExtractionParameters. Should it do anything after the RDFa parser replacement?
Also, an ExtractionException thrown from BaseRDFExtractor is escalated in the test suite. This leads to some failed tests in Any23Test. What's the correct behaviour for the Any23 parser in case it gets a SAXException?
Hi Lev, I'll check this tomorrow and hopefully we can get it into the codebase shortly. Thanks folks for keeping this issue alive and kicking.
For me, the tests pass successfully.
levkhomich, I'll check out both of your final questions tomorrow, i.e.
Should it do anything after RDFa parser replacement?
and
What's the correct behaviour for ANY23 parser in case it gets SAXException?
Great work folks
Regarding the 1st question above. It all looks good. The changes in
Any23Test.testExtractionParameters
look to be only aesthetic reformatting as opposed to functional changes.
I do not think that there is any standard for handling a SAXException. In the past (ANY23-115), for example, when we discovered that empty spans break extraction of some documents, we decided to simply replace empty spans with the String "null". This way the entire page parse and extraction is not lost or failed. I would be supportive of a similar measure when we encounter a SAXException as well, i.e. deal with it but do not fail the entire parse job.
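As a rough sketch of the "deal with it but do not fail the entire parse job" idea, the following plain Java example catches the SAXException and records it as an issue instead of propagating it. TolerantParseSketch and parseCollectingIssues are hypothetical names. This is not how Any23's BaseRDFExtractor behaves, only one possible shape of the approach, using the JDK's built-in SAX parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch for illustration only; not Any23 code. */
final class TolerantParseSketch {

    /** Parses the markup and returns collected issues rather than throwing on malformed input. */
    static List<String> parseCollectingIssues(String markup) {
        List<String> issues = new ArrayList<>();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
        } catch (SAXException e) {
            // Malformed input: record a warning and let the caller carry on with other extractors.
            issues.add("Warning: " + e.getMessage());
        } catch (ParserConfigurationException | IOException e) {
            issues.add("Error: " + e);
        }
        return issues;
    }

    public static void main(String[] args) {
        System.out.println(parseCollectingIssues("<a><b/></a>").size()); // well-formed input
        System.out.println(parseCollectingIssues("<a><b></a>").size());  // mismatched tags
    }
}
```

Whether such issues should surface as warnings in the issue report or be logged only is a design choice left open here.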
commit c224e2658e6ac7eb1e9a3066dc0a24aeb9e5457f
Merge: 7934f79 4ce8814
Author: Lewis John McGibbney <lewis.j.mcgibbney@jpl.nasa.gov>
Date: Thu May 8 18:59:33 2014 -0700
ANY23-137 RDFa parser implementation proposal
SUCCESS: Integrated in Any23-trunk #991 (See https://builds.apache.org/job/Any23-trunk/991/)
ANY23-137 : Initial replacement of Any23 RDFA with Semargl (p_ansell: rev 9f60d3252fbd39cd6ea7670b43deeff0045d2b18)
- pom.xml
- core/src/test/java/org/apache/any23/extractor/rdfa/XSLTStylesheetTest.java
- core/src/main/java/org/apache/any23/extractor/rdf/RDFParserFactory.java
- core/pom.xml
- core/src/main/java/org/apache/any23/filter/IgnoreAccidentalRDFa.java
- core/src/main/java/org/apache/any23/extractor/rdfa/RDFaExtractor.java
- core/src/test/java/org/apache/any23/Any23Test.java
- core/src/main/java/org/apache/any23/extractor/rdfa/RDFa11Extractor.java
ANY23-137: Fix other test files that use incorrect prefix syntax (p_ansell: rev 5d6873ccabbf4e4666d7ae204dd18cef9df4a535)
- test-resources/src/test/resources/html/rdfa/goodrelations-rdfa11.html
- test-resources/src/test/resources/html/rdfa/rel-href.html
As a temporary solution:
<repositories>
  <repository>
    <id>Semargl repository</id>
    <url>https://github.com/levkhomich/semargl/raw/master/maven-repo</url>
  </repository>
</repositories>
<dependencies>
  <dependency>
    <groupId>org.semarglproject</groupId>
    <artifactId>semargl-sesame</artifactId>
    <version>0.3</version>
  </dependency>
</dependencies>