[ANY23-340] Any23 extraction does not pass Nutch plugin test - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.2
Fix Version/s: 2.3
Component/s: extractors
Labels: None

Description

When the SAX parsing filter is removed from the Nutch Any23 plugin, the plugin's test case fails.

Cf. this pull request: https://github.com/apache/nutch/pull/306

There are two test files: (1) microdata_basic.html, and (2) BBC_News_Scotland.html.

For (1), the test case expects 39 triples to be extracted. With the SAX pre-filter, 39 triples are extracted. Without the SAX pre-filter, only 38 triples are extracted.

The bad news is, BOTH OF THESE NUMBERS ARE WRONG. 40 triples should be extracted.

Without the SAX pre-filter, the html-microdata extractor loses 2 triples to ~~ANY23-339~~, bringing the total to 38.

With the SAX pre-filter, it sees the meta element in the following code:

<span itemscope><meta itemprop="name" content="The Castle"></span>

And tries to wrap it in a head element:

<span itemscope="itemscope"></span>
</body><head><meta itemprop="name" content="The Castle"></meta></head><body>

Which the Jsoup pre-filter then throws out, as it should:

<span itemscope="itemscope"></span>
<meta itemprop="name" content="The Castle" />

This leaves us with an item property that is not wrapped in an itemscope (-2 triples, although those would be lost anyway due to ~~ANY23-339~~) and an EMPTY itemscope (+1 triple), bringing the total to 40 - 2 + 1 = 39.
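To make the structural difference concrete, here is a minimal sketch in Python (standard library only, not Any23 or Jsoup code) that reports, for each itemprop element, whether it still has an itemscope ancestor. The class and function names are made up for illustration, and the check is a deliberate simplification of the microdata algorithm.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class ItemscopeChecker(HTMLParser):
    """Hypothetical helper: records whether each itemprop has an itemscope ancestor."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # stack of (tag, has_itemscope)
        self.results = []        # list of (itemprop value, inside_itemscope)

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if "itemprop" in attr_map:
            # Only ancestors count; the element's own itemscope does not.
            inside = any(scoped for _, scoped in self.open_elements)
            self.results.append((attr_map["itemprop"], inside))
        if tag not in VOID_ELEMENTS:
            self.open_elements.append((tag, "itemscope" in attr_map))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        for i in range(len(self.open_elements) - 1, -1, -1):
            if self.open_elements[i][0] == tag:
                del self.open_elements[i:]
                break


def itemprops_with_scope(html):
    checker = ItemscopeChecker()
    checker.feed(html)
    checker.close()
    return checker.results


# Markup from the test file, and the markup left after both pre-filters.
original = '<span itemscope><meta itemprop="name" content="The Castle"></span>'
filtered = ('<span itemscope="itemscope"></span>\n'
            '<meta itemprop="name" content="The Castle" />')

print(itemprops_with_scope(original))  # [('name', True)]
print(itemprops_with_scope(filtered))  # [('name', False)]
```

In the filtered markup the name property is orphaned and the span is an empty item, which matches the -2 and +1 adjustments described above.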

Extraction fails for (2) because a total of 11 triples are not extracted, all of which have a predicate IRI equal to "http://www.w3.org/1999/xhtml/vocab#role".

Of those 11 triples, 1 triple has the object IRI "http://www.w3.org/1999/xhtml/vocab#navigation", 1 triple has the object IRI "http://www.w3.org/1999/xhtml/vocab#search", 1 triple has the object IRI "http://www.w3.org/1999/xhtml/vocab#contentinfo", and 8 triples have the object IRI "http://www.w3.org/1999/xhtml/vocab#presentation".

All of these triples are being overlooked by the html-rdfa11 extractor.
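A breakdown like the one above can be produced by diffing the expected and extracted triples and grouping the missing ones by object. The following is only a sketch of that bookkeeping: the function name, the blank-node subjects, and the example.org triple are hypothetical placeholders, and only the predicate, the object IRIs, and the counts come from this report.

```python
from collections import Counter

VOCAB = "http://www.w3.org/1999/xhtml/vocab#"
ROLE = VOCAB + "role"


def missing_by_object(expected, extracted, predicate):
    """Count expected-but-not-extracted triples with this predicate, per object."""
    missing = set(expected) - set(extracted)
    return Counter(o for (_, p, o) in missing if p == predicate)


# Hypothetical subjects; the real subjects are not given in this report.
role_triples = [
    ("_:nav", ROLE, VOCAB + "navigation"),
    ("_:search", ROLE, VOCAB + "search"),
    ("_:footer", ROLE, VOCAB + "contentinfo"),
] + [("_:p%d" % i, ROLE, VOCAB + "presentation") for i in range(8)]

# One placeholder triple that both sides agree on.
common = [("_:doc", "http://example.org/p", "http://example.org/o")]

expected = common + role_triples
extracted = common  # the role triples are what goes missing

counts = missing_by_object(expected, extracted, ROLE)
for obj, n in sorted(counts.items()):
    print(n, obj)
print("total missing:", sum(counts.values()))
```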

The reason they are being overlooked is, apparently, that the document's doctype declaration specifies XHTML+RDFa version 1.0:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">

When I either change the document type to XHTML+RDFa version 1.1:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.1//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">

or remove the doctype altogether, all 11 triples are extracted as expected.

So, this would be easily fixed just by removing doctypes from all documents.
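For illustration, the doctype check and the doctype-stripping workaround could look roughly like this. This is a sketch under stated assumptions, not how Any23 handles it: the function names are hypothetical, and the regular expression assumes a doctype without an internal subset (that is, no ">" before the closing bracket).

```python
import re

# Assumes a simple doctype declaration with no internal subset.
DOCTYPE_RE = re.compile(r"<!DOCTYPE[^>]*>", re.IGNORECASE)
RDFA_VERSION_RE = re.compile(r"XHTML\+RDFa\s+(\d+\.\d+)", re.IGNORECASE)


def rdfa_doctype_version(html):
    """Return the RDFa version named in the doctype, or None if there is none."""
    doctype = DOCTYPE_RE.search(html)
    if doctype is None:
        return None
    version = RDFA_VERSION_RE.search(doctype.group(0))
    return version.group(1) if version else None


def strip_doctype(html):
    """Remove the first doctype declaration and any whitespace it leaves behind."""
    return DOCTYPE_RE.sub("", html, count=1).lstrip()


page = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" '
        '"http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">\n'
        '<html></html>')

print(rdfa_doctype_version(page))                 # 1.0
print(strip_doctype(page))                        # <html></html>
print(rdfa_doctype_version(strip_doctype(page)))  # None
```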

Comments or insight anyone?

Question: does anyone know whether the RDFa 1.0 triples extracted from a page are guaranteed to be a subset of the RDFa 1.1 triples extracted from the same page?

Issue Links

is blocked by: ANY23-339 Microdata extractor can sometime merge two different itemscopes into one (Resolved)
links to: GitHub Pull Request #68

People

Assignee: Hans Brende
Reporter: Hans Brende
Votes: 0
Watchers: 4

Dates

Created: 30/Mar/18 06:13
Updated: 02/Apr/18 18:28
Resolved: 02/Apr/18 17:27