Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Version: 1.18
Description
The list of HTML elements from which outlinks are extracted (in DOMContentUtils (parse-html) and DOMContentUtils (parse-tika)) needs to be updated and completed to include elements common in HTML5. Cf. a related question on Stack Overflow about the <object> element.
A (mostly?) up-to-date list of HTML elements could be taken from the extractor in iipc/webarchive-commons.
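To illustrate the kind of change being asked for, here is a minimal sketch of an element-to-attribute table extended with a few HTML5 elements. It is not the actual DOMContentUtils code: the class name `OutlinkElementsSketch`, the `LINK_ATTRS` table, and the `extractLinks` helper are all hypothetical, the element list is a partial assumption rather than the list from webarchive-commons, and the input is parsed with the JDK's XML parser purely to keep the example self-contained.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch, not Nutch's implementation.
class OutlinkElementsSketch {

  // Assumed table: lower-case element name -> attribute that carries the URL.
  static final Map<String, String> LINK_ATTRS = new LinkedHashMap<>();
  static {
    // Long-established link-bearing elements.
    LINK_ATTRS.put("a", "href");
    LINK_ATTRS.put("area", "href");
    LINK_ATTRS.put("link", "href");
    LINK_ATTRS.put("img", "src");
    LINK_ATTRS.put("iframe", "src");
    LINK_ATTRS.put("script", "src");
    // Elements common in HTML5 pages (partial, illustrative selection).
    LINK_ATTRS.put("audio", "src");
    LINK_ATTRS.put("video", "src");
    LINK_ATTRS.put("source", "src");
    LINK_ATTRS.put("track", "src");
    LINK_ATTRS.put("embed", "src");
    LINK_ATTRS.put("object", "data");
  }

  // Parses well-formed markup and returns link targets in document order.
  static List<String> extractLinks(String xhtml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xhtml)));
      List<String> links = new ArrayList<>();
      collect(doc.getDocumentElement(), links);
      return links;
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse input", e);
    }
  }

  // Depth-first walk: record the URL attribute of every known element.
  private static void collect(Node node, List<String> links) {
    if (node.getNodeType() == Node.ELEMENT_NODE) {
      Element el = (Element) node;
      String attr = LINK_ATTRS.get(el.getTagName().toLowerCase(Locale.ROOT));
      if (attr != null && el.hasAttribute(attr)) {
        links.add(el.getAttribute(attr));
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collect(children.item(i), links);
    }
  }
}
```

With a table-driven design like this, supporting a further element would only require one more map entry. Elements that carry several URL attributes (for example a `poster` on `<video>`) would need a list of attributes per element, which this sketch leaves out.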