Details
- Bug
- Status: Resolved
- Major
- Resolution: Won't Fix
- 1.11
- None
- None
Description
XHTMLContentHandler appears to contain logic that closes the head section too early, or the problem lies in TagSoup. In this case [1], a <div> element appears inside the head, which causes the head to be closed. Head elements that follow the <div> never reach custom ContentHandlers, so I cannot read the document's title or any of the other meta tags.
It can be worked around by passing a custom HTMLSchema in the ParseContext, e.g. schema.elementType("div", HTMLSchema.M_EMPTY, 65535, 0); but this isn't really an elegant solution.
[1] http://www.aljazeera.com/news/2015/05/150516182251747.html
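To make the symptom above concrete, here is a minimal sketch of the kind of custom ContentHandler that is affected. It is an illustration only: the class name HeadMetadataHandler and the extract helper are hypothetical, and the JDK's own SAX parser on well-formed XHTML stands in for Tika and TagSoup, so it does not reproduce the bug itself. It only shows that a handler which collects the title and meta tags while inside the head has nothing to collect once the head has been closed.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records <title> and <meta name="..." content="..."/>
// elements, but only while the parser is inside <head>.
class HeadMetadataHandler extends DefaultHandler {
    private final Map<String, String> metadata = new LinkedHashMap<>();
    private final StringBuilder titleText = new StringBuilder();
    private boolean inHead;
    private boolean inTitle;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("head".equals(qName)) {
            inHead = true;
        } else if (inHead && "title".equals(qName)) {
            inTitle = true;
        } else if (inHead && "meta".equals(qName)) {
            String name = atts.getValue("name");
            String content = atts.getValue("content");
            if (name != null && content != null) {
                metadata.put(name, content);
            }
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inTitle) {
            titleText.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (inTitle && "title".equals(qName)) {
            inTitle = false;
            metadata.put("title", titleText.toString().trim());
        } else if ("head".equals(qName)) {
            // Anything arriving after this point is no longer treated as head content.
            inHead = false;
        }
    }

    Map<String, String> getMetadata() {
        return metadata;
    }

    // Assumption: input is well-formed XHTML without a DOCTYPE, parsed by the
    // JDK SAX parser (not namespace-aware, so qName carries the element name).
    static Map<String, String> extract(String xhtml) {
        try {
            HeadMetadataHandler handler = new HeadMetadataHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getMetadata();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }
}
```

Calling HeadMetadataHandler.extract on a document whose title and meta tags sit inside the head returns them in a map. If the head is already closed when those elements arrive, the map stays empty, which is the effect described in this report.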
Issue Links
- is superseded by TIKA-1599 Switch from TagSoup to JSoup (Resolved)