Details
Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Description
The current DefaultHtmlParser uses a hardcoded list of acceptable HTML elements and attributes. While it's easy for the user to copy and paste this file to create a custom parser, it takes some effort to understand how to make the required changes. It's also a one-off effort: the work can't be reused elsewhere.
Given that there's already a dependency on jsoup, a far better solution is to create a parser that accepts a Safelist instead of using a hardcoded list. This Safelist can be validated and used elsewhere and, perhaps more importantly, it makes the transition from a jsoup-based solution to a Tika-based solution much more transparent.
NOTE: a Safelist is a POJO and NOT limited to just the jsoup parser.
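As a rough illustration of the idea, the sketch below shows a filter that keeps no element list of its own and defers every decision to a safelist object passed in by the caller. To stay dependency-free it uses a hypothetical stand-in, `SimpleSafelist`, whose `addTags` and `addAttributes` methods only mirror the shape of the jsoup methods of the same name. `SafelistElementFilter` is likewise an invented name and is not the proposed parser.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a safelist: a plain object holding the allowed
// tags and, per tag, the allowed attributes. This is NOT the jsoup class.
class SimpleSafelist {
    private final Set<String> tags = new HashSet<>();
    private final Map<String, Set<String>> attributes = new HashMap<>();

    SimpleSafelist addTags(String... names) {
        for (String name : names) {
            tags.add(name.toLowerCase(Locale.ROOT));
        }
        return this;
    }

    // Design choice of this sketch: allowing an attribute also allows its tag.
    SimpleSafelist addAttributes(String tag, String... names) {
        String key = tag.toLowerCase(Locale.ROOT);
        tags.add(key);
        Set<String> allowed = attributes.computeIfAbsent(key, k -> new HashSet<>());
        for (String name : names) {
            allowed.add(name.toLowerCase(Locale.ROOT));
        }
        return this;
    }

    boolean isSafeTag(String tag) {
        return tags.contains(tag.toLowerCase(Locale.ROOT));
    }

    boolean isSafeAttribute(String tag, String attribute) {
        Set<String> allowed = attributes.get(tag.toLowerCase(Locale.ROOT));
        return allowed != null && allowed.contains(attribute.toLowerCase(Locale.ROOT));
    }
}

// Hypothetical parser-side helper: the policy lives entirely in the safelist.
class SafelistElementFilter {
    private final SimpleSafelist safelist;

    SafelistElementFilter(SimpleSafelist safelist) {
        this.safelist = safelist;
    }

    boolean accepts(String tag) {
        return safelist.isSafeTag(tag);
    }

    // Returns the attributes the safelist allows for this tag, in input order.
    // The result is empty when the tag itself is not allowed.
    Map<String, String> filter(String tag, Map<String, String> attrs) {
        Map<String, String> kept = new LinkedHashMap<>();
        if (!safelist.isSafeTag(tag)) {
            return kept;
        }
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            if (safelist.isSafeAttribute(tag, e.getKey())) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }
}
```

Because the policy is just data held by an ordinary object, the same instance could be handed to other components or checked in a unit test without involving the parser at all.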
Preliminary implementation
I have a preliminary implementation, but it's not yet ready for a POC pull request.
HtmlParserWithSafelist
This parser is a heavily stripped-down copy of the DefaultHtmlParser. All of the existing static elements have been removed and replaced with the appropriate calls to Safelist methods.
This parser also includes a few proposed improvements:
- it captures 'unsafe' elements and attributes. This allows developers to fine-tune their own Safelist implementations.
- it adds optional support for the 'data-*' wildcard. This is an HTML5 standard intended to eliminate ad hoc custom attributes.
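The two improvements above could look roughly like the following. This is a minimal sketch under stated assumptions, not the preliminary implementation: `SafelistAuditor` and its methods are hypothetical names, the allowed sets are plain Java collections rather than a jsoup Safelist, and attributes are checked globally rather than per tag to keep the example short. A 'data-*' name is accepted only when the flag is set and at least one character follows the hyphen.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: checks names against allowed sets and records
// everything it rejects, so a developer can review what was dropped.
class SafelistAuditor {
    private static final String DATA_PREFIX = "data-";

    private final Set<String> allowedTags;
    private final Set<String> allowedAttributes;
    private final boolean allowDataAttributes;
    private final Set<String> rejectedTags = new TreeSet<>();
    private final Set<String> rejectedAttributes = new TreeSet<>();

    SafelistAuditor(Set<String> allowedTags, Set<String> allowedAttributes,
                    boolean allowDataAttributes) {
        this.allowedTags = lower(allowedTags);
        this.allowedAttributes = lower(allowedAttributes);
        this.allowDataAttributes = allowDataAttributes;
    }

    private static Set<String> lower(Set<String> names) {
        Set<String> out = new HashSet<>();
        for (String name : names) {
            out.add(name.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    boolean checkTag(String tag) {
        String name = tag.toLowerCase(Locale.ROOT);
        boolean safe = allowedTags.contains(name);
        if (!safe) {
            rejectedTags.add(name);
        }
        return safe;
    }

    boolean checkAttribute(String tag, String attribute) {
        String name = attribute.toLowerCase(Locale.ROOT);
        // Optional wildcard: "data-" followed by at least one character.
        boolean dataMatch = allowDataAttributes
                && name.startsWith(DATA_PREFIX)
                && name.length() > DATA_PREFIX.length();
        boolean safe = dataMatch || allowedAttributes.contains(name);
        if (!safe) {
            // Recorded as "tag@attribute" so the report shows where it occurred.
            rejectedAttributes.add(tag.toLowerCase(Locale.ROOT) + "@" + name);
        }
        return safe;
    }

    Set<String> rejectedTags() {
        return Collections.unmodifiableSet(rejectedTags);
    }

    Set<String> rejectedAttributes() {
        return Collections.unmodifiableSet(rejectedAttributes);
    }
}
```

After a parse, the two rejected sets give a sorted summary of what the current policy filtered out, which is the kind of feedback that makes it practical to adjust a safelist.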
DefaultHtmlSafelist
The jsoup Safelist already provides a few reference implementations, but they don't fit our needs. This class adds two more. It also adds support for wildcard attributes beyond the "data-*" mentioned earlier.
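One possible way to generalize wildcard attributes is to treat any pattern that ends in `*` as a prefix match. The sketch below assumes that rule. `WildcardAttributes` is a hypothetical class, and `aria-*` is used purely as an illustration, since the description does not say which wildcards are supported.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical matcher: exact attribute names plus prefix patterns such as
// "data-*" or "aria-*". A trailing '*' is the only wildcard form handled here.
class WildcardAttributes {
    private final Set<String> exact = new HashSet<>();
    private final List<String> prefixes = new ArrayList<>();

    WildcardAttributes add(String... patterns) {
        for (String pattern : patterns) {
            String name = pattern.toLowerCase(Locale.ROOT);
            if (name.endsWith("*")) {
                // Store the part before the '*', e.g. "aria-" for "aria-*".
                prefixes.add(name.substring(0, name.length() - 1));
            } else {
                exact.add(name);
            }
        }
        return this;
    }

    boolean matches(String attribute) {
        String name = attribute.toLowerCase(Locale.ROOT);
        if (exact.contains(name)) {
            return true;
        }
        for (String prefix : prefixes) {
            // Require at least one character after the prefix, so a bare
            // "data-" does not match "data-*".
            if (name.length() > prefix.length() && name.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

With this rule, a matcher built from `add("href", "data-*", "aria-*")` accepts `aria-label` and `data-id` but not `onclick`. Note that a lone `*` pattern would accept every non-empty attribute name, so a real implementation might want to reject it.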
DEFAULT
This implementation reproduces the existing behavior with a few improvements:
- <source> (since it contains an external reference)
- <form> (since "action" can be an embedded script)
- <button> and <input> since they have a "formaction" attribute
- all global attributes
- all form_control, mouse, keyboard, and clipboard events
- <body> and all window events
- <head> (just for completeness with <body>)
IIRC, a few of the existing elements gained new attributes with HTML5, but I haven't addressed that.
HTML5
This implementation adds many new HTML5 tags, with an emphasis on the tags that provide semantic context, e.g. <section>, <article>, and <time>.