  1. Tika
  2. TIKA-4353

Implement HtmlParserWithSafelist that uses a standard jsoup Safelist for filtering.


Details

    • Improvement
    • Status: Open
    • Minor
    • Resolution: Unresolved

    Description

      The current DefaultHtmlParser uses a hardcoded list of acceptable HTML elements and attributes. A user can easily copy and paste this file to create a custom parser, but it takes some effort to understand how to make the required changes. It is also a one-off effort: the work can't be reused elsewhere.

      Given that there's already a dependency on jsoup, a far better solution is to create a parser that accepts a Safelist instead of using a hardcoded list. This Safelist can be validated and used elsewhere and, perhaps more importantly, it makes the transition from a jsoup-based solution to a Tika-based solution much more transparent.

      NOTE: a Safelist is a POJO and NOT limited to just the jsoup parser.

      Preliminary implementation

      I have a preliminary implementation that's not yet ready for a POC pull request.

      HtmlParserWithSafelist

      This parser is a heavily stripped-down copy of the DefaultHtmlParser. All of the existing static element lists have been removed and replaced with the appropriate calls to Safelist methods.

      This parser also includes a few proposed improvements:

      • it captures 'unsafe' elements and attributes. This allows developers to fine-tune their own Safelist implementations
      • it adds optional support for the 'data-*' wildcard. This is an HTML5 standard intended to replace ad hoc custom attributes
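The two improvements above can be pictured with a minimal, self-contained sketch. It does not use the jsoup or Tika APIs: the class name CapturingAttributeFilter, its methods, and the "tag@attribute" report format are all hypothetical, chosen only to show how rejected names might be recorded and how an opt-in 'data-*' check might sit next to an explicit allow list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names come from Tika or jsoup.
final class CapturingAttributeFilter {
    private final Map<String, Set<String>> allowed;  // lower-case tag -> allowed attribute names
    private final boolean allowDataAttributes;       // opt-in support for "data-*"
    private final List<String> rejected = new ArrayList<>();

    CapturingAttributeFilter(Map<String, Set<String>> allowed, boolean allowDataAttributes) {
        this.allowed = allowed;
        this.allowDataAttributes = allowDataAttributes;
    }

    /** Returns true if the tag is allowed; otherwise records the tag name. */
    boolean acceptTag(String tag) {
        String t = tag.toLowerCase(Locale.ROOT);
        if (allowed.containsKey(t)) {
            return true;
        }
        rejected.add(t);
        return false;
    }

    /** Returns true if the attribute is allowed on the tag; otherwise records "tag@attribute". */
    boolean acceptAttribute(String tag, String attribute) {
        String t = tag.toLowerCase(Locale.ROOT);
        String a = attribute.toLowerCase(Locale.ROOT);
        Set<String> attrs = allowed.get(t);
        // "data-" on its own is not accepted: at least one character must follow the prefix.
        boolean isDataAttribute = a.startsWith("data-") && a.length() > "data-".length();
        boolean ok = attrs != null && (attrs.contains(a) || (allowDataAttributes && isDataAttribute));
        if (!ok) {
            rejected.add(t + "@" + a);
        }
        return ok;
    }

    /** Everything that was filtered out, in the order it was seen. */
    List<String> rejected() {
        return new ArrayList<>(rejected);
    }
}
```

After a parse, a developer could inspect rejected() to decide which names to add to their own Safelist.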

      DefaultHtmlSafelist

      The jsoup Safelist already provides a few reference implementations, but they don't fit our needs. This class adds two more, DEFAULT and HTML5. In addition, it adds support for wildcard attributes beyond the "data-*" wildcard mentioned earlier.
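As a rough illustration of what wildcard attribute support could look like, here is a small stand-alone sketch. It is not the preliminary implementation and does not extend the jsoup Safelist: WildcardAttributeMatcher is a hypothetical helper, and the "aria-*" pattern used in the usage note is only an assumed example of a wildcard other than "data-*".

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of jsoup or Tika.
final class WildcardAttributeMatcher {
    private final Set<String> exactNames = new HashSet<>();
    private final List<String> prefixes = new ArrayList<>();

    /** Registers an exact name such as "title" or a trailing-wildcard pattern such as "aria-*". */
    WildcardAttributeMatcher allow(String pattern) {
        String p = pattern.toLowerCase(Locale.ROOT);
        if (p.endsWith("*")) {
            // Store the part before the '*' and match it as a prefix.
            prefixes.add(p.substring(0, p.length() - 1));
        } else {
            exactNames.add(p);
        }
        return this;
    }

    /** Case-insensitive match against the exact names first, then the wildcard prefixes. */
    boolean matches(String attributeName) {
        String name = attributeName.toLowerCase(Locale.ROOT);
        if (exactNames.contains(name)) {
            return true;
        }
        for (String prefix : prefixes) {
            // Require at least one character after the prefix, so "aria-" alone does not match.
            if (name.length() > prefix.length() && name.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

For example, new WildcardAttributeMatcher().allow("title").allow("aria-*") would accept "title" and "aria-label" but not "onclick". Only trailing wildcards are handled in this sketch.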

      DEFAULT

      This implementation reproduces the existing behavior with a few improvements:

      • <source> (since it contains an external reference)
      • <form> (since "action" can be an embedded script)
      • <button> and <input> (since they have a "formaction" attribute)
      • all global attributes
      • all form_control, mouse, keyboard, and clipboard events
      • <body> and all window events
      • <head> (just for completeness with <body>)

      IIRC HTML5 added a few new attributes to the existing elements, but I haven't addressed that.

      HTML5

      This implementation adds many new HTML5 tags, with an emphasis on the tags that provide semantic context, e.g., <section>, <article>, <time>, etc.

          People

            Unassigned Unassigned
            bgiles Bear R Giles