  1. Tika
  2. TIKA-347

Make HtmlParser customizable through ParseContext

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.6
    • Component/s: parser
    • Labels:
      None

      Description

      In TIKA-304 we added the mapSafeElement() and isDiscardElement() methods to HtmlParser so that subclasses could better customize how incoming HTML elements get mapped to the XHTML output from Tika. This works fairly well, but it requires you either to modify the Tika configuration file or to explicitly inject a custom HtmlParser subclass instance into the CompositeParser instance you're using (AutoDetectParser, etc.).

      Now that we have the ParseContext mechanism available to simplify such customization, it would be nice to allow you to provide a custom "HTML mapper" instance through the parse context and have HtmlParser call that mapper (if available) for the mapSafeElement() and isDiscardElement() operations.
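
      A minimal sketch of the proposed lookup, to make the idea concrete. It does not use Tika's classes: Context, HtmlMapper, DefaultMapper, DivKeepingMapper and handle() are hypothetical stand-ins written only for this illustration, and the default rules (keep p and h1, discard script and style, drop other tags but keep their text) are assumptions, not Tika's actual mapping tables.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class HtmlMapperSketch {

    /** Hypothetical mapper interface, named after the two operations in this issue. */
    public interface HtmlMapper {
        /** Returns the output element name, or null if the element has no safe mapping. */
        String mapSafeElement(String name);

        /** Returns true if the element and all of its content should be dropped. */
        boolean isDiscardElement(String name);
    }

    /** Stand-in for a parse context: objects stored and looked up by their class. */
    public static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        public <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        public <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value != null ? key.cast(value) : defaultValue;
        }
    }

    /** Assumed built-in behavior, used when the context carries no mapper. */
    public static final class DefaultMapper implements HtmlMapper {
        public String mapSafeElement(String name) {
            String lower = name.toLowerCase(Locale.ROOT);
            return (lower.equals("p") || lower.equals("h1")) ? lower : null;
        }

        public boolean isDiscardElement(String name) {
            String lower = name.toLowerCase(Locale.ROOT);
            return lower.equals("script") || lower.equals("style");
        }
    }

    /** Example custom mapper: additionally keeps div, otherwise defers to the default. */
    public static final class DivKeepingMapper implements HtmlMapper {
        private final HtmlMapper fallback = new DefaultMapper();

        public String mapSafeElement(String name) {
            return name.equalsIgnoreCase("div") ? "div" : fallback.mapSafeElement(name);
        }

        public boolean isDiscardElement(String name) {
            return fallback.isDiscardElement(name);
        }
    }

    /** Describes what a parser would do with one element, consulting the context first. */
    public static String handle(String element, Context context) {
        HtmlMapper mapper = context.get(HtmlMapper.class, new DefaultMapper());
        if (mapper.isDiscardElement(element)) {
            return "discard";
        }
        String mapped = mapper.mapSafeElement(element);
        return mapped != null ? "emit <" + mapped + ">" : "skip tag, keep text";
    }

    public static void main(String[] args) {
        Context context = new Context();
        System.out.println(handle("DIV", context));    // no mapper set, default rules apply
        context.set(HtmlMapper.class, new DivKeepingMapper());
        System.out.println(handle("DIV", context));    // custom mapper keeps the element
        System.out.println(handle("script", context)); // still discarded via the fallback
    }
}
```

      The point of the design is that the caller only touches the context object; neither the configuration file nor the parser instance held by the composite parser has to change.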

            People

            • Assignee: Jukka Zitting (jukkaz)
            • Reporter: Jukka Zitting (jukkaz)