  1. Tika
  2. TIKA-2610

Extend HtmlMapper isDiscardElement method with Attributes parameter

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.17
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

    Description

      Currently, if we want to discard HTML elements based on the value or existence of an attribute, as in this example from one of our projects:

      <div data-meta-no-index>Some content to be ignored by custom search indexer (Tika parser)</div>
      

      we have to implement a custom handler whose logic is very similar to what we have in org.apache.tika.parser.html.HtmlHandler. This could be done much more easily by continuing to use HtmlHandler and setting into the ParseContext an instance of HtmlMapper that overrides a newly added isDiscardElement(String name, Attributes attributes) method.
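A minimal sketch of what such an override might look like, using only the SAX classes that ship with the JDK. The AttributeAwareMapper interface and the NoIndexMapper class are hypothetical stand-ins for the proposed method; they are not part of Tika 1.17, where HtmlMapper only offers isDiscardElement(String name).

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

class DiscardByAttributeSketch {

    /** Hypothetical stand-in for the proposed method; not an existing Tika interface. */
    interface AttributeAwareMapper {
        boolean isDiscardElement(String name, Attributes attributes);
    }

    /** Discards any element that carries data-meta-no-index, whatever its value. */
    static class NoIndexMapper implements AttributeAwareMapper {
        @Override
        public boolean isDiscardElement(String name, Attributes attributes) {
            // Attributes.getIndex(qName) returns -1 when the attribute is absent.
            return attributes != null
                    && attributes.getIndex("data-meta-no-index") >= 0;
        }
    }

    public static void main(String[] args) {
        // <div data-meta-no-index>...</div>
        AttributesImpl marked = new AttributesImpl();
        marked.addAttribute("", "data-meta-no-index", "data-meta-no-index", "CDATA", "");

        // <div class="content">...</div>
        AttributesImpl plain = new AttributesImpl();
        plain.addAttribute("", "class", "class", "CDATA", "content");

        AttributeAwareMapper mapper = new NoIndexMapper();
        System.out.println(mapper.isDiscardElement("div", marked)); // true
        System.out.println(mapper.isDiscardElement("div", plain));  // false
    }
}
```

If the method were added to HtmlMapper itself, a mapper like this would presumably be registered the same way custom mappers are today, via ParseContext.set(HtmlMapper.class, mapper), and HtmlHandler would pass the element's attributes along when deciding whether to discard it.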

          People

            Assignee: Unassigned
            Reporter: Aleksei Udalov (udalovas)
            Votes: 0
            Watchers: 1

            Dates

              Created:
              Updated: