Solr / SOLR-4530

DIH: Provide configuration to use Tika's IdentityHtmlMapper

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.1
    • Fix Version/s: 4.3
    • None

    Description

      When TikaEntityProcessor is used in DIH, the default HTML mapper strips out most of the HTML. That makes sense when the extracted content is only stored as a text blob, but DIH allows more fine-grained content extraction (e.g. with a nested XPathEntityProcessor).

      Recent Tika versions allow setting an alternative HTML mapper implementation that passes all of the HTML through. It would be useful to be able to select that implementation from the DIH configuration.
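
      The difference between the two mappers, and the idea of choosing one from a configuration value, can be sketched as follows. This is a simplified, self-contained model and not Tika's or DIH's actual code: the `Mapper` interface, the whitelist contents, and the `"default"` / `"identity"` configuration values are assumptions made for illustration. In Tika itself the mapper is, to the best of my knowledge, supplied through the parse context (`ParseContext.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE)`).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class HtmlMapperSketch {

    /** Simplified stand-in for an HTML mapper: returns the output element name, or null to drop the tag. */
    public interface Mapper {
        String mapSafeElement(String name);
    }

    // Hypothetical whitelist; the real default mapper's list differs.
    private static final Set<String> SAFE =
            new HashSet<>(Arrays.asList("p", "h1", "h2", "a", "ul", "li"));

    /** Models a restrictive default: only whitelisted elements survive. */
    public static final Mapper DEFAULT = name -> {
        String lower = name.toLowerCase(Locale.ENGLISH);
        return SAFE.contains(lower) ? lower : null;
    };

    /** Models an identity mapper: every element is passed through, lower-cased. */
    public static final Mapper IDENTITY = name -> name.toLowerCase(Locale.ENGLISH);

    /** Picks a mapper from a configuration value (attribute values are assumed, not confirmed). */
    public static Mapper mapperFor(String configValue) {
        if (configValue == null || configValue.equalsIgnoreCase("default")) {
            return DEFAULT;
        }
        if (configValue.equalsIgnoreCase("identity")) {
            return IDENTITY;
        }
        throw new IllegalArgumentException("Unknown mapper: " + configValue);
    }

    /** Returns the elements that the given mapper keeps, in input order. */
    public static List<String> keptElements(Mapper mapper, String... elements) {
        List<String> kept = new ArrayList<>();
        for (String element : elements) {
            String mapped = mapper.mapSafeElement(element);
            if (mapped != null) {
                kept.add(mapped);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] input = {"DIV", "p", "TABLE", "h1"};
        System.out.println("default:  " + keptElements(mapperFor("default"), input));
        System.out.println("identity: " + keptElements(mapperFor("identity"), input));
    }
}
```

      With the identity-style mapper, structural elements such as `div` and `table` remain available to a nested XPath-based processor, which is the use case described above.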

      Attachments

        1. SOLR-4530.patch
          5 kB
          Alexandre Rafalovitch

        Activity

          People

            Assignee: Shalin Shekhar Mangar (shalin)
            Reporter: Alexandre Rafalovitch (arafalov)
            Votes: 0
            Watchers: 2