Solr / SOLR-4530

DIH: Provide configuration to use Tika's IdentityHtmlMapper

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.1
    • Fix Version/s: 4.3
    • Labels: None

      Description

      When TikaEntityProcessor is used in DIH, the default HTML mapper strips out most of the HTML. That makes sense when the goal is just to store the extracted content as a text blob, but DIH allows more fine-grained content extraction (e.g. with a nested XPathEntityProcessor), which needs the markup to survive.

      Recent Tika versions allow setting an alternative HTML mapper implementation (IdentityHtmlMapper) that passes all of the HTML through. It would be useful to be able to select that implementation from the DIH configuration.
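
      For illustration, a minimal sketch of what such a configuration could look like. It assumes the option is exposed as an htmlMapper attribute on the TikaEntityProcessor entity, accepting "default" or "identity"; the issue text itself does not name the attribute, so check it against the attached patch. The file path, entity names, field names and the forEach/xpath expressions are hypothetical.

```xml
<dataConfig>
  <!-- Binary source for Tika, plus a field reader so the nested entity
       can parse the XHTML that Tika emits. -->
  <dataSource type="BinFileDataSource" name="bin"/>
  <dataSource type="FieldReaderDataSource" name="fld"/>
  <document>
    <!-- htmlMapper="identity" is the ASSUMED attribute name/value for
         selecting Tika's IdentityHtmlMapper; leaving it out (or using
         "default") would keep the HTML-stripping behaviour. -->
    <entity name="tika" processor="TikaEntityProcessor"
            dataSource="bin" url="/path/to/page.html"
            format="xml" htmlMapper="identity">
      <!-- With format="xml", the "text" column holds the XHTML output. -->
      <field column="text" name="raw_html"/>
      <!-- Hypothetical nested extraction over the preserved markup. -->
      <entity name="headings" processor="XPathEntityProcessor"
              dataSource="fld" dataField="tika.text"
              forEach="/html/body/h1">
        <field column="heading" xpath="/html/body/h1"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

      With the default mapper, elements the nested entity depends on may be dropped before XPathEntityProcessor ever sees them, which is the limitation this issue describes.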

        Attachments

        1. SOLR-4530.patch
          5 kB
          Alexandre Rafalovitch

            People

            • Assignee: Shalin Shekhar Mangar (shalinmangar)
            • Reporter: Alexandre Rafalovitch (arafalov)