Solr / SOLR-8166

Introduce possibility to configure ParseContext in ExtractingRequestHandler/ExtractingDocumentLoader

    Details

      Description

      Currently there is no way to pass additional configuration when extracting documents with ExtractingRequestHandler/ExtractingDocumentLoader.

      For example, I need to put an org.apache.tika.parser.pdf.PDFParserConfig with "extractInlineImages" set to true into the ParseContext to trigger extraction and OCR of images embedded in PDF files.

      It would be nice to be able to configure the created ParseContext through an XML config file, as TikaConfig does.

      I would suggest the following:

      solrconfig.xml:
      <requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
        <str name="parseContext.config">parseContext.config</str>
      </requestHandler>

      parseContext.config:

      <entries>
        <entry class="org.apache.tika.parser.pdf.PDFParserConfig" value="org.apache.tika.parser.pdf.PDFParserConfig">
          <property name="extractInlineImages" value="true"/>
        </entry>
      </entries>
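
To illustrate how a file in the proposed format might be applied, here is a minimal sketch using only the JDK. It is not the attached patch and not Solr's or Tika's actual implementation. The class ParseContextConfigSketch, its load method, and the DemoPdfConfig bean are hypothetical names introduced for illustration; DemoPdfConfig stands in for org.apache.tika.parser.pdf.PDFParserConfig, and a plain Map stands in for Tika's ParseContext. The sketch assumes each <entry> names a key class and a value class with a public no-argument constructor, and that each <property> maps to a single-argument setter on the value object.

```java
import java.io.ByteArrayInputStream;
import java.lang.reflect.Method;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class ParseContextConfigSketch {

    /** Hypothetical stand-in for org.apache.tika.parser.pdf.PDFParserConfig. */
    public static class DemoPdfConfig {
        private boolean extractInlineImages;
        public void setExtractInlineImages(boolean v) { extractInlineImages = v; }
        public boolean getExtractInlineImages() { return extractInlineImages; }
    }

    /**
     * Parses an <entries> document and returns a map from key class to a
     * configured instance. The map is a stand-in for a real ParseContext.
     */
    public static Map<Class<?>, Object> load(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<Class<?>, Object> context = new HashMap<>();
        NodeList entries = doc.getElementsByTagName("entry");
        for (int i = 0; i < entries.getLength(); i++) {
            Element entry = (Element) entries.item(i);
            // "class" is the lookup key, "value" is the implementation to instantiate.
            Class<?> key = Class.forName(entry.getAttribute("class"));
            Object value = Class.forName(entry.getAttribute("value"))
                    .getDeclaredConstructor().newInstance();
            NodeList props = entry.getElementsByTagName("property");
            for (int j = 0; j < props.getLength(); j++) {
                Element p = (Element) props.item(j);
                applyProperty(value, p.getAttribute("name"), p.getAttribute("value"));
            }
            // cast() throws ClassCastException if value is not an instance of key.
            context.put(key, key.cast(value));
        }
        return context;
    }

    /** Calls the public one-argument setter that matches the property name. */
    static void applyProperty(Object bean, String name, String raw) throws Exception {
        String setter = "set" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
        for (Method m : bean.getClass().getMethods()) {
            if (m.getName().equals(setter) && m.getParameterCount() == 1) {
                m.invoke(bean, convert(raw, m.getParameterTypes()[0]));
                return;
            }
        }
        throw new IllegalArgumentException("No setter " + setter + " on " + bean.getClass());
    }

    /** Converts the attribute text to the setter's parameter type (a few types only). */
    static Object convert(String raw, Class<?> type) {
        if (type == boolean.class || type == Boolean.class) return Boolean.valueOf(raw);
        if (type == int.class || type == Integer.class) return Integer.valueOf(raw);
        if (type == String.class) return raw;
        throw new IllegalArgumentException("Unsupported property type: " + type);
    }

    public static void main(String[] args) throws Exception {
        String cls = DemoPdfConfig.class.getName();
        String xml = "<entries>"
                + "<entry class=\"" + cls + "\" value=\"" + cls + "\">"
                + "<property name=\"extractInlineImages\" value=\"true\"/>"
                + "</entry></entries>";
        DemoPdfConfig cfg = (DemoPdfConfig) load(xml).get(DemoPdfConfig.class);
        System.out.println("extractInlineImages = " + cfg.getExtractInlineImages());
    }
}
```

With Tika on the classpath, the map would be replaced by a real ParseContext and each instance registered under its key class, which is the same effect as configuring a PDFParserConfig in code and setting it on the context by hand.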

        Attachments

        1. SOLR-8166.patch
          19 kB
          Uwe Schindler
        2. SOLR-8166.patch
          19 kB
          Uwe Schindler

            People

            • Assignee:
              thetaphi Uwe Schindler
              Reporter:
              abinet@gmail.com Andriy Binetsky