SOLR-8166

Introduce possibility to configure ParseContext in ExtractingRequestHandler/ExtractingDocumentLoader

Details

    Description

      Currently there is no way to pass additional configuration when extracting documents with ExtractingRequestHandler/ExtractingDocumentLoader.

      For example, I need to put an org.apache.tika.parser.pdf.PDFParserConfig with "extractInlineImages" set to true into the ParseContext to trigger extraction and OCR of images embedded in PDFs.

      It would be nice to be able to configure the created ParseContext through an XML config file, as TikaConfig does.

      I would suggest the following:

      solrconfig.xml:
      <requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
        <str name="parseContext.config">parseContext.config</str>
      </requestHandler>

      parseContext.config:

      <entries>
        <entry class="org.apache.tika.parser.pdf.PDFParserConfig" value="org.apache.tika.parser.pdf.PDFParserConfig">
          <property name="extractInlineImages" value="true"/>
        </entry>
      </entries>
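
The proposed file format could be loaded roughly as follows. This is only a minimal sketch of one possible approach, not the code from the attached patch: the class `ParseContextConfigSketch`, its methods `load` and `applyProperty`, and the stand-in bean `DemoPdfConfig` are all hypothetical names, and only the JDK is used so the example stays self-contained. It assumes the `class` attribute is the key type, the `value` attribute is the implementation to instantiate, and each `property` maps to a JavaBean-style setter.

```java
import java.io.ByteArrayInputStream;
import java.lang.reflect.Method;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch: reads an <entries> document and builds the objects
// that would be registered in a parse context, keyed by their class.
class ParseContextConfigSketch {

    // Hypothetical stand-in for org.apache.tika.parser.pdf.PDFParserConfig,
    // so the sketch runs without Tika on the classpath.
    public static class DemoPdfConfig {
        private boolean extractInlineImages;

        public void setExtractInlineImages(boolean value) {
            this.extractInlineImages = value;
        }

        public boolean getExtractInlineImages() {
            return extractInlineImages;
        }
    }

    // Parses the XML and returns key class -> configured instance.
    // In Solr, each pair would presumably be handed to
    // ParseContext.set(Class, Object); that step is assumed, not shown.
    static Map<Class<?>, Object> load(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        Map<Class<?>, Object> context = new LinkedHashMap<>();
        NodeList entries = doc.getDocumentElement().getElementsByTagName("entry");
        for (int i = 0; i < entries.getLength(); i++) {
            Element entry = (Element) entries.item(i);
            Class<?> key = Class.forName(entry.getAttribute("class"));
            Object value = Class.forName(entry.getAttribute("value"))
                    .getDeclaredConstructor()
                    .newInstance();
            if (!key.isInstance(value)) {
                throw new IllegalArgumentException(
                        value.getClass().getName() + " is not a " + key.getName());
            }
            NodeList props = entry.getElementsByTagName("property");
            for (int j = 0; j < props.getLength(); j++) {
                Element prop = (Element) props.item(j);
                applyProperty(value, prop.getAttribute("name"), prop.getAttribute("value"));
            }
            context.put(key, value);
        }
        return context;
    }

    // Finds a one-argument setter named set<Name> and invokes it, converting
    // the raw string for boolean, int, and String parameters only.
    static void applyProperty(Object target, String name, String raw) throws Exception {
        String setter = "set" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
        for (Method m : target.getClass().getMethods()) {
            if (!m.getName().equals(setter) || m.getParameterCount() != 1) {
                continue;
            }
            Class<?> type = m.getParameterTypes()[0];
            Object arg;
            if (type == boolean.class || type == Boolean.class) {
                arg = Boolean.parseBoolean(raw);
            } else if (type == int.class || type == Integer.class) {
                arg = Integer.parseInt(raw);
            } else if (type == String.class) {
                arg = raw;
            } else {
                continue; // unsupported parameter type in this sketch
            }
            m.invoke(target, arg);
            return;
        }
        throw new IllegalArgumentException("No supported setter for property: " + name);
    }
}
```

Passing `load` an `<entries>` document whose `class` and `value` attributes both name `DemoPdfConfig`, with the `extractInlineImages` property set to `true`, yields a one-entry map whose value has the flag enabled. With Tika available, the same reflection step would target the real `PDFParserConfig` setter `setExtractInlineImages(boolean)`.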

      Attachments

        1. SOLR-8166.patch
          19 kB
          Uwe Schindler
        2. SOLR-8166.patch
          19 kB
          Uwe Schindler

          People

            uschindler Uwe Schindler
            abinet@gmail.com Andriy Binetsky
            Votes:
            0
            Watchers:
            3

            Dates

              Created:
              Updated:
              Resolved: