[TIKA-1963] Configuring Parsers: "high degree of control over which parsers are or aren't used" does not work - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.12
Fix Version/s: None
Component/s: config
Labels: None
Environment: Windows, Java version "1.8.0_73", 64-bit

Description

Hi everybody!
I'm trying to whitelist a single MIME type for OCR with the following config:

<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>application/pdf</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <mime>application/pdf</mime>
    </parser>
  </parsers>
</properties>

The idea is to enable the Tesseract parser for PDF only.
Instead, this configuration disables Tesseract completely.
Is this the expected behaviour or a bug?
Thank you!
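
For anyone trying to reproduce this, it may help to see what the config above declares per entry. The following is only a rough sketch: it reads the XML with the JDK's DOM parser, not with Tika's own config loader, so it shows what the file says and not how Tika interprets it. The class name TikaConfigSummary and the helpers summarize and registers are hypothetical names made up for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: summarizes a tika-config style XML without using Tika itself.
class TikaConfigSummary {
    // The configuration from the report, inlined so the sketch is self-contained.
    static final String SAMPLE =
        "<properties><parsers>"
      + "<parser class=\"org.apache.tika.parser.DefaultParser\">"
      + "<mime-exclude>application/pdf</mime-exclude>"
      + "<parser-exclude class=\"org.apache.tika.parser.ocr.TesseractOCRParser\"/>"
      + "</parser>"
      + "<parser class=\"org.apache.tika.parser.pdf.PDFParser\">"
      + "<mime>application/pdf</mime>"
      + "</parser>"
      + "</parsers></properties>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    // One line per <parser> element: its class, then each child element as name=value.
    static List<String> summarize(String xml) throws Exception {
        List<String> lines = new ArrayList<>();
        NodeList parsers = parse(xml).getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            StringBuilder sb = new StringBuilder(parser.getAttribute("class"));
            NodeList children = parser.getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (child.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace and comments
                }
                Element e = (Element) child;
                // <parser-exclude> carries a class attribute; <mime>/<mime-exclude> carry text.
                String value = e.hasAttribute("class")
                    ? e.getAttribute("class")
                    : e.getTextContent().trim();
                sb.append(" | ").append(e.getTagName()).append('=').append(value);
            }
            lines.add(sb.toString());
        }
        return lines;
    }

    // True if some <parser class="..."> element names the given class.
    static boolean registers(String xml, String className) throws Exception {
        NodeList parsers = parse(xml).getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            if (className.equals(((Element) parsers.item(i)).getAttribute("class"))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        summarize(SAMPLE).forEach(System.out::println);
        System.out.println("Tesseract registered as a parser entry: "
            + registers(SAMPLE, "org.apache.tika.parser.ocr.TesseractOCRParser"));
    }
}
```

Read this way, the file names TesseractOCRParser only inside a parser-exclude and never as a parser entry of its own, which at least matches the observed result that Tesseract never runs. Whether Tika is supposed to pick it up again through PDFParser is exactly the open question here, and this sketch does not answer it.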

People

Assignee: Unassigned
Reporter: Konstantin Avdeev
Votes: 0
Watchers: 3

Dates

Created: 29/Apr/16 06:54
Updated: 01/May/16 07:59