Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Works for Me
Affects Version/s: 1.24.1
Fix Version/s: None
Component/s: None
Labels: None
Description
I'm using Tika 1.24.1 together with Tesseract from the Docker image apache/tika:1.24-full.
The header X-Tika-PDFocrStrategy: OCR_AND_TEXT causes the issue: the output from PDF processing is duplicated.
The output from the attached PDF file is:
There is some text [image: image0.jpg] There is some textT here is an image!!
The curl command to reproduce:
curl -H "X-Tika-PDFextractInlineImages: true" -H "X-Tika-PDFocrStrategy: OCR_AND_TEXT" -T text_and_image.pdf http://localhost:9998/tika
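To make the duplication easier to confirm, the following is a rough sketch that counts how often a phrase appears in the extracted text. The helper names `count_phrase` and `extract_with_strategy` are hypothetical and not part of Tika. The sketch assumes the server URL and file name from the curl command above, and that `no_ocr` is accepted as an alternative strategy value for comparison.

```shell
#!/bin/sh
# Hypothetical helper: count how many times a fixed phrase occurs
# in the text read from stdin.
count_phrase() {
  grep -o -F -- "$1" | wc -l | tr -d ' '
}

# Hypothetical helper: send the PDF to tika-server with a given OCR strategy.
# Assumes the server from the report is listening on localhost:9998.
extract_with_strategy() {
  curl -s \
    -H "X-Tika-PDFextractInlineImages: true" \
    -H "X-Tika-PDFocrStrategy: $1" \
    -T text_and_image.pdf http://localhost:9998/tika
}

# Output copied from the report, used here so the check runs without a server.
sample='There is some text [image: image0.jpg] There is some textT here is an image!!'
printf '%s\n' "$sample" | count_phrase 'There is some text'   # prints 2
```

Against a live server, something like `extract_with_strategy OCR_AND_TEXT | count_phrase 'There is some text'` could be compared with the same call using `no_ocr`; a higher count for OCR_AND_TEXT would suggest the text layer and the OCR result are both being emitted.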