[TIKA-3202] Tika duplicates the ocr text - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Works for Me
Affects Version/s: 1.24.1
Fix Version/s: None
Component/s: None
Labels: None

Description

I'm using Tika 1.24.1 together with Tesseract, from the Docker image apache/tika:1.24-full.

The header X-Tika-PDFocrStrategy: OCR_AND_TEXT triggers the issue.

The output from PDF processing is duplicated.
The output from the attached PDF file is:

There is some text 
[image: image0.jpg]

There is some textT
here is an image!!

The curl command to reproduce:

curl -H "X-Tika-PDFextractInlineImages: true" -H "X-Tika-PDFocrStrategy: OCR_AND_TEXT" -T text_and_image.pdf http://localhost:9998/tika
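
For anyone who prefers to reproduce this from a script, the request above can be sketched in Python using only the standard library. This is a rough equivalent, not an official Tika client: the function names `build_tika_request` and `extract_text` are made up for illustration, and the `Content-Type` and `Accept` headers are additions that the curl command does not send explicitly (without them, `urllib` would label the upload as form data).

```python
import urllib.request

# Endpoint taken from the curl command in this report.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(pdf_bytes: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request mirroring the curl command (curl -T uploads via PUT)."""
    headers = {
        # The two headers from the report.
        "X-Tika-PDFextractInlineImages": "true",
        "X-Tika-PDFocrStrategy": "OCR_AND_TEXT",
        # Assumptions, not part of the original curl command.
        "Content-Type": "application/pdf",
        "Accept": "text/plain",
    }
    return urllib.request.Request(url, data=pdf_bytes, headers=headers, method="PUT")


def extract_text(pdf_path: str, url: str = TIKA_URL) -> str:
    """Send the PDF to a running Tika server and return the extracted text."""
    with open(pdf_path, "rb") as f:
        request = build_tika_request(f.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With the Docker container listening on port 9998, calling `print(extract_text("text_and_image.pdf"))` should show the same output as the curl command.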

Attachments

text_and_image.pdf
22/Sep/20 21:24
42 kB
marek kapowicki

People

Assignee: Unassigned

Reporter: marek kapowicki

Votes: 0

Watchers: 2

Dates

Created: 22/Sep/20 21:27

Updated: 23/Sep/20 05:54

Resolved: 23/Sep/20 05:54