[TIKA-1990] Broken .jpg inline image from .pdf files - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.14, 2.0.0
Component/s: parser
Labels: None

Description

Hello,

I am using tika-server-1.13.jar, which I run with "java -jar tika-server-1.13.jar --host=localhost --port=9998". To extract inline images from PDF files, I edited "org/apache/tika/parser/pdf/PDFParser.properties" inside the jar and set "extractInlineImages" to "true". Everything works except for one thing: images with a .jpg extension extracted from .pdf files come out broken. Images with .jpeg and .png extensions are extracted fine, as are .jpg images from .doc, .docx and .rtf files. The problem seems to appear only with .jpg images in .pdf files.

An example PDF document is attached. To extract the images I run "curl -T cv.pdf -H "Accept: application/zip" http://localhost:9998/unpack > cv.zip". Inside cv.zip there is a broken image0.jpg.

By contrast, if I use pdfbox-app-2.0.1.jar and run "java -jar pdfbox-app-2.0.1.jar ExtractImages cv.pdf", I get a correct image, cv-1.jpg.

Why does it work like this?
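
For anyone trying to reproduce this, a quick sanity check on the unpacked archive may help. The sketch below is only an illustration and not part of Tika or PDFBox: it opens a zip such as cv.zip and reports whether each .jpg/.jpeg entry starts with the JPEG start-of-image bytes (FF D8 FF). The function names `looks_like_jpeg` and `check_zip_images` are made up for this example, and a failed check only shows that the entry is not a plain JPEG stream; it does not by itself explain why the extraction went wrong.

```python
import io
import zipfile

# A JPEG stream begins with the SOI marker (FF D8), followed by another marker (FF ..).
JPEG_SOI = b"\xff\xd8\xff"


def looks_like_jpeg(data: bytes) -> bool:
    """Return True if the bytes start with the JPEG start-of-image signature."""
    return data[:3] == JPEG_SOI


def check_zip_images(zip_bytes: bytes) -> dict:
    """Map each .jpg/.jpeg entry name in the zip to whether it looks like a JPEG."""
    results = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if name.lower().endswith((".jpg", ".jpeg")):
                results[name] = looks_like_jpeg(zf.read(name))
    return results


if __name__ == "__main__":
    # Assumes cv.zip was produced by the curl command in the description.
    with open("cv.zip", "rb") as fh:
        for entry, ok in check_zip_images(fh.read()).items():
            print(entry, "looks like JPEG" if ok else "does NOT look like JPEG")
```

Running the same signature check on cv-1.jpg from PDFBox gives a baseline to compare against. The attached files also differ considerably in size (385 kB for image0.jpg versus 15 kB for cv-1.jpg), which suggests the two tools are writing different bytes for the same image.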

Attachments

cv.pdf (157 kB), 28/May/16 11:58, Kukushkin Alexander
cv-1.jpg (15 kB), 28/May/16 11:58, Kukushkin Alexander
image0.jpg (385 kB), 28/May/16 11:58, Kukushkin Alexander

People

Assignee: Tim Allison

Reporter: Kukushkin Alexander

Votes: 0

Watchers: 3

Dates

Created: 28/May/16 11:57

Updated: 12/Apr/21 12:59

Resolved: 31/May/16 14:08