Description
Hello,
I am using tika-server-1.13.jar, which I run with "java -jar tika-server-1.13.jar --host=localhost --port=9998". To be able to extract inline images from PDF files, I edited "org/apache/tika/parser/pdf/PDFParser.properties" inside the jar and set "extractInlineImages" to "true". Everything works perfectly except for one thing: images with a .jpg extension that are extracted from .pdf files come out broken. Images with a .jpeg or .png extension are extracted fine, and .jpg images from .doc, .docx and .rtf files are also extracted fine. The problem seems to occur only with .jpg images in .pdf files.
An example PDF document (cv.pdf) is attached. To extract the images I run "curl -T cv.pdf -H "Accept: application/zip" http://localhost:9998/unpack > cv.zip". The resulting cv.zip contains a broken image0.jpg.
By contrast, if I use pdfbox-app-2.0.1.jar and run "java -jar pdfbox-app-2.0.1.jar ExtractImages cv.pdf", I get a correct image, cv-1.jpg.
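One rough way to characterize how image0.jpg is broken is to check whether each extracted file is framed like a JPEG, that is, whether it starts with the SOI marker (FF D8) and ends with the EOI marker (FF D9). The sketch below is only an illustration: is_jpeg is a hypothetical helper name, not part of Tika or PDFBox, and the check is a heuristic, since some valid JPEG files carry trailing bytes after EOI.

```shell
# Hypothetical helper (not part of Tika or PDFBox): succeeds if the file
# starts with the JPEG SOI marker (ff d8) and ends with the EOI marker (ff d9).
is_jpeg() {
  # Dump the first and last two bytes as hex, stripping spaces and newlines.
  first=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
  last=$(tail -c 2 "$1" | od -An -tx1 | tr -d ' \n')
  [ "$first" = "ffd8" ] && [ "$last" = "ffd9" ]
}
```

It could be applied to both outputs, for example "is_jpeg image0.jpg && echo ok || echo broken" for the file unzipped from cv.zip and the same for cv-1.jpg, to see whether the two files differ in their framing.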
Why is the image extracted by Tika broken, while PDFBox extracts the same image correctly?