Tika / TIKA-1990

Broken .jpg inline image from .pdf files

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.14, 2.0.0
    • Component/s: parser
    • Labels: None

    Description

      Hello,

      I am using tika-server-1.13.jar, which I run with "java -jar tika-server-1.13.jar --host=localhost --port=9998". To be able to extract inline images from PDF files, I edited "org/apache/tika/parser/pdf/PDFParser.properties" inside the jar and set "extractInlineImages" to "true". Everything works except for one thing: images with a .jpg extension that come from .pdf files are extracted broken. Images with .jpeg and .png extensions are extracted fine, and .jpg images from .doc, .docx and .rtf files are also extracted fine. The problem seems to appear only for .jpg images in .pdf files.
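For anyone reproducing this setup, the configuration change described above can be scripted instead of edited by hand. The following is only a minimal sketch: the helper names `set_property` and `patch_jar` are hypothetical, and it assumes the properties file already contains an `extractInlineImages` line, so it replaces only the value and keeps whatever separator (space, `=` or `:`) that line uses.

```python
import io
import re
import zipfile

# Entry path taken from the report; the helper names below are hypothetical.
PROPS_ENTRY = "org/apache/tika/parser/pdf/PDFParser.properties"


def set_property(text, key, value):
    """Replace the value of `key` in Java-properties text, keeping its separator."""
    pattern = re.compile(
        r"^([ \t]*" + re.escape(key) + r"[ \t]*[=: \t][ \t]*)(.*)$",
        re.MULTILINE,
    )
    new_text, count = pattern.subn(lambda m: m.group(1) + value, text)
    if count == 0:
        # Assumption: the key is already present; this sketch does not add it.
        raise KeyError(key)
    return new_text


def patch_jar(jar_bytes, entry=PROPS_ENTRY, key="extractInlineImages", value="true"):
    """Return a copy of the jar (a zip archive) with one property value changed."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == entry:
                # Java .properties files are conventionally ISO-8859-1 encoded.
                text = data.decode("iso-8859-1")
                data = set_property(text, key, value).encode("iso-8859-1")
            dst.writestr(info, data)
    return out.getvalue()
```

A possible use is to read tika-server-1.13.jar as bytes, pass them to `patch_jar`, and write the result to a new file (the output name is up to you), leaving the original jar untouched.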

      An example PDF document is attached. To extract the images I run 'curl -T cv.pdf -H "Accept: application/zip" http://localhost:9998/unpack > cv.zip'. The resulting cv.zip contains a broken image0.jpg.

      At the same time, if I use pdfbox-app-2.0.1.jar and run "java -jar pdfbox-app-2.0.1.jar ExtractImages cv.pdf", I get a correct image, cv-1.jpg.
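One way to narrow down what "broken" means for image0.jpg is to check whether the .jpg entries in cv.zip actually start with the JPEG start-of-image marker (bytes 0xFF 0xD8). The sketch below is an illustration only; the function name `jpeg_report` is hypothetical, and passing this check does not prove that a file is a fully valid JPEG. It reports each entry's size and whether it carries that marker, which may help explain why image0.jpg (385 kB) is so much larger than cv-1.jpg (15 kB).

```python
import io
import zipfile

JPEG_SOI = b"\xff\xd8"  # JPEG start-of-image marker


def jpeg_report(zip_bytes):
    """Map each .jpg/.jpeg entry name to (size in bytes, starts with SOI marker)."""
    report = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if name.lower().endswith((".jpg", ".jpeg")):
                data = zf.read(name)
                report[name] = (len(data), data.startswith(JPEG_SOI))
    return report
```

For example, `jpeg_report(open("cv.zip", "rb").read())` returns one entry per extracted .jpg file. A `False` flag would suggest that the bytes written under the .jpg name are not JPEG-encoded data.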

      Why does it work like this?

      Attachments

        1. cv.pdf
          157 kB
          Kukushkin Alexander
        2. cv-1.jpg
          15 kB
          Kukushkin Alexander
        3. image0.jpg
          385 kB
          Kukushkin Alexander

          People

            Assignee: tallison Tim Allison
            Reporter: alexkuk Kukushkin Alexander