TIKA-1297: Images not being extracted from PDFs

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5
    • Fix Version/s: 1.6
    • Component/s: parser
    • Labels: None

      Description

      Images embedded within PDF documents are not being extracted by Tika. I have tested this via the command line (where the -z option fails to extract any images), and by inspecting the XHTML version of the PDF produced by Tika (where the image tags are not included in the output).

      The images are extractable by PDFBox, so Tika should be able to extract them and include them in the XHTML output.
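
      The XHTML check described above can be approximated with a small standalone sketch. It does not call Tika itself: it assumes the XHTML has already been captured as a string (for example from the tika-app XHTML output), and the class name XhtmlImageCounter, the method countImages, and the sample markup are all hypothetical names chosen for illustration. A count of zero for a PDF known to contain images would correspond to the behaviour reported here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper (not part of Tika): counts <img> elements in an
// XHTML string that is assumed to have been produced elsewhere.
final class XhtmlImageCounter {

    // Returns the number of img elements, in any namespace, found in the
    // given well-formed XHTML document.
    static int countImages(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // XHTML elements live in a namespace, so parse namespace-aware
            // and match on the local name only.
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            return doc.getElementsByTagNameNS("*", "img").getLength();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        // Illustrative input only; real Tika output may differ in detail.
        String textOnly =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
            + "<div class=\"page\"><p>Some text</p></div>"
            + "</body></html>";
        String withImage =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
            + "<div class=\"page\"><p>Some text</p>"
            + "<img src=\"embedded:image0.jpg\" alt=\"image0.jpg\"/></div>"
            + "</body></html>";

        System.out.println("text only:  " + countImages(textOnly));
        System.out.println("with image: " + countImages(withImage));
    }
}
```

      Parsing the output as XML rather than searching for the text "<img" avoids false matches inside text content, at the cost of requiring well-formed input.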

            People

            • Assignee: Unassigned
            • Reporter: James Baker (james.d.baker)