Tika / TIKA-3332

Embedded files not extracted from PDF files with multilevel EmbeddedFiles tree


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.25
    • Fix Version/s: 1.26
    • Component/s: None
    • Labels: None

      Description

      I have come across some portfolio PDFs that contain many attachments (embedded files), but Tika does not detect or extract them as it does with some other portfolio PDFs. The cause may be that these files have a multilevel EmbeddedFiles name tree that PDFBox does not handle properly.

      Here is the EmbeddedFiles structure of one of the PDF portfolios in question. Notice that the root EmbeddedFiles dictionary has a Kids array that consists only of intermediate dictionaries, with the actual Names arrays one level further down.
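
To make the structure described above concrete, here is a minimal sketch in plain Java of a name tree whose root has only Kids, whose kids are intermediate nodes, and whose Names live on the leaves. The `NameTreeWalk` class, its `Node` type, and the `collect`, `leaf`, and `branch` helpers are hypothetical stand-ins invented for illustration; they are not Tika or PDFBox code, and the depth limit is an assumed safeguard rather than anything taken from the fix. The point is only that a walker has to recurse through every level of Kids instead of checking the root's Names and a single level of children.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NameTreeWalk {
    // Assumed guard against malformed or cyclic trees (not from the ticket).
    static final int MAX_DEPTH = 32;

    /** Hypothetical stand-in for a PDF name tree node (Names and/or Kids). */
    static final class Node {
        final Map<String, String> names = new LinkedHashMap<>();
        final List<Node> kids = new ArrayList<>();
    }

    /** Collects every name/value pair in the tree, at any depth. */
    static Map<String, String> collect(Node root) {
        Map<String, String> out = new LinkedHashMap<>();
        collect(root, out, 0);
        return out;
    }

    private static void collect(Node node, Map<String, String> out, int depth) {
        if (node == null || depth > MAX_DEPTH) {
            return;
        }
        // A node may carry Names directly...
        out.putAll(node.names);
        // ...or delegate to Kids, which may themselves be intermediate nodes.
        for (Node kid : node.kids) {
            collect(kid, out, depth + 1);
        }
    }

    /** Builds a leaf node from alternating name/value arguments. */
    static Node leaf(String... nameValuePairs) {
        Node node = new Node();
        for (int i = 0; i + 1 < nameValuePairs.length; i += 2) {
            node.names.put(nameValuePairs[i], nameValuePairs[i + 1]);
        }
        return node;
    }

    /** Builds an intermediate node that has only Kids and no Names. */
    static Node branch(Node... kids) {
        Node node = new Node();
        for (Node kid : kids) {
            node.kids.add(kid);
        }
        return node;
    }

    public static void main(String[] args) {
        // Root -> intermediate nodes -> leaves, as in the portfolio PDFs above.
        Node root = branch(
                branch(leaf("a.txt", "file-a"), leaf("b.txt", "file-b")),
                branch(leaf("c.txt", "file-c")));
        System.out.println(collect(root).keySet());
    }
}
```

A walker that only inspected `root.names` and the `names` of `root.kids` would find nothing in this tree, which matches the symptom reported here.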

        Attachments

        1. image-2021-03-20-13-36-48-525.png
          51 kB
          Ross Johnson
        2. Screenshot (5).png
          67 kB
          Tim Allison
        3. Screen Shot 2021-03-22 at 10.29.51 AM.png
          244 kB
          Tim Allison

              People

              • Assignee: Unassigned
              • Reporter: Ross Johnson (rossj)
              • Votes: 0
              • Watchers: 4
