Tika / TIKA-1294

Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs


    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6
    • Component/s: None
    • Labels:
      None

      Description

      TIKA-1268 added the capability to extract embedded images as regular embedded resources, a great feature!

      However, for some use cases, it might not be desirable to extract those types of embedded resources. I see two ways of allowing the client to choose whether or not to extract those images:

      1) Set a value in the metadata for the extracted images that identifies them as embedded PDXObjectImages rather than regular image attachments. The client can then choose not to process embedded resources with that metadata value.

      2) Allow the client to set a parameter in the PDFConfig object.

      My initial proposal is to go with option 2, and I'll attach a patch shortly.
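
      To make the trade-off between the two options concrete, here is a minimal sketch. It does not use Tika's API: `SketchPdfConfig`, `EmbeddedResource`, the `sketch:embeddedResourceType` key and the `extractInlineImages` flag (including its default of true) are all hypothetical names invented for illustration. With option 1 the parser still does the extraction work and the client filters afterwards; with option 2 the parser skips the images up front.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only: none of these types are Tika classes. */
public class InlineImageToggleSketch {

    // Hypothetical metadata key and values used to tag resources (option 1).
    static final String TYPE_KEY = "sketch:embeddedResourceType";
    static final String INLINE_IMAGE = "PDXObjectImage";
    static final String ATTACHMENT = "attachment";

    /** Hypothetical stand-in for the config object mentioned in option 2. */
    static final class SketchPdfConfig {
        // Assumed default for this sketch; the real default is not stated in the issue.
        private boolean extractInlineImages = true;

        boolean isExtractInlineImages() { return extractInlineImages; }

        void setExtractInlineImages(boolean extractInlineImages) {
            this.extractInlineImages = extractInlineImages;
        }
    }

    /** A bare-bones embedded resource: a name plus string metadata. */
    static final class EmbeddedResource {
        final String name;
        final Map<String, String> metadata = new HashMap<>();

        EmbeddedResource(String name, String type) {
            this.name = name;
            this.metadata.put(TYPE_KEY, type);
        }

        boolean isInlineImage() {
            return INLINE_IMAGE.equals(metadata.get(TYPE_KEY));
        }
    }

    /** Option 1: the parser emits everything; the client drops tagged inline images. */
    static List<EmbeddedResource> clientSideFilter(List<EmbeddedResource> emitted) {
        List<EmbeddedResource> kept = new ArrayList<>();
        for (EmbeddedResource r : emitted) {
            if (!r.isInlineImage()) {
                kept.add(r);
            }
        }
        return kept;
    }

    /** Option 2: the parser consults the config and never emits inline images when disabled. */
    static List<EmbeddedResource> extract(List<EmbeddedResource> inPdf, SketchPdfConfig config) {
        List<EmbeddedResource> emitted = new ArrayList<>();
        for (EmbeddedResource r : inPdf) {
            if (r.isInlineImage() && !config.isExtractInlineImages()) {
                continue; // skipped before any extraction work is done
            }
            emitted.add(r);
        }
        return emitted;
    }

    /** A made-up PDF containing one regular attachment and two inline images. */
    static List<EmbeddedResource> samplePdf() {
        List<EmbeddedResource> resources = new ArrayList<>();
        resources.add(new EmbeddedResource("notes.docx", ATTACHMENT));
        resources.add(new EmbeddedResource("image0.png", INLINE_IMAGE));
        resources.add(new EmbeddedResource("image1.jpg", INLINE_IMAGE));
        return resources;
    }

    public static void main(String[] args) {
        SketchPdfConfig config = new SketchPdfConfig();
        config.setExtractInlineImages(false);
        for (EmbeddedResource r : extract(samplePdf(), config)) {
            System.out.println(r.name);
        }
    }
}
```

      In this sketch both routes leave the client with the same resources; the difference is that option 2 avoids extracting the images at all, which may matter when a PDF carries many of them.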

        Attachments

        1. TIKA-1294.patch
          9 kB
          Tim Allison
        2. TIKA-1294v1.patch
          15 kB
          Tim Allison

              People

              • Assignee: Tim Allison (tallison)
              • Reporter: Tim Allison (tallison)
              • Votes: 0
              • Watchers: 4
