Tika / TIKA-4057

Skip Thumbnails from Metadata When Scanning PPTX files

Details

    • Type: Wish
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.6.0
    • Fix Version/s: None
    • Component/s: metadata, mime
    • Labels: None

    Description

      I am scanning PPTX files with Tika parser/core version 2.6.0 and using an EmbeddedDocumentExtractor to check whether a PPTX file contains embedded images. The metadata includes a thumbnail with MIME type "image/jpeg". The key and value for the thumbnail are "dc:title" and "/docProps/thumbnail.jpeg" respectively. As a result, even when the PPTX file contains no embedded image, the result always shows "Embedded image present" because of the thumbnail. Is there any way to introduce a parameter in OfficeParserConfig that skips thumbnails while parsing? Thanks.
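
As a possible workaround until such a parameter exists, the check could be done on the caller's side. The following is a minimal sketch, not Tika API: `ThumbnailFilter`, `isThumbnail`, and `isRealEmbeddedImage` are hypothetical names, and the embedded document's metadata is assumed to have been copied into a plain `Map` so the example stays self-contained. It relies only on what is reported above, namely that the thumbnail carries "dc:title" = "/docProps/thumbnail.jpeg" and an image MIME type.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not part of Tika) for telling thumbnails apart from real embedded images. */
final class ThumbnailFilter {

    // Assumption: the package thumbnail is stored under this path prefix,
    // as in the "/docProps/thumbnail.jpeg" value reported in this issue.
    private static final String THUMBNAIL_PREFIX = "/docprops/thumbnail";

    private ThumbnailFilter() {
    }

    /** Returns true if the "dc:title" value looks like the package thumbnail path. */
    static boolean isThumbnail(Map<String, String> embeddedMetadata) {
        String title = embeddedMetadata.get("dc:title");
        return title != null
                && title.toLowerCase(Locale.ROOT).startsWith(THUMBNAIL_PREFIX);
    }

    /** Returns true for image/* content that is not the package thumbnail. */
    static boolean isRealEmbeddedImage(Map<String, String> embeddedMetadata) {
        String contentType = embeddedMetadata.get("Content-Type");
        return contentType != null
                && contentType.startsWith("image/")
                && !isThumbnail(embeddedMetadata);
    }
}
```

In a custom EmbeddedDocumentExtractor, a check like this would presumably be applied to the metadata of each embedded document before reporting "Embedded image present". The path prefix is an assumption based on the single value quoted in this report and may need adjusting for other files.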

          People

            Assignee: Unassigned
            Reporter: Kshitij (23kshitij92)
            Votes: 0
            Watchers: 1
