Tika / TIKA-2623

get embedded resources in PDF/doc files

Details

    • Type: Improvement
    • Status: Reopened
    • Priority: Trivial
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 2.0.0-BETA
    • Component/s: cli, core, parser
    • Labels: None

    Description

      The motivation: support extracting embedded files from PDF, Word doc/docx, and similar documents.

      According to https://stackoverflow.com/questions/20172465/get-embedded-resourses-in-doc-files-using-apache-tika, it is possible to recursively parse a document and save its sub-items (e.g. images) to a folder using FileEmbeddedDocumentExtractor. However, that class is only accessible from within TikaCLI.

      I think it should be visible to applications that use Tika, not only to the CLI.
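
To make the request concrete, here is a minimal sketch of the "save each sub-item in a folder" step that an application would need if it could not reuse the CLI's class. It is not Tika code and not how FileEmbeddedDocumentExtractor is implemented: EmbeddedResourceSaver, its methods, and the naming scheme are hypothetical and use only the Java standard library. The assumption is that an application would call save(...) from its own implementation of org.apache.tika.extractor.EmbeddedDocumentExtractor (in parseEmbedded) and register that implementation on the ParseContext before parsing.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper (not part of Tika): writes embedded resources into one folder. */
public class EmbeddedResourceSaver {
    private final Path outputDir;
    private int counter = 0;

    public EmbeddedResourceSaver(Path outputDir) {
        this.outputDir = outputDir;
    }

    /**
     * Writes one embedded stream into the output folder and returns the file written.
     * resourceName is whatever name the parser reported; it may be null or contain
     * directory parts, so only the last path segment is kept.
     */
    public Path save(InputStream embedded, String resourceName) throws IOException {
        counter++;
        Files.createDirectories(outputDir);
        Path target = outputDir.resolve(safeName(resourceName, counter));
        // The index prefix keeps names unique within one run; files left over
        // from an earlier run with the same name are overwritten.
        Files.copy(embedded, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }

    /** Builds "<index>-<last path segment>", or "<index>-embedded.bin" if no usable name exists. */
    public static String safeName(String resourceName, int index) {
        String fallback = index + "-embedded.bin";
        if (resourceName == null) {
            return fallback;
        }
        // Treat both separators alike so Windows-style names are handled on any OS.
        String normalized = resourceName.replace('\\', '/');
        String last = normalized.substring(normalized.lastIndexOf('/') + 1);
        if (last.isEmpty() || last.equals(".") || last.equals("..")) {
            return fallback;
        }
        return index + "-" + last;
    }
}
```

Keeping only the last path segment is a deliberate choice: embedded names come from the document itself, so a name such as ../../x should not be able to escape the output folder. Characters that are invalid on the local file system are not handled in this sketch.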

            People

              Assignee: Unassigned
              Reporter: Ohad R (ohadr)
              Votes: 0
              Watchers: 4