Tika / TIKA-956

Embedded docs in Word doc are not inlined (text is always added to the end)


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2
    • Fix Version/s: 1.3
    • Component/s: parser
    • Labels: None

    Description

      You can see this with the recently added testWORD_embedded_pdf.doc
      (for TIKA-948): the "Bye Bye" text comes before the "Wer
      wjelrwoierj..." text from the embedded PDF, which is the opposite of
      the order you see when you open the doc with Word.

      Yet, the thumbnail images do seem to be extracted at the right place
      (inlined).

      This is because WordExtractor.java has a separate pass at the end to
      visit the embedded docs.

      Would it be possible to recurse into an embedded doc at the point when
      it's first encountered instead...? Or maybe somehow correlate the
      images with their corresponding attachment (right now they are just
      named image1, image2, ...)?
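
      To make the ordering difference concrete, here is a minimal Java
      sketch contrasting a separate end-of-document pass with recursing at
      the point of encounter. It is not WordExtractor code: the class
      InlineEmbeddedSketch, the Node type, and the sample content
      ("paragraph A", "embedded text", "paragraph B") are all hypothetical
      and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; none of these types are Tika or POI classes. */
public class InlineEmbeddedSketch {

    /** One item in document order: a run of body text, or an embedded doc with its own items. */
    public static final class Node {
        final String text;          // body text; null for an embedded doc
        final List<Node> children;  // contents of an embedded doc; null for body text

        Node(String text, List<Node> children) {
            this.text = text;
            this.children = children;
        }

        static Node text(String s) { return new Node(s, null); }
        static Node embedded(Node... items) { return new Node(null, Arrays.asList(items)); }
        boolean isEmbedded() { return children != null; }
    }

    /** Ordering described in the report: body text first, embedded docs in a final pass. */
    public static List<String> extractDeferred(List<Node> nodes) {
        List<String> out = new ArrayList<>();
        List<Node> pending = new ArrayList<>();
        for (Node n : nodes) {
            if (n.isEmbedded()) {
                pending.add(n);     // remember it, handle it after the body
            } else {
                out.add(n.text);
            }
        }
        for (Node e : pending) {
            out.addAll(extractDeferred(e.children));
        }
        return out;
    }

    /** Ordering requested in the report: recurse as soon as an embedded doc is encountered. */
    public static List<String> extractInline(List<Node> nodes) {
        List<String> out = new ArrayList<>();
        for (Node n : nodes) {
            if (n.isEmbedded()) {
                out.addAll(extractInline(n.children));
            } else {
                out.add(n.text);
            }
        }
        return out;
    }

    /** Illustrative document: text, then an embedded doc, then more text. */
    public static List<Node> sampleDoc() {
        return Arrays.asList(
            Node.text("paragraph A"),
            Node.embedded(Node.text("embedded text")),
            Node.text("paragraph B"));
    }

    public static void main(String[] args) {
        System.out.println("deferred: " + extractDeferred(sampleDoc()));
        System.out.println("inline:   " + extractInline(sampleDoc()));
    }
}
```

      In the deferred variant the embedded text always lands after the
      last body paragraph, which matches the symptom above; the inline
      variant keeps it between the two paragraphs.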

      Attachments

        1. TIKA-956.patch
          2 kB
          Michael McCandless
        2. TIKA-956.patch
          4 kB
          Michael McCandless
        3. TIKA-956.patch
          4 kB
          Michael McCandless


          People

            Assignee: Michael McCandless (mikemccand)
            Reporter: Michael McCandless (mikemccand)
            Votes: 0
            Watchers: 2

