Tika / TIKA-207

MS Word doc containing tracked changes produces incorrect text


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.3
    • Fix Version/s: 0.10
    • Component/s: parser
    • Labels: None
    • Environment: tika-0.3-standalone.jar

    Description

      Spinoff from this discussion:

      http://n2.nabble.com/getting-text-from-MS-Word-docs-with-tracked-changes...-td2463811.html

      When extracting text from an MS Word doc (2003 format) that has
      pending tracked changes that have not yet been accepted, the old
      and new text are run together with no separator.

      For example, I had a doc that contained the text
      "Field.Index.TOKENIZED". I changed TOKENIZED to ANALYZED with track
      changes enabled, and extracting the text (using TikaCLI) produces this:

      Field.Index.TOKENIZEDANALYZED

      So, first, it'd be nice to at least get whitespace inserted between
      old & new text.

      And, second, it'd be great to have an option to control whether it's
      the old or the new text that's indexed (or at least an option to see
      only the "new" text, i.e. the current document).

      From the discussion above, it seems POI may expose fine-grained
      APIs that would allow Tika to do this; Tika just isn't using
      them for MS Word docs yet.
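
The two requests above can be sketched without any Word parsing at all. The following is a minimal illustration, not Tika's or POI's actual code: `Run`, `Mode`, `extract` and `sample` are hypothetical names, and the `deleted` / `inserted` flags stand in for whatever per-run revision marks the parser exposes (POI's HWPF `CharacterRun` appears to offer `isMarkedDeleted()` and `isMarkedInserted()` for this, but that mapping is an assumption here).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: models a Word document as a list of text runs that
// carry tracked-change flags. None of these types are Tika or POI classes.
class TrackedChangesSketch {

    // Which view of the document the caller wants.
    enum Mode { CURRENT, ORIGINAL, BOTH }

    // One run of text plus its (assumed) pending revision marks.
    static final class Run {
        final String text;
        final boolean deleted;   // pending tracked deletion
        final boolean inserted;  // pending tracked insertion

        Run(String text, boolean deleted, boolean inserted) {
            this.text = text;
            this.deleted = deleted;
            this.inserted = inserted;
        }
    }

    static String extract(List<Run> runs, Mode mode) {
        StringBuilder out = new StringBuilder();
        Run previous = null;
        for (Run run : runs) {
            // CURRENT drops text pending deletion; ORIGINAL drops text pending insertion.
            if (mode == Mode.CURRENT && run.deleted) continue;
            if (mode == Mode.ORIGINAL && run.inserted) continue;

            // Where old text directly meets new text, emit a space so the two
            // are not fused. This can only happen in BOTH mode, because the
            // other modes have already skipped one side.
            boolean oldMeetsNew = previous != null
                    && ((previous.deleted && run.inserted)
                        || (previous.inserted && run.deleted));
            if (oldMeetsNew) out.append(' ');

            out.append(run.text);
            previous = run;
        }
        return out.toString();
    }

    // The example from this report: TOKENIZED replaced by ANALYZED.
    static List<Run> sample() {
        return Arrays.asList(
                new Run("Field.Index.", false, false),
                new Run("TOKENIZED", true, false),
                new Run("ANALYZED", false, true));
    }

    public static void main(String[] args) {
        for (Mode mode : Mode.values()) {
            System.out.println(mode + ": " + extract(sample(), mode));
        }
    }
}
```

With the sample runs, `CURRENT` yields `Field.Index.ANALYZED`, `ORIGINAL` yields `Field.Index.TOKENIZED`, and `BOTH` yields `Field.Index.TOKENIZED ANALYZED`, which covers both the whitespace request and the old-versus-new option.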

      Attachments

        1. TIKA-207.patch
          2 kB
          Curt Arnold
        2. TIKA-207.patch
          2 kB
          Curt Arnold


          People

            Assignee: Jukka Zitting (jukkaz)
            Reporter: Michael McCandless (mikemccand)
            Votes: 0
            Watchers: 2
