TIKA-207

MS Word doc containing tracked changes produces incorrect text

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.3
    • Fix Version/s: 0.10
    • Component/s: parser
    • Labels: None
    • Environment: tika-0.3-standalone.jar

    Description

      Spinoff from this discussion:

      http://n2.nabble.com/getting-text-from-MS-Word-docs-with-tracked-changes...-td2463811.html

      When extracting text from an MS Word doc (2003 format) that has
      pending (not yet accepted) tracked changes, the old and new text
      are glommed together.

      E.g., I had a doc that contained the text "Field.Index.TOKENIZED", and I
      changed TOKENIZED to ANALYZED with track changes enabled. When I
      then extract the text (using TikaCLI), it produces this:

      Field.Index.TOKENIZEDANALYZED

      So, first, it'd be nice to at least get whitespace inserted between
      old & new text.

      And, second, it'd be great to have an option to control whether it's
      the old or the new text that's indexed (or at least an option to only
      see the "new" text, i.e., the current document).

      From the discussion above, it seems like POI may expose the
      fine-grained APIs to allow Tika to do this; it's just that Tika's not
      leveraging these APIs for MS Word docs.
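
      A minimal sketch of the two behaviours asked for above, assuming the
      parser can see the document as a sequence of text runs, each flagged
      as deleted ("old") or not. The Run class and both method names are
      hypothetical stand-ins for illustration; they are not POI's or Tika's
      actual API.

```java
import java.util.Arrays;
import java.util.List;

final class TrackedChangeText {

    /** Hypothetical stand-in for a run of text plus its revision flag. */
    static final class Run {
        final String text;
        final boolean deleted; // true if track changes marked this run as deleted ("old" text)

        Run(String text, boolean deleted) {
            this.text = text;
            this.deleted = deleted;
        }
    }

    /** Keeps old and new text, adding a space where a deleted run is directly followed by a kept run. */
    static String withSeparators(List<Run> runs) {
        StringBuilder out = new StringBuilder();
        Run previous = null;
        for (Run run : runs) {
            if (previous != null && previous.deleted && !run.deleted) {
                out.append(' ');
            }
            out.append(run.text);
            previous = run;
        }
        return out.toString();
    }

    /** Returns only the "new" text, i.e. the current document with deleted runs skipped. */
    static String currentTextOnly(List<Run> runs) {
        StringBuilder out = new StringBuilder();
        for (Run run : runs) {
            if (!run.deleted) {
                out.append(run.text);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Run> runs = Arrays.asList(
                new Run("Field.Index.", false),
                new Run("TOKENIZED", true),   // removed with track changes enabled
                new Run("ANALYZED", false));  // inserted replacement

        System.out.println(withSeparators(runs));   // Field.Index.TOKENIZED ANALYZED
        System.out.println(currentTextOnly(runs));  // Field.Index.ANALYZED
    }
}
```

      The second method corresponds to the "only see new text" option; the
      first is the smaller fix that merely keeps old and new text apart.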

      Attachments

        1. TIKA-207.patch
          2 kB
          Curt Arnold
        2. TIKA-207.patch
          2 kB
          Curt Arnold

            People

              Assignee: Jukka Zitting (jukkaz)
              Reporter: Michael McCandless (mikemccand)
              Votes: 0
              Watchers: 2
