  1. Tika
  2. TIKA-2036

Deleted Text from Word File Shows Up in Extract

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.13
    • Fix Version/s: None
    • Component/s: core
    • Environment: Windows, under TikaOnDotNet

    Description

      A .docx file with "track changes" turned on contains deleted text. In this case, there are two nested deletions, one inside the other:

      9. [DELETED:This Agreement shall be governed by and construed in accordance with [INSERTED, THEN DELETED:Arizona] New York law] (Intentionally omitted.)

      The extracted text should contain only "9. (Intentionally omitted.)". However, the actual output is "9. This Agreement shall be governed and construed in accordance with New York law." So the extractor recognizes "Arizona" as deleted, but not the rest of the enclosing deletion.

      Edit: this is worse than I originally thought. ALL deleted text shows up in the text extracted from other Word documents as well. I saw this reported in 2011, and there was supposedly a patch, but apparently it doesn't work, or something else has changed since then. Is there an option somewhere that excludes deleted text in general?
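
      For reference, the expected behavior can be sketched outside Tika. In WordprocessingML, a tracked deletion is a w:del element wrapping runs whose text is stored in w:delText, and an "inserted, then deleted" span is a w:del nested inside a w:ins. The following is a minimal sketch, not Tika's implementation: it assumes word/document.xml has already been unzipped into a string, and the class name DeletedTextFilter and the method visibleText are hypothetical names chosen for illustration. It collects only w:t text that is not inside any w:del.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, for illustration only; this is not part of Tika.
final class DeletedTextFilter {
    private static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of all w:t elements that are not inside a w:del.
    static String visibleText(String documentXml) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            XMLStreamReader reader =
                factory.createXMLStreamReader(new StringReader(documentXml));
            StringBuilder out = new StringBuilder();
            int delDepth = 0;       // > 0 while inside one or more w:del elements
            boolean inText = false; // true while inside a w:t element
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && W_NS.equals(reader.getNamespaceURI())) {
                    String name = reader.getLocalName();
                    if (name.equals("del")) {
                        delDepth++;
                    } else if (name.equals("t")) {
                        inText = true;
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && W_NS.equals(reader.getNamespaceURI())) {
                    String name = reader.getLocalName();
                    if (name.equals("del")) {
                        delDepth--;
                    } else if (name.equals("t")) {
                        inText = false;
                    }
                } else if (event == XMLStreamConstants.CHARACTERS
                        && inText && delDepth == 0) {
                    // w:delText is never collected because only w:t sets inText.
                    out.append(reader.getText());
                }
            }
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed document XML", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written fragment modelled on the clause described above.
        String xml =
            "<w:document xmlns:w=\"" + W_NS + "\"><w:body><w:p>"
            + "<w:r><w:t xml:space=\"preserve\">9. </w:t></w:r>"
            + "<w:del><w:r><w:delText>This Agreement shall be governed by and "
            + "construed in accordance with </w:delText></w:r></w:del>"
            + "<w:ins><w:del><w:r><w:delText>Arizona</w:delText></w:r></w:del></w:ins>"
            + "<w:del><w:r><w:delText> New York law</w:delText></w:r></w:del>"
            + "<w:r><w:t>(Intentionally omitted.)</w:t></w:r>"
            + "</w:p></w:body></w:document>";
        System.out.println(visibleText(xml)); // prints "9. (Intentionally omitted.)"
    }
}
```

      The depth counter is there so that a w:del nested inside a w:ins (the "Arizona" case) is handled the same way as a top-level deletion. The XML fragment is hand-written to mirror the clause in this report; it is not taken from the actual file.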

          People

            Assignee: Unassigned
            Reporter: Steve Gullion (gullbyrd)