[TIKA-207] MS word doc containing tracked changes produces incorrect text - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 0.3
Fix Version/s: 0.10
Component/s: parser
Labels: None
Environment: tika-0.3-standalone.jar

Description

Spinoff from this discussion:

http://n2.nabble.com/getting-text-from-MS-Word-docs-with-tracked-changes...-td2463811.html

When extracting text from an MS Word doc (2003 format) that has
pending, unaccepted tracked changes, the old and new text are
concatenated with no separator.

For example, I had a doc that contained the text "Field.Index.TOKENIZED",
and I changed TOKENIZED to ANALYZED with track changes enabled. When I
then extract text (using TikaCLI), it produces this:

Field.Index.TOKENIZEDANALYZED

So, first, it'd be nice to at least get whitespace inserted between
old & new text.

And, second, it'd be great to have an option to control whether it's
the old or the new text that's indexed (or at least an option to see
only the "new" text, i.e. the current document).

From the discussion above, it seems like POI may expose the
fine-grained APIs to allow Tika to do this; it's just that Tika's not
leveraging these APIs for MS Word docs.
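
The two requests above can be illustrated with a minimal sketch. It is not Tika or POI code: the `Run` class, its `deleted` and `inserted` flags, and the `mode` parameter are all hypothetical names invented for this illustration, assuming only that the underlying parser can tell which runs of text are tracked deletions or insertions.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Run:
    """Hypothetical stand-in for a run of characters in a Word document."""
    text: str
    deleted: bool = False   # run is a tracked deletion (old text)
    inserted: bool = False  # run is a tracked insertion (new text)


def extract_text(runs: Iterable[Run], mode: str = "new") -> str:
    """Join run text according to a hypothetical tracked-changes mode.

    "new"  -> drop deleted runs (the current document)
    "old"  -> drop inserted runs (the document before the changes)
    "both" -> keep everything, with a space between old and new text
    """
    if mode not in ("new", "old", "both"):
        raise ValueError(f"unknown mode: {mode!r}")

    parts: List[str] = []
    prev: Optional[Run] = None  # last run that was kept
    for run in runs:
        if mode == "new" and run.deleted:
            continue
        if mode == "old" and run.inserted:
            continue
        # In "both" mode, separate a deletion from an adjacent insertion
        # (or vice versa) so the old and new text do not run together.
        if (
            mode == "both"
            and prev is not None
            and (prev.deleted or prev.inserted)
            and (run.deleted or run.inserted)
            and (prev.deleted, prev.inserted) != (run.deleted, run.inserted)
        ):
            parts.append(" ")
        parts.append(run.text)
        prev = run
    return "".join(parts)


# The example from this report, modelled as three runs.
runs = [
    Run("Field.Index."),
    Run("TOKENIZED", deleted=True),
    Run("ANALYZED", inserted=True),
]

print(extract_text(runs, "new"))   # Field.Index.ANALYZED
print(extract_text(runs, "old"))   # Field.Index.TOKENIZED
print(extract_text(runs, "both"))  # Field.Index.TOKENIZED ANALYZED
```

Defaulting to "new" in this sketch reflects the second request (seeing only the current document); whether that is the right default for a real parser is a separate design decision.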

Attachments

TIKA-207.patch, 01/Sep/11 19:09, 2 kB, Curt Arnold
TIKA-207.patch, 01/Sep/11 20:09, 2 kB, Curt Arnold

Issue Links

is related to: TIKA-1321 Add experimental SAX/Streaming XWPF/docx extractor (Resolved)

relates to: TIKA-2187 Align default behavior of experimental docx parser with that of doc parser in handling delText (Resolved)

People

Assignee: Jukka Zitting

Reporter: Michael McCandless

Votes: 0

Watchers: 2

Dates

Created: 12/Mar/09 10:38

Updated: 28/Feb/18 18:06

Resolved: 02/Sep/11 10:59