[TIKA-1130] .docx text extract leaves out some portions of text - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.2, 1.3
Fix Version/s: 1.5
Component/s: parser
Labels: None
Environment: OpenJDK x86_64

Description

When parsing a Microsoft Word .docx (application/vnd.openxmlformats-officedocument.wordprocessingml.document), certain portions of text remain unextracted.

I have attached a .docx file to test against. The gray text is what fails to extract, while the darker text extracts fine.

Inspecting the document.xml part inside the .docx zip archive shows that all of the text is present there.
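
The check described above can be reproduced without Tika. What follows is a minimal sketch, not the reporter's actual procedure: it assumes Java 9 or later, uses only the JDK, and treats the class name `DocxRawText` and the method `textRuns` as hypothetical helpers. It opens the .docx as a zip archive, reads `word/document.xml`, and lists the contents of every `w:t` (text run) element, so the result can be compared by hand against the text the parser returns.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika: dumps the raw text runs stored in a .docx.
public class DocxRawText {
    // WordprocessingML main namespace, used by the w:t (text run) elements.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the text of every w:t element in word/document.xml, in document order. */
    public static List<String> textRuns(InputStream docx) throws Exception {
        List<String> runs = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                // Buffer the entry so the XML parser cannot close the zip stream.
                byte[] xml = zip.readAllBytes();
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                Document dom = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
                NodeList nodes = dom.getElementsByTagNameNS(W_NS, "t");
                for (int i = 0; i < nodes.getLength(); i++) {
                    runs.add(nodes.item(i).getTextContent());
                }
                break; // only the main document part is inspected here
            }
        }
        return runs;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java DocxRawText path/to/file.docx
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            textRuns(in).forEach(System.out::println);
        }
    }
}
```

Any run printed here that is missing from the parser output is a candidate for the kind of dropped text this issue describes. The sketch ignores headers, footers, and other parts of the package, so it only covers the main document body.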

Attachments

- Resume 6.4.13.docx (05/Jun/13 21:58, 125 kB, Daniel Gibby)
- TIKA-1130.patch (25/Jun/13 02:17, 11 kB, Tim Allison)
- TIKA-1130.patch (25/Jun/13 23:49, 11 kB, Tim Allison)
- tee internal resme.docx (10/Jul/13 15:57, 39 kB, Daniel Gibby)
- OwenResume.docx (10/Jul/13 16:00, 45 kB, Daniel Gibby)

People

Assignee: Unassigned
Reporter: Daniel Gibby
Votes: 0
Watchers: 4

Dates

Created: 05/Jun/13 21:56
Updated: 25/Mar/14 16:21
Resolved: 10/Jul/13 17:11