[TIKA-2459] Missing text in .doc file (but can be extracted by POI) - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.16
Fix Version/s: 1.17
Component/s: None
Labels: None
Environment: Windows and Linux

Description

I've got a document whose text can be fully extracted via org.apache.poi.hwpf.converter.WordToTextConverter, but Tika extracts only part of it. The 'Paragraph one' paragraph is present in POI's output but missing from Tika's output.

Tika's output:

Something
One:
Else
Two:
Here
Three:
Four

Paragraph two
Paragraph three
Paragraph four
cc: Somebody
     Somebody else
Something here too

POI's output:

Something
One:    Else
Two:    Here
Three:  Four

Paragraph one

Paragraph two

Paragraph three

Paragraph four


cc: Somebody
     Somebody else


Something here too
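
For anyone comparing the two outputs above, here is a minimal sketch of one way to list the text that POI returned and Tika did not. It uses only the Java standard library and works on the two outputs as plain strings. The class name `MissingLines` and its methods are hypothetical helpers, not part of Tika or POI. Whitespace is collapsed before comparing, so layout differences such as "One:    Else" versus "One:" and "Else" on separate lines are not reported as missing. Matching is by substring, so a very short line could be matched inside unrelated text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for this issue; not part of Tika or POI.
class MissingLines {
    // Collapse every run of whitespace (spaces, tabs, newlines) to one space
    // so that layout differences do not count as missing text.
    static String normalize(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    // Return the non-blank lines of `reference` whose normalized text does
    // not occur anywhere in the normalized `candidate`.
    static List<String> missingLines(String reference, String candidate) {
        String haystack = normalize(candidate);
        List<String> missing = new ArrayList<>();
        for (String line : reference.split("\\R")) {
            String needle = normalize(line);
            if (!needle.isEmpty() && !haystack.contains(needle)) {
                missing.add(needle);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // The two outputs quoted in this issue.
        String poi = "Something\nOne:    Else\nTwo:    Here\nThree:  Four\n\n"
                + "Paragraph one\n\nParagraph two\n\nParagraph three\n\n"
                + "Paragraph four\n\n\ncc: Somebody\n     Somebody else\n\n\n"
                + "Something here too\n";
        String tika = "Something\nOne:\nElse\nTwo:\nHere\nThree:\nFour\n\n"
                + "Paragraph two\nParagraph three\nParagraph four\n"
                + "cc: Somebody\n     Somebody else\nSomething here too\n";

        System.out.println(missingLines(poi, tika)); // prints [Paragraph one]
    }
}
```

Run on the outputs quoted in this issue, the only line reported is "Paragraph one", which matches the description.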

Attachments

foo2.doc
05/Sep/17 22:51
25 kB
Dustin Spicuzza

People

Assignee: Unassigned

Reporter: Dustin Spicuzza

Votes: 0

Watchers: 3

Dates

Created: 05/Sep/17 22:51

Updated: 29/Oct/19 15:51

Resolved: 08/Sep/17 16:48