Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
1.16
None
None
Environment: Windows and Linux
Description
I have a document whose text can be extracted with org.apache.poi.hwpf.converter.WordToTextConverter, but which is not fully extracted by Tika. The 'paragraph one' paragraph is present in POI's extraction output but missing from Tika's output.
Tika's output:
Something One: Else Two: Here Three: Four Paragraph two Paragraph three Paragraph four cc: Somebody Somebody else Something here too
POI's output:
Something One: Else Two: Here Three: Four Paragraph one Paragraph two Paragraph three Paragraph four cc: Somebody Somebody else Something here too
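To make the difference between the two outputs explicit, here is a minimal sketch that compares them word by word. It does not call POI or Tika; it only takes the two strings quoted above. The class name ExtractionDiff and the method missingRun are hypothetical helpers, and the sketch assumes the shorter output is the longer one with a single contiguous run of words removed.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of POI or Tika.
class ExtractionDiff {

    /**
     * Returns the single contiguous run of words present in `full` but absent
     * from `partial`, or null if the two texts differ in a more complex way.
     * The common suffix is trimmed first, then the common prefix.
     */
    static String missingRun(String full, String partial) {
        List<String> f = Arrays.asList(full.trim().split("\\s+"));
        List<String> p = Arrays.asList(partial.trim().split("\\s+"));

        // Count matching words from the end of both texts.
        int suffix = 0;
        while (suffix < p.size() && suffix < f.size()
                && p.get(p.size() - 1 - suffix).equals(f.get(f.size() - 1 - suffix))) {
            suffix++;
        }

        // Count matching words from the start, without overlapping the suffix.
        int prefix = 0;
        while (prefix < p.size() - suffix && prefix < f.size() - suffix
                && p.get(prefix).equals(f.get(prefix))) {
            prefix++;
        }

        // If prefix and suffix do not cover all of `partial`, this is not a single gap.
        if (prefix + suffix != p.size()) {
            return null;
        }
        return String.join(" ", f.subList(prefix, f.size() - suffix));
    }

    public static void main(String[] args) {
        String tika = "Something One: Else Two: Here Three: Four Paragraph two "
                + "Paragraph three Paragraph four cc: Somebody Somebody else Something here too";
        String poi = "Something One: Else Two: Here Three: Four Paragraph one Paragraph two "
                + "Paragraph three Paragraph four cc: Somebody Somebody else Something here too";

        System.out.println("Missing from Tika's output: " + missingRun(poi, tika));
    }
}
```

Run against the two outputs above, this reports "Paragraph one" as the only text missing from Tika's output; everything before and after it matches.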