Details
- Bug
- Status: Closed
- Critical
- Resolution: Fixed
- 1.8.6, 2.0.0
Description
When the input document to PDFTextStripper mixes right-to-left and left-to-right languages, the output characters of one language are reversed.
A sample bilingual PDF document is attached.
PDFTextStripper has a variable "isRtlDominant" in its "writePage" function, defined as follows: boolean isRtlDominant = rtlCount > ltrCount;
The class counts the RTL and LTR characters and uses the result to decide whether the whole content should be reversed. This is incorrect: the decision must be made for each word, not for the whole document.
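To illustrate the difference, here is a minimal standalone sketch, not the PDFBox code. The class PerWordDirection and its methods are hypothetical names made up for this example, and it assumes the extracted text arrives in visual order, so RTL words are stored back to front. It compares a single "dominant direction" decision with a per-word decision, using only java.lang.Character. It does not reorder words within an RTL run or handle combining marks, so it is not a full bidirectional algorithm.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of PDFBox.
class PerWordDirection {

    /** True if the character has strong right-to-left directionality. */
    static boolean isRtl(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    /** True if the character has strong left-to-right directionality. */
    static boolean isLtr(char c) {
        return Character.getDirectionality(c) == Character.DIRECTIONALITY_LEFT_TO_RIGHT;
    }

    /** Same comparison as in the report: rtlCount > ltrCount over the given text. */
    static boolean isRtlDominant(String text) {
        int rtlCount = 0;
        int ltrCount = 0;
        for (char c : text.toCharArray()) {
            if (isRtl(c)) {
                rtlCount++;
            } else if (isLtr(c)) {
                ltrCount++;
            }
        }
        return rtlCount > ltrCount;
    }

    /** One decision for the whole text: reverse everything or nothing. */
    static String reverseWholeIfDominant(String text) {
        return isRtlDominant(text) ? new StringBuilder(text).reverse().toString() : text;
    }

    /** One decision per space-separated word: reverse only the RTL words. */
    static String reversePerWord(String text) {
        List<String> words = new ArrayList<>();
        for (String word : text.split(" ", -1)) {
            words.add(isRtlDominant(word) ? new StringBuilder(word).reverse().toString() : word);
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        // Two Latin words plus one Hebrew word stored in visual (reversed) order.
        String mixed = "hello world \u05DD\u05D5\u05DC\u05E9";
        System.out.println(reverseWholeIfDominant(mixed)); // LTR dominates: Hebrew stays reversed
        System.out.println(reversePerWord(mixed));         // only the Hebrew word is flipped
    }
}
```

With an RTL-heavy input such as two Hebrew words followed by "ok", the whole-text variant reverses the Latin word as well, which matches the symptom described above; the per-word variant leaves "ok" intact.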
Attachments
Issue Links
- relates to PDFBOX-3643 Improve text extraction for mixed language documents (Open)