Details
- Type: Bug
- Status: Closed
- Priority: Minor
- Resolution: Fixed
- Version/s: 1.0.0, 1.1.0
Description
Some Arabic PDFs contain compound glyphs for stylistic reasons.
Such glyphs encode two letters: FI, SI, LI, LJ, LM, etc.
Before a line is passed to the bidirectional (bidi) algorithm, all characters have already been sorted into visual order, except for these pairs: each pair is handled as a single unit, so its two letters keep their original (logical) order. The bidi algorithm then puts most characters into the correct order, but reverses the letters within each glyph pair.
To fix this, the output of font.encode() should be examined and reversed on the spot if it contains a pair of Arabic characters. This may require adding a stub method to PDFStreamEngine (in the processEncodedText method) that PDFTextStripper can override (in sort mode only).
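The proposed check can be sketched roughly as follows. This is only an illustration and not PDFBox code: the class ArabicPairFix and its methods are hypothetical names, the strings stand in for what font.encode() returns per glyph, and it assumes that a compound glyph decodes to two or more Arabic-script letters with no combining marks.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper, not part of PDFBox.
final class ArabicPairFix {
    private ArabicPairFix() {}

    /** True if the text has at least two code points and all are Arabic-script letters. */
    static boolean isArabicPair(String text) {
        if (text == null || text.codePointCount(0, text.length()) < 2) {
            return false;
        }
        return text.codePoints().allMatch(cp ->
                Character.isLetter(cp) && UnicodeScript.of(cp) == UnicodeScript.ARABIC);
    }

    /** Reverses the decoded text of a compound glyph; everything else is returned unchanged. */
    static String reverseIfArabicPair(String text) {
        return isArabicPair(text) ? new StringBuilder(text).reverse().toString() : text;
    }

    /** Joins the decoded glyphs of one line (given in visual order), fixing pairs on the spot. */
    static String visualLine(String... decodedGlyphs) {
        StringBuilder line = new StringBuilder();
        for (String glyph : decodedGlyphs) {
            line.append(reverseIfArabicPair(glyph));
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Glyphs in visual (left-to-right) order: meem, a lam-alef compound glyph, seen.
        String visual = visualLine("\u0645", "\u0644\u0627", "\u0633");
        // Assumption: for a purely right-to-left line, a plain reversal stands in
        // for the visual-to-logical step of the bidi algorithm.
        String logical = new StringBuilder(visual).reverse().toString();
        System.out.println(logical.equals("\u0633\u0644\u0627\u0645"));
    }
}
```

In this sketch the lam-alef pair is flipped before the line is reordered, so the later reversal restores it to logical order along with the rest of the line. Whether such a check belongs in a PDFTextStripper override, as suggested above, is left open here.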
Attachments
Issue Links
- is related to: PDFBOX-4531 Extraction of Arabic PDF has incorrect ordering of normalized ligatures (Closed)