[PDFBOX-684] Incorrect ordering of compound Arabic glyphs - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.0.0, 1.1.0
Fix Version/s: 1.2.0
Component/s: Text extraction
Labels:
- arabic
- bidirectional
- compound
- glyph
- reversed

Description

Some Arabic PDFs contain compound glyphs for stylistic reasons.
Such glyphs encode two letters: FI, SI, LI, LJ, LM, etc.

Before a line is passed to the bidirectional (bidi) algorithm, all of its characters have already been sorted into visual order, except for these pairs. Each pair is handled as a single unit, so its two letters keep their original (logical) order. The bidi algorithm straightens out most characters, but it reverses the letters within each glyph pair.

To fix this, the output of font.encode() should be examined and reversed on the spot if it contains a pair of Arabic characters. This may require adding a stub method to PDFStreamEngine (in the processEncodedText method) that PDFTextStripper can override (in sort mode only).
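
The reversal described above could look roughly like the following sketch. It is not the attached patch and not part of the PDFBox API: the class name CompoundGlyphFix and the method toVisualOrder are hypothetical, and the sketch assumes that the text decoded for a single glyph is available as a Java String and that "Arabic" can be detected through the Unicode directionality of each code point.

```java
// Hypothetical helper, not taken from PDFBox or from the attached patches.
final class CompoundGlyphFix {
    private CompoundGlyphFix() {}

    /** True if the code point has the Unicode bidi class "Arabic letter" (AL). */
    static boolean isArabic(int codePoint) {
        return Character.getDirectionality(codePoint)
                == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    /**
     * Takes the text decoded for one glyph. If it consists of two or more
     * code points and all of them are Arabic letters, the text is reversed so
     * that it matches the visual order of the surrounding characters.
     * Anything else is returned unchanged.
     */
    static String toVisualOrder(String decoded) {
        if (decoded == null || decoded.codePointCount(0, decoded.length()) < 2) {
            return decoded;
        }
        if (!decoded.codePoints().allMatch(CompoundGlyphFix::isArabic)) {
            return decoded;
        }
        // StringBuilder.reverse() keeps surrogate pairs intact.
        return new StringBuilder(decoded).reverse().toString();
    }
}
```

In the arrangement suggested in the description, an overridable hook in the stream engine would presumably call something like toVisualOrder only when sorting by position is enabled, so that unsorted extraction stays unchanged.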

Attachments

- PDFStreamEngine.patch (1 kB, 09/Jun/10 16:39, Yigal Dayan)
- PDFTextStripper.patch (1 kB, 09/Jun/10 16:39, Yigal Dayan)
- zzz.after_fix.txt (17 kB, 08/Apr/10 08:07, Yigal Dayan)
- zzz.before_fix.txt (17 kB, 08/Apr/10 08:07, Yigal Dayan)
- zzz.pdf (247 kB, 08/Apr/10 08:07, Yigal Dayan)

Issue Links

is related to

PDFBOX-4531 Extraction of Arabic PDF has incorrect ordering of normalized ligatures (Closed)

People

Assignee: Jukka Zitting
Reporter: Yigal Dayan
Votes: 0
Watchers: 0

Dates

Created: 08/Apr/10 08:04
Updated: 30/Apr/19 09:05
Resolved: 21/Jun/10 15:38

Time Tracking

Estimated, Remaining, Logged: Not Specified