[PDFBOX-1216] Arabic / Farsi (Persian) text appear disconnected when PDF is converted to image - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.6.0
Fix Version/s: 1.7.0
Component/s: None
Labels:
None

Description

When the PDF file contains Arabic / Farsi text, they appear disconnected when converting pages to image.
Arabic / Farsi letters are connected to each other when written.

Additionally, the error message "Changing font on <?> from <B Lotus> to the default font" appears on the console.
As I tried to debug the issue, it is because PDFBox is looking into the embedded fonts for the "isolated" variation of the character, where the embedded font only includes "connected" variation.
If the embedded font contains the isolated format too, the font is displayed correctly (the warning message doesn't appear for that character), but the character is displayed as the incorrect variation (i.e. isolated instead of connected)

This happens in both 1.6.0 release and the latest trunk code (as of today). I didn't test previous versions.
The difference is that in 1.6.0, the default font (that is substituted as mentioned above) contains the Arabic / Persian characters, but in the trunk, the replaced characters are displayed as squares.

I will attach a PDF as an input for reproducing the issue.

Note: this might be related to issue ~~PDFBOX-1127~~, but that one regards text extraction.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

a.pdf
30/Jan/12 16:09
53 kB
Hamed Iravanchi
PDFBOX1216-a1.png
04/Mar/12 18:43
41 kB
Andreas Lehmkühler

Activity

People

Assignee:: Andreas Lehmkühler

Reporter:: Hamed Iravanchi

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 30/Jan/12 16:05

Updated:: 29/May/12 16:21

Resolved:: 04/Mar/12 18:48