[PDFBOX-1424] Wrong glyph (Persian) is used in extracted text instead of the original glyph (Persian) in PDF file

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.7.1
Fix Version/s: 1.8.0
Component/s: Text extraction
Labels: None
Environment: Windows XP, Java 1.6.0

Description

Hi,
I am very new to PDFBox and I am working with Persian PDF files. When I extract text from Persian PDF files using PDFBox-app, some Persian glyphs, such as م, come out wrong in the extracted text. For example, the word "هستم" is extracted as "هستن", the word "من" is extracted as "هن", and the word "سلام" is extracted as "سالم". I have also tested text extraction on a complete multi-page Persian PDF file, and the result is disappointing: it is totally wrong. Please let me know if I should upload an example Persian PDF file.
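
To pin down exactly which characters are affected, the expected and extracted strings can be compared code point by code point. The following is a minimal sketch and not part of PDFBox: the class name `GlyphDiff` and the method `diff` are hypothetical, it assumes Java 8 or later (newer than the Java 1.6.0 listed in the environment), and it assumes the three word pairs above are encoded with the basic Arabic-block code points written as escapes in the code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a PDFBox class: reports where two strings differ.
public class GlyphDiff {

    // Returns one entry per differing position, e.g. "3: U+0645 -> U+0646".
    static List<String> diff(String expected, String extracted) {
        int[] a = expected.codePoints().toArray();
        int[] b = extracted.codePoints().toArray();
        List<String> out = new ArrayList<>();
        int n = Math.max(a.length, b.length);
        for (int i = 0; i < n; i++) {
            // "(none)" marks a position that exists in only one of the strings.
            String x = i < a.length ? String.format("U+%04X", a[i]) : "(none)";
            String y = i < b.length ? String.format("U+%04X", b[i]) : "(none)";
            if (!x.equals(y)) {
                out.add(i + ": " + x + " -> " + y);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Word pairs from the description, written as escapes (assumed encoding).
        String[][] pairs = {
            {"\u0647\u0633\u062A\u0645", "\u0647\u0633\u062A\u0646"}, // first pair
            {"\u0645\u0646", "\u0647\u0646"},                         // second pair
            {"\u0633\u0644\u0627\u0645", "\u0633\u0627\u0644\u0645"}  // third pair
        };
        for (String[] p : pairs) {
            System.out.println(diff(p[0], p[1]));
        }
    }
}
```

Under those assumptions, the first two pairs each differ in a single position where U+0645 (م) has been replaced by another letter, while the third pair contains the same letters with U+0644 (ل) and U+0627 (ا) in swapped order.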

Attachments

PDFBOX1424-persian_test.html, 01/Nov/12 17:45, 2 kB, Andreas Lehmkühler
persian_test.html, 14/Oct/12 20:39, 1 kB, Ali Majdzadeh Kohbanani
persian_test.pdf, 14/Oct/12 20:39, 28 kB, Ali Majdzadeh Kohbanani

People

Assignee: Andreas Lehmkühler

Reporter: Ali Majdzadeh Kohbanani

Votes: 0

Watchers: 2

Dates

Created: 09/Oct/12 22:48

Updated: 23/Mar/13 12:56

Resolved: 01/Nov/12 17:50