[PDFBOX-939] Lost whitespaces when extracting Arabic text - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.4.0
Fix Version/s: 1.5.0
Component/s: Text extraction
Labels: None

Description

I tried to extract text from an Arabic PDF. The result looks fine at first glance, but on closer inspection some of the whitespace is missing compared to text copied and pasted from the same PDF.

Copy/pasted line from attached PDF:
بعد ما اكتشف حقيقة المثلث الغامض

Extracted text:
بعد ما اكتشف حقيقةالمثلثالغامض
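
The dropped spaces can be located mechanically by walking the two lines in parallel. The following is a minimal sketch and not part of PDFBox: the class `WhitespaceCheck` and the method `missingWhitespace` are hypothetical names introduced here for illustration, and the sketch assumes the two strings differ only in whitespace.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the PDFBox API.
class WhitespaceCheck {

    /**
     * Returns the indices in `expected` of whitespace characters that have
     * no counterpart in `extracted`. Assumes the two strings are otherwise
     * identical and throws if any non-whitespace character differs.
     */
    public static List<Integer> missingWhitespace(String expected, String extracted) {
        List<Integer> missing = new ArrayList<>();
        int j = 0; // current position in the extracted text
        for (int i = 0; i < expected.length(); i++) {
            char e = expected.charAt(i);
            if (j < extracted.length() && extracted.charAt(j) == e) {
                j++; // characters agree, advance both sides
            } else if (Character.isWhitespace(e)) {
                missing.add(i); // whitespace present in the reference only
            } else {
                throw new IllegalArgumentException(
                        "non-whitespace mismatch at index " + i);
            }
        }
        if (j != extracted.length()) {
            throw new IllegalArgumentException("extracted text has trailing characters");
        }
        return missing;
    }

    public static void main(String[] args) {
        // The two lines quoted in this report.
        String expected = "بعد ما اكتشف حقيقة المثلث الغامض";
        String extracted = "بعد ما اكتشف حقيقةالمثلثالغامض";
        System.out.println("Missing whitespace at indices: "
                + missingWhitespace(expected, extracted));
    }
}
```

For the two lines quoted above, the method reports two dropped spaces, matching the two word boundaries that were merged in the extracted text.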

Attachments

- test.pdf, 28 kB, 11/Jan/11 15:12, Anton Stremoukhov
- extracted.txt, 0.4 kB, 11/Jan/11 15:13, Anton Stremoukhov

People

Assignee: Andreas Lehmkühler
Reporter: Anton Stremoukhov
Votes: 0
Watchers: 0

Dates

Created: 11/Jan/11 15:11
Updated: 04/Mar/11 10:29
Resolved: 25/Jan/11 07:18