Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 2.0.14
- Fix Version/s: None
Description
In the first extracted line of the reduced file, the "accent" (somebody please correct me on what that thing is called) is extracted separately from the character it belongs to. On the second line it is in the proper place. Content stream:
BT
1 0 0 1 67.3 756.98 Tm
[ (\000\203\000\227\000q) ] TJ
1 0 0 1 77.5 756.98 Tm
[ (\000\003) ] TJ
1 0 0 1 67.3 730 Tm
[ (\000\203\000\227\000q\000\003) ] TJ
ET
The weird thing is that the "\003" is just a space. So when the space is part of the same string as the other glyphs, the extraction works, and when it is shown by a separate TJ, it doesn't.
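To make the byte content of the three strings easier to compare, here is a minimal sketch that unescapes the literal strings from the content stream above and groups the bytes into 2-byte codes. It assumes the font uses 2-byte character codes, which the \000 prefixes suggest but which is not confirmed here. `PdfStringCodes`, `unescape` and `toTwoByteCodes` are hypothetical helper names, not part of PDFBox, and only the octal escapes that occur in this report are handled.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for inspecting the strings in this report; not a PDFBox class.
class PdfStringCodes {

    // Turns the body of a PDF literal string (without the enclosing parentheses)
    // into raw bytes. Only \ddd octal escapes and plain characters are supported.
    static byte[] unescape(String body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i);
            if (c != '\\') {
                out.write(c);
                i++;
                continue;
            }
            // Read up to three octal digits after the backslash.
            int value = 0;
            int digits = 0;
            int j = i + 1;
            while (j < body.length() && digits < 3
                    && body.charAt(j) >= '0' && body.charAt(j) <= '7') {
                value = value * 8 + (body.charAt(j) - '0');
                j++;
                digits++;
            }
            if (digits == 0) {
                throw new IllegalArgumentException("unsupported escape at index " + i);
            }
            out.write(value & 0xFF);
            i = j;
        }
        return out.toByteArray();
    }

    // Groups the bytes into big-endian 2-byte codes (assumption: 2-byte font encoding).
    static List<Integer> toTwoByteCodes(byte[] bytes) {
        if (bytes.length % 2 != 0) {
            throw new IllegalArgumentException("odd number of bytes: " + bytes.length);
        }
        List<Integer> codes = new ArrayList<>();
        for (int k = 0; k < bytes.length; k += 2) {
            codes.add(((bytes[k] & 0xFF) << 8) | (bytes[k + 1] & 0xFF));
        }
        return codes;
    }

    public static void main(String[] args) {
        // The three strings shown by the TJ operators in the content stream.
        String[] strings = {
            "\\000\\203\\000\\227\\000q",
            "\\000\\003",
            "\\000\\203\\000\\227\\000q\\000\\003"
        };
        for (String s : strings) {
            StringBuilder line = new StringBuilder();
            for (int code : toTwoByteCodes(unescape(s))) {
                line.append(String.format("0x%04X ", code));
            }
            System.out.println(line.toString().trim());
        }
    }
}
```

Under that assumption, the first and third strings start with the same three codes (0x0083, 0x0097, 0x0071), and the only difference is whether code 0x0003 follows in the same string or is shown by its own TJ at x = 77.5.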
Attachments
Issue Links
- is related to
  PDFBOX-4531 Extraction of Arabic PDF has incorrect ordering of normalized ligatures (Closed)