[PDFBOX-5126] Complex Unicode glyphs (surrogate pairs, combining diacritics, zero-width joiners, etc.) in an RTL context get reversed incorrectly on text extraction - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.0.22
Fix Version/s: None
Component/s: Text extraction
Labels: None

Description

The attached PDF contains Old Hungarian runic script, which is written right-to-left and lies outside Unicode's Basic Multilingual Plane, so each character is encoded as a surrogate pair in Java's UTF-16 string representation. When this text is extracted, the two halves of each surrogate pair are swapped because the reversal is done naively at the "char" level, and the output is malformed Unicode.
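
To illustrate the failure mode, here is a minimal, self-contained sketch. It is not PDFBox code; the class and method names are hypothetical. It contrasts a blind "char"-level reversal with a reversal that walks the string by code point, using two Old Hungarian letters (U+10C80 and U+10C81) as input.

```java
// Illustrative sketch only; class and method names are hypothetical, not PDFBox API.
final class SurrogateSafeReverse {

    /** Reverses UTF-16 code units blindly, which also swaps each high/low surrogate pair. */
    static String reverseByChar(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0, j = chars.length - 1; i < j; i++, j--) {
            char tmp = chars[i];
            chars[i] = chars[j];
            chars[j] = tmp;
        }
        return new String(chars);
    }

    /** Reverses by code point, so each surrogate pair keeps its high/low order. */
    static String reverseByCodePoint(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = s.length();
        while (i > 0) {
            int cp = s.codePointBefore(i);
            out.appendCodePoint(cp);
            i -= Character.charCount(cp);
        }
        return out.toString();
    }

    /** Returns true if the string contains no unpaired surrogates. */
    static boolean isWellFormed(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false;
                }
                i++; // skip the low surrogate that completes the pair
            } else if (Character.isLowSurrogate(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String text = new String(Character.toChars(0x10C80))
                + new String(Character.toChars(0x10C81));
        System.out.println(isWellFormed(reverseByChar(text)));
        System.out.println(isWellFormed(reverseByCodePoint(text)));
    }
}
```

Note that `StringBuilder.reverse()` is documented to treat surrogate pairs as single characters, so it avoids this particular problem as well; the explicit loop is shown only to make the code point handling visible.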

Likewise, when combining diacritics or modifiers occur in a right-to-left context, their position relative to the "parent" character is reversed, so they end up attached to the wrong glyph, as demonstrated by the Hebrew sample in the same PDF. I expect the same would happen to emoji sequences joined with a zero-width joiner (ZWJ) in an RTL context.
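
A possible way to keep marks attached to their base character is to reverse by character boundary rather than by code unit or code point. The sketch below is an assumption about how this could be done with the JDK's `java.text.BreakIterator`, not a description of PDFBox internals; the class and method names are hypothetical. How well ZWJ emoji sequences are kept together depends on the JDK version, so that case is not exercised here.

```java
import java.text.BreakIterator;

// Illustrative sketch only; class and method names are hypothetical, not PDFBox API.
final class ClusterAwareReverse {

    /**
     * Reverses the order of user-perceived characters as reported by
     * BreakIterator, leaving the contents of each cluster untouched.
     */
    static String reverseByCluster(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        StringBuilder out = new StringBuilder(s.length());
        int end = it.last();        // start from the final boundary (s.length())
        int start = it.previous();  // boundary before the last cluster, or DONE
        while (start != BreakIterator.DONE) {
            out.append(s, start, end); // copy one whole cluster in its original order
            end = start;
            start = it.previous();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String bet = "\u05D1\u05BC";   // HEBREW LETTER BET + POINT DAGESH
        String shin = "\u05E9\u05C1";  // HEBREW LETTER SHIN + POINT SHIN DOT
        String reversed = reverseByCluster(bet + shin);
        System.out.println(reversed.equals(shin + bet));
    }
}
```

For comparison, a plain `new StringBuilder(bet + shin).reverse()` yields the sequence U+05C1 U+05E9 U+05BC U+05D1, in which each mark now precedes a base letter instead of following its own.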

Attachments

rovasvegyes.pdf (73 kB), attached 08/Mar/21 22:52 by Gábor Stefanik

People

Assignee: Unassigned

Reporter: Gábor Stefanik

Votes: 0

Watchers: 2

Dates

Created: 08/Mar/21 22:57

Updated: 09/Mar/21 12:31