Uploaded image for project: 'PDFBox'
  1. PDFBox
  2. PDFBOX-5126

Complex Unicode glyphs (surrogate pairs, combining diacritics, zero-width join, etc.) in a RTL context get reversed incorrectly on text extraction

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 2.0.22
    • None
    • Text extraction
    • None

    Description

      The attached PDF contains old Hungarian runic script, which is both right-to-left and outside Unicode's Basic Multilingual Plane (and thus encoded as surrogate pairs in Java's internal UTF-16-like representation). When this text is extracted, the surrogate pairs are reversed due to an overly naive use of "char"-level reversal, leading to malformed Unicode output.

      Likewise, when combining diacritics/modifiers occur in a right-to-left context, their position relative to the "parent" character is reversed, and so they appear on the wrong glyph, as demonstrated by the Hebrew sample in the same PDF. I imagine the same thing would also happen to emoji using the "zero-width joiner" in an RTL context.

      Attachments

        1. rovasvegyes.pdf
          73 kB
          Gábor Stefanik

        Activity

          People

            Unassigned Unassigned
            Googulator Gábor Stefanik
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated: