PDFBox / PDFBOX-684

Incorrect ordering of compound Arabic glyphs


    Details

      Description

      Some Arabic PDFs contain compound glyphs for stylistic reasons.
      Such glyphs encode two letters: FI, SI, LI, LJ, LM, etc.

      Before a line is passed to the bidirectional (bidi) algorithm, its characters have already been sorted into visual order, except for these pairs: each compound glyph is handled as a single unit, so its two letters keep their original (logical) order. The bidi algorithm then puts most characters into the correct order but reverses the letters within each glyph pair.

      To fix this, the output of font.encode() should be examined and, if it contains a pair of Arabic characters, reversed on the spot. This may require adding a stub method to PDFStreamEngine (in the processEncodedText method) that PDFTextStripper can override (in sort mode only).
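
The reversal step described above could be sketched roughly as follows. This is only an illustration, not the content of the attached patches: the class name ArabicPairReorder and the method reverseIfArabicCompound are hypothetical, and the stub method in PDFStreamEngine that would call it is not shown. The sketch assumes that a compound glyph decodes to a string of two or more code points, all from the Arabic Unicode blocks, and it does not handle combining marks.

```java
// Hypothetical helper, not part of PDFBox: reverses the decoded text of a
// single glyph when that text consists of two or more Arabic code points.
final class ArabicPairReorder {

    private ArabicPairReorder() {
    }

    // True if the code point lies in one of the main Arabic Unicode blocks.
    static boolean isArabic(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == Character.UnicodeBlock.ARABIC
                || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_A
                || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_B;
    }

    // Returns the input reversed if it is an all-Arabic string of at least
    // two code points; otherwise returns the input unchanged.
    static String reverseIfArabicCompound(String decoded) {
        if (decoded == null) {
            return null;
        }
        if (decoded.codePointCount(0, decoded.length()) < 2) {
            return decoded; // single letter: nothing to reorder
        }
        for (int i = 0; i < decoded.length(); ) {
            int codePoint = decoded.codePointAt(i);
            if (!isArabic(codePoint)) {
                return decoded; // mixed or non-Arabic text is left alone
            }
            i += Character.charCount(codePoint);
        }
        // StringBuilder.reverse() keeps surrogate pairs intact.
        return new StringBuilder(decoded).reverse().toString();
    }
}
```

An override in PDFTextStripper, active in sort mode only, would presumably pass each decoded glyph string through such a helper before the line reaches the bidi algorithm.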

        Attachments

        1. PDFStreamEngine.patch (1 kB, Yigal Dayan)
        2. PDFTextStripper.patch (1 kB, Yigal Dayan)
        3. zzz.after_fix.txt (17 kB, Yigal Dayan)
        4. zzz.before_fix.txt (17 kB, Yigal Dayan)
        5. zzz.pdf (247 kB, Yigal Dayan)

            People

            • Assignee: Jukka Zitting (jukkaz)
            • Reporter: Yigal Dayan (ydayan)

                Time Tracking

                Estimated: 3h (original estimate)
                Remaining: 3h
                Logged: Not Specified
