PDFBOX-684

Incorrect ordering of compound Arabic glyphs

    Details

      Description

      Some Arabic PDFs use compound glyphs for stylistic reasons.
      Each such glyph encodes two letters, e.g. FI, SI, LI, LJ, LM.

      Before a line is sent to the bidirectional (bidi) algorithm, all of its characters have already been sorted into visual order, except for these pairs: each pair is handled as a single unit, so its two letters keep their original (logical) order. The bidi algorithm then straightens out most characters, but ends up reversing the letters within each glyph pair.

      To fix this, the output of font.encode() should be examined and reversed on the spot if it contains a pair of Arabic characters. One possible approach is to add a stub method to PDFStreamEngine (in method processEncodedText) that PDFTextStripper can override, in sort mode only.
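
      The reversal step described above could look roughly like the following. This is only a sketch, not the attached patch: the class name CompoundGlyphFixer and the method reverseIfArabicCompound are hypothetical, and it assumes the decoded text of a single glyph is available as a String. It treats a string as a compound glyph when it has at least two code points and all of them belong to the Arabic script.

```java
// Hypothetical helper, not part of PDFBox: reverses the decoded text of one
// glyph when that text consists of two or more Arabic-script code points.
final class CompoundGlyphFixer {
    private CompoundGlyphFixer() {}

    /** True if the code point is assigned to the Arabic script (this includes presentation forms). */
    static boolean isArabic(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.ARABIC;
    }

    /**
     * Returns the input reversed if it looks like a compound Arabic glyph,
     * otherwise returns it unchanged.
     * Assumption: strings mixing Arabic with other scripts are left alone,
     * and combining marks are not given any special treatment.
     */
    static String reverseIfArabicCompound(String decoded) {
        if (decoded == null || decoded.codePointCount(0, decoded.length()) < 2) {
            return decoded; // single characters need no fix
        }
        boolean allArabic = decoded.codePoints().allMatch(CompoundGlyphFixer::isArabic);
        // StringBuilder.reverse() keeps surrogate pairs intact.
        return allArabic ? new StringBuilder(decoded).reverse().toString() : decoded;
    }
}
```

      A text stripper running in sort mode could pass each decoded glyph through such a method before the line reaches the bidi algorithm; outside sort mode the text would be left untouched.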

        Attachments

        1. PDFTextStripper.patch (1 kB, Yigal Dayan)
        2. PDFStreamEngine.patch (1 kB, Yigal Dayan)
        3. zzz.after_fix.txt (17 kB, Yigal Dayan)
        4. zzz.before_fix.txt (17 kB, Yigal Dayan)
        5. zzz.pdf (247 kB, Yigal Dayan)

              People

              • Assignee: Jukka Zitting (jukkaz)
              • Reporter: Yigal Dayan (ydayan)

                  Time Tracking

                  Original Estimate: 3h
                  Remaining Estimate: 3h
                  Time Spent: Not Specified